Harvesting from the open web

The Joint Committee on Legal Deposit has agreed as follows:

Library harvesting is a process used to collect content and metadata that is available without access restriction on the open web. The deposit libraries will use automated web crawling software wherever possible, especially when collecting for the 'UK web archive', but may also use manual or other methods of downloading content and metadata when necessary.

Web crawling

A 'seed list' of domain names is supplied to the web crawling software as Uniform Resource Locators (URLs). Starting from an initial URL, typically the home page, the software requests a copy of that page, then automatically follows links from the home or root webpage to the next levels down within the same domain, issuing a separate request for each page or file identified. The target website responds automatically, delivering a copy of each page or file.

No impact on target website

The web crawling software is also programmed with 'politeness rules' and parameters designed to ensure that there is no harmful impact upon the performance of the target website. These may include a limit on how many levels are crawled or how much content is requested from an individual website.

In addition, when multiple requests for different pages and files are issued to the same website, the software leaves an interval between each request, to avoid consuming excessive bandwidth or overloading the website's server.
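The interval between requests can be enforced by a simple scheduler like the sketch below. This is an assumption-laden illustration (the class name, interval value, and structure are invented here; production crawlers such as Heritrix implement far richer politeness rules), but it shows the basic mechanism: each request waits until a minimum delay has elapsed since the previous one.

```python
import time

class PoliteScheduler:
    """Enforces a minimum interval between successive requests to one host
    (illustrative sketch only, not the deposit libraries' actual software)."""
    def __init__(self, min_interval_seconds=2.0):
        self.min_interval = min_interval_seconds
        self.last_request_at = None

    def wait_turn(self):
        """Sleep just long enough to honour the minimum interval, then record
        the time of this request."""
        now = time.monotonic()
        if self.last_request_at is not None:
            elapsed = now - self.last_request_at
            if elapsed < self.min_interval:
                time.sleep(self.min_interval - elapsed)
        self.last_request_at = time.monotonic()

# Short interval so the demonstration runs quickly.
scheduler = PoliteScheduler(min_interval_seconds=0.1)
start = time.monotonic()
for _ in range(3):
    scheduler.wait_turn()  # a real crawler would issue its HTTP request here
elapsed = time.monotonic() - start
print(f"3 polite requests took at least {elapsed:.2f}s")
```

Three requests with a 0.1-second interval take at least 0.2 seconds, because the second and third requests each wait their turn.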

Crawling software identification

The deposit libraries' web crawling software uses standard automated protocols to identify itself, informing the publisher's website manager (via a 'user-agent string' recorded in the web server's request logs) each time a page is harvested. The website manager or publisher may choose whether or not to use this information, but is not required to change their 'robots.txt' permissions file or take any other action.
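A user-agent string is simply a request header that the crawler sends and the web server records. The sketch below shows the mechanism with Python's standard library; the string shown is hypothetical, not the deposit libraries' actual identifier.

```python
from urllib.request import Request

# Hypothetical user-agent string for illustration only; a real crawler's
# string typically names the software and gives a contact address.
USER_AGENT = "example-legal-deposit-crawler/1.0 (+https://example.org/contact)"

# The header travels with every request and is logged by the target server.
request = Request("https://example.org/", headers={"User-Agent": USER_AGENT})
print(request.get_header("User-agent"))
```

Server-side, this string appears in the access log alongside each harvested URL, which is how a website manager can identify the crawler's visits.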

Login facilities

Where the web crawling software encounters a login facility, it cannot access any material behind it without the appropriate password or access credentials.
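In HTTP terms, a page behind a login typically answers an unauthenticated request with a 401 (authentication required) or 403 (forbidden) status, or redirects to a login form, so the crawler never receives the protected content. A minimal sketch of the resulting decision (the function name and rule are assumptions made here for illustration):

```python
def should_archive(status_code):
    """Only openly delivered pages (HTTP 200) are harvested; responses that
    demand credentials (401) or refuse access (403) are skipped
    (illustrative rule, not the deposit libraries' actual logic)."""
    return status_code == 200

print(should_archive(200), should_archive(401), should_archive(403))
# → True False False
```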
