
National Library of Scotland

What we archive

In collaboration with the other legal deposit libraries, we archive websites across the UK web domain. This includes all '.uk', '.scot' and other UK domains, plus selected content from domains such as '.com' that meet legal deposit criteria.

The National Library of Scotland prioritises collection of Scottish websites and related web content.

Web archiving is carried out via an annual whole-domain web crawl, supported by selective harvesting throughout the year.

What we collect under legal deposit is determined by the 2013 regulations, which define what constitutes a 'UK' website and exclude certain types of content, for example audio and video. Incidental audio or video content will be archived, but websites that are primarily video or audio, such as YouTube or Spotify, are excluded.

Websites or web content that are restricted by a login page are currently not crawled. In such cases we will contact the publisher separately to agree arrangements for their deposit in accordance with the legislation.

We aim to collect as accurate a representation of the original website as possible and will attempt to harvest all the resources associated with a website, including HTML, images, CSS and associated scripts. However, certain content may not be gathered due to technical limitations, for example streaming media and dynamic or interactive content.

Our crawling protocols

We use web crawlers such as Heritrix to harvest and archive websites and aim to keep the impact on crawled websites to a minimum. The crawler's User Agent will identify itself as 'bl.uk_lddc_bot'.
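As an illustration, a request from the crawler might appear in your server's access log along the following lines. The log format, IP address and timestamp here are purely illustrative and depend on your server configuration; only the 'bl.uk_lddc_bot' identifier comes from the crawler itself.

    203.0.113.10 - - [01/Jan/2024:12:00:00 +0000] "GET /index.html HTTP/1.1" 200 5120 "-" "bl.uk_lddc_bot"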

The crawler will generally follow the robots.txt exclusion protocol. However, in certain circumstances we may choose to overrule robots.txt in order to harvest the content successfully, e.g. to collect JavaScript or CSS excluded by robots.txt.
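For example, a robots.txt such as the following (the directory names are illustrative only) would exclude all crawlers from a site's stylesheet and script directories. Where rules like these would leave the archived copy unrenderable, we may overrule them for those resources.

    # Illustrative robots.txt - path names are examples only
    User-agent: *
    Disallow: /css/
    Disallow: /js/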

Robots META tag exclusions are also normally respected.
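A typical robots META tag, placed in a page's <head>, looks like the following; 'noindex' and 'nofollow' are the standard directive values.

    <meta name="robots" content="noindex, nofollow">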

At present we are not able to interpret 'rel=nofollow' attributes due to technical limitations of the crawler.
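For reference, a 'nofollow' link is marked up as shown below; at present our crawler will still follow links of this kind.

    <a href="https://example.com/page" rel="nofollow">Example link</a>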

What publishers can do

Adding rules for our crawler to your robots.txt can stop further crawling. Similarly, blocking our IP address will stop further access from it. However, if you prevent the crawler from harvesting, you will introduce barriers to our fulfilling our legal obligations and preserving your web content for posterity, in line with legal deposit legislation.
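For example, robots.txt rules such as the following (shown only to illustrate the mechanism) would tell our crawler not to fetch anything from your site.

    User-agent: bl.uk_lddc_bot
    Disallow: /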

Instead, permitting our crawler greater 'allow' access to your site will help us to create a better archived copy of the website, especially if your default policy is quite restrictive.
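For example, if your default policy is restrictive, you could add a more permissive section for our crawler. The paths below are illustrative, and support for the 'Allow' directive varies between crawlers.

    # Default policy for other crawlers
    User-agent: *
    Disallow: /private/

    # Fuller access for the legal deposit crawler
    User-agent: bl.uk_lddc_bot
    Allow: /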

Wider access

Legal deposit regulation limits access to archived content to the reading rooms of the legal deposit libraries. To make the archived content more accessible, we seek permission from website owners to make their archived content publicly available. Open licensing can make this more efficient. Please consider agreeing to public access requests, or publishing under an open licence such as Creative Commons, if this is in line with your publishing model.

You can also make your website more archive-friendly. The Archive Ready site is a useful tool for self-assessment and includes some advice about archivability.

Following best practice in content design, accessibility and web standards also helps to make your website easier to archive and more discoverable by search engines and end users.