National Internet Archives

By Bryan Rasmussen
June 20, 2008

O'Reilly News has an interview up with Gordon Mohr of the Internet Archive.

One interesting thing about the Internet Archive is that it is not the only organization archiving the internet, or portions of it. Of course, it's likely that national security organizations and big companies are archiving for their own purposes, but a lot of national libraries and archives are also archiving their own sections of the internet. For example, www.netpreserve.org is an organization made up of the national libraries of Australia, Canada, Denmark, Finland, France, Iceland, Italy, Norway, and Sweden, The British Library (UK), The Library of Congress (USA), and the Internet Archive itself.

Whether or not all these libraries are currently archiving, they have at least shown a marked interest in doing so through their membership. To find out how much archiving is going on, Denmark sent a survey to 95 different national libraries and received 39 responses (so quite a few more than just the ones in the netpreserve organization). The results: 58% harvest (or plan to harvest) their entire national domain, 93% collect (or plan to collect) selected websites within the national domain, and 42% also harvest (or plan to harvest) websites outside the national domain.

Now, one thing that is sort of scary about this: robots.txt does not have the force of law.
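For context, robots.txt is just a plain text file at the root of a domain, with per-crawler rules that each crawler is free to honour or ignore. Here is a hypothetical sketch of the usual opt-out: ia_archiver is the user-agent name the Internet Archive's crawler has historically announced, so a rule like this asks that crawler (and only that crawler) to stay away from the whole site, with no guarantee that any other crawler will respect it.

    # robots.txt, served at http://www.example.dk/robots.txt (hypothetical domain)
    User-agent: ia_archiver
    Disallow: /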

The Internet Archive obeys robots.txt, of course (lucky for you if you have access to it on your site, otherwise not so much), and it will also agree to remove things at the domain owner's request. Other libraries might not be so accommodating; specifically, the Danish netarchive might not be. Let's look at some of what they say. The following is from the survey report linked above:


Briefly about the Danish experience in general: we have been harvesting the top-level domain .dk since July 2005 when a new legal deposit law went into effect. We aim at preserving the Danish part of the Internet as part of the cultural heritage for future generations to experience; however, alone we will not be able to duplicate in entirety the typical Internet surfing of today unless we can provide access to the other parts of the network in other net archives. In Denmark, the Legal Deposit Act allows us to harvest materials published within the .dk top level domain as well as materials published from other Internet domains which are directed at a public in Denmark, so we harvest the entire national domain as well as selected websites outside the domain. We have found about 30.000 websites outside .dk that are aimed at a Danish audience, primarily sites with Danish text but also sites belonging to Danish companies or institutions, or to individuals (musicians e.g.) who are domiciled in Denmark.

And this is from the Danish FAQ; they don't have it in English, so I will translate as well as I can:

3. Why does netarchive's crawler ignore robots.txt? On a great number of websites, robots.txt steers search engines' crawlers away from material which is absolutely necessary to recreate the experience a person has as a user of the internet in 2006 (note: maybe removing date references from this kind of document would be a good idea, or updating the FAQ). Experience shows that if we collect while respecting robots.txt we end up missing a large amount of vital data, for example newspaper websites, but also 10,000 private websites which are seen as making an important contribution to the Danish cultural heritage. Statistics from July 2005 show that there are at least 35,000 robots.txt files under the .dk domain; netarchive does not have the resources to evaluate each of these. By the same principle, netarchive may also set aside HTML meta-tags.
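To make concrete what "respecting" versus "setting aside" robots.txt means for a harvester, here is a minimal sketch using Python's standard-library robots.txt parser; the rules, URLs, and user-agent name are made-up placeholders, not anything taken from netarchive's actual crawler. A polite crawler checks each URL against the parsed rules before fetching it; a crawler that sets robots.txt aside simply skips that check.

    # Minimal sketch of a crawler that respects robots.txt (placeholder data).
    from urllib import robotparser

    # Hypothetical rules, as they might appear in a site's robots.txt file.
    ROBOTS_RULES = [
        "User-agent: *",
        "Disallow: /private/",
    ]

    rp = robotparser.RobotFileParser()
    rp.parse(ROBOTS_RULES)

    for url in ("http://www.example.dk/news/today.html",
                "http://www.example.dk/private/notes.html"):
        # A polite harvester only fetches URLs for which can_fetch() is True;
        # a crawler that "sets aside" robots.txt would fetch both regardless.
        verdict = "fetch" if rp.can_fetch("ExampleArchiveBot", url) else "skip"
        print(url, "->", verdict)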

So there you go, you just may be getting archived whether you want to or not.

