Gordon Mohr Takes Us Inside the Internet Archive

By James Turner
June 18, 2008

Running time: 00:38:30

James Turner: This is James Turner for O'Reilly Media. I'm talking today with Gordon Mohr, the Chief Technologist for Web Projects at the Internet Archive. Good day, Gordon.

Gordon Mohr: Hello.

JT: So why don't you start out by telling me a little bit about your background and how you came to be at the Internet Archive?

GM: Well, I'm a software developer and have been involved with a number of projects. One of the interesting web projects I was first involved with was an early object-oriented application server, done in Smalltalk back in 1995-1996. Since then I've done a lot in the Java world, including an all-Java instant messaging program called Ding, from a company in Texas called Activerse, up through about 1999, and a collaborative community catalog for digital media, especially media that might be found on peer-to-peer networks, called Bitzi, which is still available at www.bitzi.com. So a lot of my interests were piqued when I heard that the Internet Archive was looking for someone to lead some new Java development projects related to web archiving. It touched on what I had already done in the object-oriented world, what I was interested in with the web, and just the large scale of touching anything and everything that might be out there on the web.

So about five years ago I joined the Internet Archive and we've created a number of Open Source projects since then and yeah; that's [Laughs] what I'll be talking about at the O'Reilly Open Source convention.

JT: Why don't you give us a little bit of history of the Internet Archive itself and where it came from and where it is today and how it's funded and all of that good stuff?

GM: Sure; the Internet Archive is a non-profit library, specifically a non-profit digital library. Historically we've had a special focus on things that were created digitally, and the web is one of our biggest and best-known collections. But we've been increasingly involved in digitizing other media--audio, video, and especially now books. The theory behind the Archive is that, given what technology now makes possible, we can give everyone access to all of human knowledge. There really shouldn't be any barriers; once we work out some of the technical and funding issues, it's within our reach.

Starting in 1996 the Internet Archive, which was founded by Brewster Kahle, started collecting the web, and for a while it was a dark archive. It was crawling the web just like the search engines did, except it wasn't running a current search engine; it was just storing the material for the future. I believe it was in 2000 or 2001 that it was first opened to the public via our well-known Wayback Machine. As it stands now, we have about 11 years of captured websites, starting from late 1996; it's well over 100 billion captured URLs at specific dates and well over a petabyte of compressed data. That is, again, just one of the projects at the Archive; we've also been a major sponsor and operator of book scanning efforts, in partnership in the past with both Yahoo and Microsoft and libraries around the world.

JT: Underneath the covers, how does it actually work? On a day-to-day basis what is it doing and how does it decide what to archive and how often?

GM: Well, that's different for different projects, and I'll speak mostly about the Web Archive, which is where I have the most involvement. The bulk of our content still comes from a sister organization, a company called Alexa that became part of Amazon a few years ago, which does bulk web crawls like a search engine, broad surveys of the web. They then donate that material to us for the historical record. So the first thing that happens is the data comes in the door via a donation from Alexa. After we've indexed it, it becomes part of the public Wayback Machine, which anyone can visit; if they type in a URL they can see the list of dates on which we have a capture of that URL, and they can begin by clicking on one of those dates to browse the web from that point in time. You can even continue to click on each page, and we will show you the next best match that we have for that URL and that time period. That involves a fairly large cluster of machines, each of which has a tiny portion of the index of our current holdings and does a lookup, primarily by the key of URL and date, then finds where we've stored that on our cluster, excerpts it, and proxies it out to the user, all in their web browser, to create the "browsing the past" experience.
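
To make the lookup he describes concrete, here is a minimal Python sketch of an index keyed by canonicalized URL, with captures sorted by date, answering "which capture is closest to this time?" It is purely illustrative, with a toy canonicalization and storage-location scheme; it is not the Archive's actual implementation.

    # Toy sketch of a Wayback-style capture index, not the Archive's real code.
    import bisect
    from collections import defaultdict
    from datetime import datetime

    class CaptureIndex:
        """Canonical URL -> sorted list of (capture time, storage location)."""

        def __init__(self):
            self.captures = defaultdict(list)

        @staticmethod
        def _canonicalize(url):
            # Real URL canonicalization is far more involved (scheme, www,
            # session IDs, and so on); this is just enough for the example.
            return url.lower().rstrip("/")

        def add(self, url, when, location):
            bisect.insort(self.captures[self._canonicalize(url)], (when, location))

        def dates(self, url):
            return [when for when, _ in self.captures[self._canonicalize(url)]]

        def closest(self, url, when):
            entries = self.captures[self._canonicalize(url)]
            if not entries:
                return None
            i = bisect.bisect_left(entries, (when, ""))
            neighbors = entries[max(0, i - 1):i + 1]
            # Pick whichever neighboring capture is nearer in time.
            return min(neighbors, key=lambda e: abs(e[0] - when))

    idx = CaptureIndex()
    idx.add("http://example.com/", datetime(2001, 9, 11, 13, 0), "arc-0042:offset-1234")
    idx.add("http://example.com/", datetime(2001, 10, 15, 2, 30), "arc-0097:offset-88")
    print(idx.closest("http://example.com", datetime(2001, 9, 20)))
    # prints the September 11 capture, the nearest one to the requested date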

JT: So the decision about what and when things are going to be indexed is really in Alexa's hands?

GM: Yes; we have some influence over that, but the bulk of our collection is due to their ongoing survey crawls. We also have a lot of projects nowadays where we've partnered with other organizations, most notably national libraries and national archives, to do targeted collections based on things of interest to them and their users, or things of current-affairs interest. So we work an awful lot with the Library of Congress on continuing crawls of US Government sites, of news sites, of political sites, to understand what's going on in the world. We've done large crawls for the National Library of Australia, the National Library of Ireland, the National Library of France, and the National Library of Italy, and each of these also helps to grow our collection. Those are the crawls that are done inside the Internet Archive on a project-by-project basis; some of them happen on an irregular basis, just when they're asked for, others happen once a year, and some happen as often as once a week or once a month.

JT: Now, are those crawls of their sites, or crawls of things that are of interest to them?

GM: Generally those are crawls of some scope that they've decided is of interest to them. It might be a hand-picked set of sites, anywhere from dozens to hundreds or thousands, which come from their subject experts, especially in the case of libraries, or it could be something even broader, like "we would like everything that's .au," which we do for the National Library of Australia. This fall we will be coming up on our fourth year of doing that for them.

JT: As far as when there's breaking news or something hot: for fun I went and looked back at various sites for September 11, 2001, for example, and the first interesting thing is there are no captures from before the planes hit, which I suspect was just a matter of the timing of which sites got crawled. But for certain sites you've got 12 entries for that day, for example CNN and the Christian Science Monitor, yet there is nothing for, I believe, www.cbsnews.com. So how does that functionally end up happening?

GM: That was both in the years before I joined the Archive and something that was part of Alexa's crawl operations. But I do know that as of the events of that day crawling was sped up, so some of the old cycles that used to be something like every few weeks or every couple of months went to much faster cycles for some hand-picked news sites. So while there were a lot of gaps in the times leading up to that day, a lot of major news sites immediately went onto a faster collection policy as of that moment. But as far as specific sites go, what we have now are the results of the crawling; we don't have full visibility into all the selection choices over the years.

JT: But looking forward, as, say, the Beijing Olympics come up, is it reasonable to expect that sports sites, especially maybe the official Olympic sites or the sites of the network covering the Olympics, would be on a faster cycle during that period?

GM: Absolutely; and there are a couple of different ways in which that could be covered a whole lot more. One is simply that anyone's broad crawl, including Alexa's, is very sensitive to sites that are highly in-linked, so things that are constantly changing and highly in-linked will tend to get more visits. Alexa has also always relied a bit on their toolbar, which shows them a view of what some subset of the net is visiting, and sites that tend to be popular tend to get deeper and more rapid visits. We'll also be covering that just as a matter of current events; I think one or more of our library partners will be interested in doing focused crawls based on Olympic sites they've chosen very specifically, like the official site or major sports news sites. As well, we have a program now called Archive-It, which works with smaller libraries, especially things like state libraries in the United States and university libraries, where they can use a web-based interface, for their own research or archival purposes, to select sets of sites and put together time-based collections for whatever period of time they find interesting. So they can set up daily, weekly, or monthly crawls around these sorts of events; and I would expect, though I would have to look into the details, that one or more of our partners who do that will probably be doing things around the Olympics, around political issues with China, and around anything that's happening in Asia, politically or with the various natural tragedies that occur.

JT: So why don't we go really under the covers; give me a sense for the scope of the server farms running this--what they're running for software, how many disks?

GM: Well, the Archive as a whole has thousands of machines; I don't know exactly the number, but I could check on it and get back to you. For our Special Collections, for our partners, we have about 11 racks of machines, each with 40 1U machines, each of which has four hard drives. And that's just Special Collection data, things that have been requested by partners. The Web Archive as a whole, that historical collection going back to 1996 I mentioned, is somewhere around 1.2 petabytes nowadays, and that's spread out over an even larger number of racks of machines. So we really do think in batches of 40 machines, and rows of those batches; the Archive has a pretty large and always-growing cluster. A new rack is coming in [Laughs] at the very least monthly and sometimes more often.
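
As a rough sanity check on those figures, here is the back-of-the-envelope arithmetic. The per-drive capacity is an assumption (typical of 2008-era drives), not a number Mohr gives.

    # Back-of-the-envelope check of the numbers above; drive size is assumed.
    racks = 11
    machines_per_rack = 40
    drives_per_machine = 4
    print(racks * machines_per_rack * drives_per_machine)   # 1760 drives for special collections alone

    archive_tb = 1.2 * 1024        # ~1229 TB, using 1 PB = 1024 TB
    drive_tb = 0.75                # assume 750 GB drives (an assumption, not stated)
    print(archive_tb / drive_tb)   # ~1638 drives for the historical web archive
    print(2 * archive_tb / drive_tb)  # ~3277 drives once everything is pair-mirrored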

JT: So it sounds like from what you've said there you're not really dealing with like a SAN-based architecture where you've got some massive appliance out there doing it; it sounds like it's very distributed?

GM: Correct, absolutely; it's all commodity hardware. It's all open source or in-house developed software, and in general it uses plain old disks, individually addressable on individual server machines, which we've acquired at the cheapest cost we can. Even for redundancy we tend to use full mirroring in pairs of disks or pairs of machines rather than other--.

JT: So instead of RAID5 or something like that you'd be doing RAID0?

GM: RAID5 is out.

JT: RAID1 actually.

GM: Essentially no RAID, but application-level mirroring of files between banks of machines.
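
As an illustration of what application-level mirroring (as opposed to RAID) might look like, here is a small sketch: write each file to two independent machines and record a checksum so either copy can be verified later. The hostnames, paths, and use of rsync here are hypothetical, not the Archive's actual tooling.

    # Hypothetical sketch of application-level mirroring; not the Archive's tools.
    import hashlib
    import subprocess

    def sha1(path):
        h = hashlib.sha1()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                h.update(chunk)
        return h.hexdigest()

    def mirror(local_path, primary_host, secondary_host, remote_dir="/archive/"):
        digest = sha1(local_path)
        # Copy the same file to two independent machines instead of relying on RAID.
        for host in (primary_host, secondary_host):
            subprocess.check_call(["rsync", "-a", local_path, f"{host}:{remote_dir}"])
        return digest  # stored in a catalog so either copy can be verified later

    # mirror("crawl-20080618.arc.gz", "node-a17", "node-b17")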

JT: Backup isn't so much backing it up to another medium; it's just having it well spread. And is there any geographic redundancy?

GM: Right; there is no backup to a different medium, just other spinning disks, and the chief geographic distribution that we've achieved is via a partnership with the Library of Alexandria in Egypt. On two occasions in the Internet Archive's history they've been delivered a complete copy of the Web Archive as of that moment: once in 2002, and then again as a delta update in 2006. So the best bit of geographical backup that we have is the Library of Alexandria in Egypt, which is a really good mirror up to 2006, and that sort of massive delivery will probably happen again sometime in the future.

JT: There's no irony intended in the fact that one of the chief repositories of the world's knowledge is the Library at Alexandria?

GM: I think it's deeper than irony; I think it's meant to send a message that people remember, and we had better not repeat the mistakes of the past. I know that the story of the ancient Library of Alexandria has always been an inspiration, especially to Brewster, our founder; I think that's one of the origins of the name Alexa. So it is in our minds to take the long view and be something that's as useful, as interesting, and as comprehensive, but for the new digital internet era, and also something that learns from what happened there and tries to share the knowledge and spread it a little bit more, rather than having it all under one roof.

JT: When you do one of these massive data drops, I mean, is this the proverbial 747 full of hard drives?

GM: Yep, absolutely; in both cases it was in fact an entire set of active machines, with the data and the software, shipped over and then plugged in for permanent installation.

JT: You mentioned that you use a lot of open source and in-house developed software. I assume the underlying operating system is something open source-y?

GM: Yes, yes; we've moved over the years from a Red Hat version, to a brief use of pure Debian, to now using almost exclusively Ubuntu.

JT: What are your major day-to-day technical challenges? I know that I've gone to the site and occasionally you'll get one of these "we're having a little bit of a brain fart right now" moments. What are the things that would keep you up at night, or the things that cause that type of brief downtime to happen?

GM: Well, with our data center, given the size of it and the fact that we are running it on a very inexpensive basis, there's always some part of it which is in transition. There's always some data being moved; there are always some machines that need disks swapped. There may be a rack that is powered down and being relocated, and so forth.

And many times, when you're visiting the Wayback Machine, many sites will work but then another site that you go to look at will not, and that's because some of the machines responsible for a range of the index or for the content may be temporarily offline. So it's always worthwhile to check again on a scale of hours to a day or two later, in which case you may get a better result.

Sometimes this is also a question of whirlwind software updates and transitions; our recent transition between two versions of Ubuntu took a large portion of the Wayback Machine index offline for a few days. So those are the sorts of things that bite us. Sometimes they're worse; sometimes there are power grid fluctuations that can affect all of San Francisco [Laughs], and then things aren't fully up again until someone goes to the data center and rectifies whatever didn't come back up.


JT: Obviously intellectual property is becoming a big issue across both the computer industry and society. I'd like to approach it from a couple of different angles with you. The first is that it strikes me that you would provide a good initial resource for, say, a lawyer who was trying to prove out a company's behavior over time, or that something had really been said on a website. Do you get subpoenas for historical records of websites?

GM: We do on occasion; we are very popular among people who are doing all sorts of legal research, including things like patent prior-art research. In general, since what we have is available to the public, it's not necessary to subpoena us; you can visit our website and view it. So that's something that comes up occasionally but not very often.

JT: Is it the kind of thing that somebody could point to in court as real? I mean, have juries or courts considered it to be a valid source of evidence for a case?

GM: Occasionally, when it has been discussed in court, I know that people have testified as to the processes that are involved, so that the court can make that determination, but that's not an area that I have any real expertise in.

JT: Something you might also get hit with from time to time--I would assume that, like anybody else who stores records of what's on the internet, there's the potential for someone to come at you with a DMCA takedown. Do you get those?

GM: Sure, yes we do, and we respect them. We are a library; we are a non-profit library. And we have a gigantic cache of information, all of which was public at the time we collected it. So we operate under a tradition of openness: if it was okay to collect, then it's okay for us to show in a non-profit manner. But if people do have concerns we address them. Our policy is on the website. There are a number of ways you can prevent yourself from being collected, and a number of ways you can get content that you are the legitimate publisher of removed, if need be, after the fact.

We aren't really interested in adversarial collection; we're not trying to put anyone's content up against their wishes. So the combination of following internet standards like robots.txt, for avoiding collection in the first place, and giving people a number of options for excluding content they legitimately publish if it was collected inadvertently, has kept us from having big problems in that regard. I think most people see us as what we see ourselves as: a public resource, not something that's trying to misuse anyone else's content. So people are understanding and nice to us.
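
For example, a site owner who wanted to opt out of collection could do it through robots.txt. The directives below are standard robots.txt syntax; ia_archiver is the user-agent string historically associated with Alexa/Internet Archive crawling, though anyone relying on this should check the Archive's current exclusion policy:

    User-agent: ia_archiver
    Disallow: /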

JT: Do you ever find that you get what might be commonly considered to be abusive DMCA notices and how do you deal with those?

GM: It's not too common; I don't directly deal with those. That's handled by someone else, and we have excellent legal counsel; the EFF in particular helps us quite a bit when we have to evaluate these issues. So I haven't seen it as a big issue. I'm sure there are some crazy ones that come in, but if we can make a reasonable determination one way or the other using our legal counsel we do so, and I can't think of any that need highlighting.

JT: Yeah; I'm not looking for examples. And the last piece of the IP question, or not so much IP as just content, and then we'll move on: out of curiosity, I put in some fairly well-known adult sites, and you can see an interesting history of their front pages going back, you know, 10 or 15 years, and some of them have fairly explicit stuff on their front pages. Do you consider yourself covered by the common carrier aspects, or how do you deal with that issue?

GM: The legal aspect is really something I'm not qualified to comment on. But as a user of the internet and someone who works on the collection: anyone will tell you there's a lot of porn on the internet, so there's a lot of porn that gets collected when you're archiving the whole internet, and we want to provide an accurate picture of what the net is like. We don't review things that are in the archive; we don't apply our own standards of what should or shouldn't be in there. As for what grounds somebody would have if they wanted to raise a stink, and what our grounds would be, that would be a question for somebody else.

JT: Moving back into the technology: the web, the net, the internet have really started to metamorphose into a much richer environment, and whereas before, archiving the web would have meant snapshots of HTML, now your typical web page is full of Flash, and you're getting into Flex and Silverlight. How is the way the web is changing in the Web 2.0 era making it harder, or easier, or just different for you to keep an accurate snapshot?

GM: In a few ways people's advanced web techniques have helped us, as they've become more conscious of the importance of appearing in search engines: they make their content discoverable, or they provide alternative, accessible, simple paths to content to make sure that search engine crawlers find it. So even when people are implementing advanced features, they remember they have to accommodate the search engines and low-accessibility users, and they provide alternate paths, so we can still archive a useful image of their content.

However, as you suggest, a lot of the rich interactive aspects of the web, including Flash and JavaScript, can make things harder to crawl, and even if they are effectively collected and harvested, you may not have an image which reflects the way the site looked, or the way it mattered to the person who originally used it, because we don't have any archive of the backend server functionality. And we may not be able to replay rich content that was collected from our Wayback Machine in a way that approximates how it originally looked. So there are some challenges there.

We've been working to make sure that we do a better job of extracting links from JavaScript and from Flash, so we at least get all of the static resources that are referred to there. We've kicked around some ideas for how we might create a record of how people are using rich applications even if we can't reproduce them entirely in our archives, but that would be a much more manual process, and so we are hitting some walls as far as what technology can easily automate for us.
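
As a toy illustration of the kind of heuristic link extraction he describes, the sketch below pulls quoted strings that look like URLs or static-resource paths out of JavaScript. Heritrix's real extractors are considerably more elaborate; this only shows the shape of the idea.

    # Toy heuristic link extraction from JavaScript; real extractors do far more.
    import re

    LIKELY_URI = re.compile(
        r"""["']((?:https?://[^"'\s]+|/[^"'\s]*\.(?:js|css|gif|jpe?g|png|swf|html?)))["']""",
        re.IGNORECASE)

    def extract_candidate_links(javascript_text):
        # Collect quoted strings that look like absolute URLs or resource paths.
        return sorted(set(m.group(1) for m in LIKELY_URI.finditer(javascript_text)))

    sample = 'var img = "/banners/olympics2008.jpg"; load("http://example.com/data.js");'
    print(extract_candidate_links(sample))
    # ['/banners/olympics2008.jpg', 'http://example.com/data.js']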

There was certainly a great era for a while, and I'd say we're still in it, where some relatively straightforward software can collect a great portion of the net, a great portion of the world's online knowledge. To some extent these walled gardens--needing to log in before you see content, or going for a very rich, interactive, desktop-like experience--mean it's no longer the case that a simple approach collects them and represents them for the future. So that's a challenge we deal with, and a challenge all of our partners, the various libraries and archives around the world, are running into somewhat.

JT: Another related area: there's a lot of content now being created on the internet in things that are nothing like the web. One example might be all the content that's now available in Second Life. Is there any thought about getting into those types of environments?

GM: [Laughs] Funny you should mention that, because in fact that was one of the major topics of a somewhat whimsical presentation I made at the National Library of Australia back in April, about what these new challenges are. One of them is something like Second Life, and another is something like a social network, which you only see if you log in, and even then you only see a view from your own perspective. So I think our community has started to think about ways in which people might be able to opt in, for these spaces that are otherwise closed, or difficult to collect, or require collection via different protocols, and let some historical record be created.

Of course there are gigantic privacy issues involved; people may not expect their actions and their creations in some of these spaces to be archived, and those would need to be handled very delicately. But there are other people who would want to have a record, at least for themselves, or possibly for posterity, for their own future selves or their children, of what life was like in these new environments; they may turn out to be very important in the future, and the early days will be incredibly interesting to people. So I wouldn't say we've done a lot of work on it, but it is a topic that's coming up with increasing regularity. Some of the ideas are: can we generalize the idea of an automated agent, either independent or as a sort of person looking over the shoulder of some real participant, and have it collect an interesting record of these places? At some point maybe it just becomes like a screencast or a video recording of interesting experiences in these environments.

But at another level it might be possible to take a snapshot of these worlds at one moment in time, or at least a snapshot of all the architecture in these worlds, or of the extremely busy public places in them.

At some point it even begins to overlap with the fact that we are recording the real physical world at ever-greater resolution with things like Google Maps, satellite photos, Street View, and the automatic collaging of pictures or video feeds of real places from multiple sources; the same thing can apply to virtual worlds. It almost seems like the two worlds are meeting in our ability to sense them at a very precise level and record that, potentially forever.

JT: And I'm not sure I want to think about what posterity is going to think about a recording of my Twitter feed.

GM: [Laughs] And yet the thing that people who are online have to face is that this stuff may never go away; it does concern me. You can look back 10 or 15 years and find things that are in our Web Archive or in Usenet archives; the record is going to be there. The open question is how the future will interpret it.

JT: There was actually a fairly famous case in New Hampshire, where I live, of a person who was elected to the State Senate--actually the State House--and then people started pulling up his Usenet posts about killing police... He wasn't in the House very long.

GM: Right; and oh, there will be much more of that. There will also be the issue of whether or not the current and younger generations have a whole different view of this, because perhaps we will reach a point where everyone has equally embarrassing material on the net from when they were young and rash. And so perhaps it will become part of what people write off as youthful expression, contextually explained. You never know; interesting days ahead, I think.

JT: You know, your MySpace page from when you were 20 will be the equivalent of your really embarrassing high school photo. One more technical question before we get off it entirely: now that there are a lot of JavaScript exploits and things like that out in the wild on the net, do you make any effort to strip that out of the content that you archive?

GM: Currently, no. Much JavaScript does not work properly in the Wayback Machine if it wasn't collected completely or if it depended on its original location, but a lot does. And so visiting a potentially malicious site in the Wayback Machine is currently on a par, risk-wise, with visiting it on the live web. It is something that we might want to integrate in the future with things that mark dangerous sites, something like site blacklists, and proactively disable JavaScript on some sites for people, but right now our focus has been to represent the web exactly as it was, and that includes the dangers it had too. We can hope that by the time people are viewing things in the Wayback Machine the software vendors have caught up, but maybe that would be overly optimistic--.
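
As a sketch of what "proactively disable JavaScript on some sites" could look like at replay time, the hypothetical snippet below neutralizes script blocks in a capture whose original host appears on a blacklist. This is not a feature the Wayback Machine actually has, and a real implementation would also have to handle inline event handlers and other script vectors.

    # Hypothetical replay-time filter, not an actual Wayback Machine feature.
    import re

    SCRIPT_BLOCK = re.compile(r"<script\b.*?</script\s*>", re.IGNORECASE | re.DOTALL)

    def neutralize_scripts(html):
        # Replace each <script>...</script> block with a marker comment.
        return SCRIPT_BLOCK.sub("<!-- script removed by archive replay -->", html)

    def render_capture(html, original_host, blacklist):
        if original_host in blacklist:
            return neutralize_scripts(html)
        return html

    page = '<p>hi</p><script>doSomethingNasty()</script>'
    print(render_capture(page, "malware.example", {"malware.example"}))
    # <p>hi</p><!-- script removed by archive replay -->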

JT: You may be the equivalent of the smallpox repositories?

GM: Yes; well, nothing appears in the Wayback Machine currently that's younger than six months old, so it's always at least a bit of time in the past. But if in fact an exploit from years and years ago can wake up and cause new havoc years from now, there would be a risk that archives could provide that. And it might be something that we have to either quarantine, or warn people about, or try to retroactively apply the things that we've learned today about the dangers from a few months ago--like maybe take those blacklists, the things that drive the blacklist toolbars, archive them first, and apply them to the relevant periods of time.

JT: You're going to be speaking at OSCON at the end of July. Can you give us a little bit of a preview of what you're going to be talking about?

GM: Sure; I'm going to talk chiefly about the open source projects that I've overseen and developed here at the Archive in the past few years. Together they create a full toolset for creating your own web archives that are at least as capable as, or possibly even more capable than, what the Internet Archive provides to the public. Those projects include the Heritrix open source web crawler, which is an archival web crawler especially suited to getting perfect, really comprehensive copies of large segments of the web. Once you collect things with the web crawler, well, you've got a bunch of inert files on disks, and that isn't particularly interesting. So our next project after that is called Wayback, and it's an open source version of the Wayback Machine. It lets you take collections of data, possibly spanning many disks or many machines, index them, and then provide a browsing interface to them just like our public Wayback Machine.

Finally, there's a project we have called NutchWAX, which is a series of add-ons that allow you to use the open source Nutch full-text indexer to index your web archives. There are a few extra wrinkles when you're dealing with archived web content, because you may have the same URL at multiple dates, and other changes to what you'd expect. Another is that you won't be sending people who get search hits out to the live web; you have to serve them all the content that you've collected and indexed. It's like having a Google where you never visit the live sites; you always visit the cached versions, and there are many cached versions.
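
To illustrate that wrinkle, here is a toy sketch in which a full-text hit identifies a (URL, capture date) pair and the result links back into a hypothetical archive replay interface rather than out to the live web. It is not NutchWAX's actual API, just the shape of the idea.

    # Toy full-text search over archived captures; not NutchWAX's real API.
    documents = [
        {"url": "http://example.com/news", "timestamp": "20080115093000",
         "text": "olympic torch relay announced"},
        {"url": "http://example.com/news", "timestamp": "20080801120000",
         "text": "olympic opening ceremony schedule"},
    ]

    def search(query, wayback_base="http://archive.example/wayback"):
        hits = []
        for doc in documents:
            if query.lower() in doc["text"].lower():
                # The result URL points at the archived capture, never the live site.
                hits.append(f"{wayback_base}/{doc['timestamp']}/{doc['url']}")
        return hits

    print(search("olympic"))
    # two hits: the same URL captured at two different dates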

So if you put those three things together--Heritrix for collection, Wayback for browsing access, and NutchWAX for full-text search access--people can create their own focused web archives using this completely open source stack. If they're willing to put in a little extra work, they can also create their own gigantic web archives, and that was an important part of kicking off these projects a few years ago: to make sure that anybody who wants to do this enjoys the benefits of the best knowledge about how to do it.

There were a lot of crawlers that were each good in their own way but didn't do what an archivist needed, like saving all the HTTP headers or saving all the different content types. There were different ways to view the content, but they often didn't work so well on challenging content, or required you to browse around, say, a local directory tree of content, so we provided something which allows you to run it as a web service so people can browse. And it's been very important for our partners around the world, these national libraries and archives; many of them use this software to run their own crawls and build their own archives.

The whole world is opening up to the idea that the web is our printed culture now; it's our media, and if we don't take the effort to save it, it will disappear in a way that magazines and periodicals and books won't--you can stack those up in a closet somewhere and they won't just disappear. The web, if you don't fetch it before the company goes out of business or the content is yanked, could be gone forever. So we've built this stack of things. I'll be speaking at OSCON about what each of them does and how to use them, and I'll be showing the creation of a really small focused archive using those tools, as an invitation for people who have things they want to do related to archives, to web crawling, or to browsing sets of web content to get involved with the projects, to use them, and to know that this resource is out there--it's not something that can only happen inside the Internet Archive's big computer cluster. Anyone can do it.

JT: So we've been speaking today with Gordon Mohr, the Chief Technologist for Web Projects at the Internet Archive. You can see Gordon speaking at OSCON at the end of July. Thank you very much, Gordon.

GM: Thank you very much James; it's been a pleasure speaking with you.




8 Comments

Very interesting interview, thanks for this. Is there anything users can do at a page level to prevent being indexed? Many people writing and contributing online don't have robots.txt level access to prevent indexing of their own blog pages.

You can always choose to make your blog non-public; then people need a login/password to browse. When it's public, it gets archived. Pretty simple.

You can have something be public and not get archived by Internet Archive -- but you have to have access to the robots.txt file. I'm wondering if there isn't some exception that can be put in the content itself. And if there isn't something that exists that performs that function maybe the Internet Archive can lead the way by asking us to include a tag or some metadata in the page itself that means don't archive me.

Then again, I can see commenters playing havoc with something that simple. Yeah, it's a bad idea...

Non-Archived:

What you describe already exists. You can put this in your HTML header:

<META NAME="ROBOTS" CONTENT="NOINDEX, NOARCHIVE">

This is an extremely long but very interesting read. Gordon does a great job of thoroughly answering James' questions, and I found the questions James asked to be the exact ones that I was wondering about! Very well done, both of you.

I agree with Esd. Very enjoyable read.

James thanks for the interview.

One question - did you ever follow up with Gordon on the number of discs the IA is currently running? If so, can you pass that along?

It's not okay to take images from copyright holders and use them just because you are not making a profit...this is a common myth. If someone is the copyright holder they have the right to decide where an image is used; it's not public domain.
