Episode 34 Transcript | Untangling the Web

Brewster Kahle: We’re now going backwards and digitizing books, music, video. And we really want an open library system as opposed to a commercial answer to the whole thing. There’s so much wonderful things that are just not being read because they’re not that available, and people are going to read whatever it is they can get their hands on. Misinformation can be rife and just published out the wazoo.

Noshir Contractor: Welcome to this episode of Untangling the Web, a podcast of the Web Science Trust. I am Noshir Contractor and I will be your host today. On this podcast, we bring in thought leaders to explore how the web is shaping society, and how society in turn is shaping the web. My guest today is Brewster Kahle, who you just heard talking about his vision to create an all-encompassing online archive.

Brewster has spent his career intent on a singular focus: providing universal access to all knowledge. In 1989, he created the internet’s first publishing system, called Wide Area Information Server (WAIS for short), which he later sold to America Online. In 1996, he co-founded two sites to help catalog the web: Alexa Internet, which he sold to Amazon, and the Internet Archive. The Internet Archive is one of the largest libraries in the world and now preserves 99 unique petabytes of data, books, web pages, music, television and software of our cultural heritage. In 2001, Brewster implemented the Wayback Machine, which allows public access to the World Wide Web archive that the Internet Archive has been gathering since 1996. Brewster was elected a member of the National Academy of Engineering in 2010. He’s also a member of the Internet Hall of Fame, a fellow of the American Academy of Arts and Sciences and serves on several boards. Welcome, Brewster.

Brewster Kahle: Thank you. It’s great to be here.

Noshir Contractor: This podcast, of course, is Untangling the Web. But when I think of you and everything you have been doing in your career, I think of you as somebody who’s contributed to helping us rewind the web rather than just untangling it. And in that spirit of the Wayback Machine, I want you to take us back to 1992 when you first came up with the idea of WAIS. Tell us what prompted that. And it’s important to note that in many ways, WAIS was a precursor to the World Wide Web.

Brewster Kahle: Absolutely. The idea was of the internet, the opportunity was to build the library, well, of everything. Could you take the published works of humankind and make them available to anybody, but not just anybody, but any computer? Could we go and mush people, networks, and computers together? This was sort of the dream back in 1980 to try to figure out, how do we go and build this? First, we needed to go and build computers that actually could handle this. And Danny Hillis at MIT who I worked with had a great idea called the Connection Machine of making a supercomputer out of lots of little computers. And so I helped build that to go and try to make it so we could go and handle building the library of everything. And then I built WAIS and did that in 1989 and then made it publicly available for free on the internet in 1992, as you point out. The idea is to try to get publishing to go so that you don’t just have one big database of everything, you wanted to have people be able to have their own information in lots of different servers, a decentralized system. And that was the idea of WAIS. WAIS was kind of the search thing at that time.

Noshir Contractor: So as you looked at it, the library that you were building was a distributed search and document retrieval system where documents could be distributed all over the internet. And what you were providing was an indexing system, in the parlance of library talk, and you were trying to see how one could search for these documents anywhere on the web and then how one could retrieve it. At the same time that you were thinking about WAIS and how it fed into the World Wide Web, you also were thinking about a different product called Alexa Internet.

Brewster Kahle: So WAIS helped get people online and made everyone able to become a publisher. And could you even, you know, control the distribution of your works? Could you even charge for it? We made the first subscription-based service on the internet. We made the first ad-based system on the early web to try to help make that all work. But once we got kind of the commercial side going by ‘94, ‘95, then the idea was, we could turn to build the library. So Alexa Internet and the Internet Archive started on the same day in 1996. And one was a for-profit, and one was a nonprofit. And the for-profit, Alexa Internet, was to catalog the web. So we could start crawling the whole World Wide Web and trying to find related links. The thought was that the search engines were going to give up steam, that the keywords weren’t going to be enough to get you the right document out of billions. Well, I was kind of wrong, because Google has done such a fabulous job. But we do really need some of these other things like related links, like if you’re looking at a webpage, tell me am I, like, is this crap or is this good? What else have people said about it? How long has it been there? If I’m looking for other things like it or maybe other points of view, what can I go see? Maybe it’s actually now that we have disinformation being broadcast so widely on the internet, that this technology that Alexa Internet was really designed to do was important. And the idea was also to go and leverage the link structure of the web and the usage trails of the web. And the idea then, also, for-profits don’t last that long. So we said, okay, let’s go and build a contract into the soul of Alexa Internet that all the data that was collected would be put into this new nonprofit called the Internet Archive. So every day since 1996, it’s been donating data to the Internet Archive.

Noshir Contractor: If I recall correctly, they announced that they’re shutting down Alexa in May of this year.

Brewster Kahle: Oh, so sad. Yes. But it was a good 25 year run, which is a lot longer than most tech organizations, commercial ones, last, but nonprofits tend to last much longer.

Noshir Contractor: One of the ways in which I first encountered Alexa was basically as a way of understanding web traffic. A lot of people who do web science research would use Alexa data, you can know a little bit about the status of a website, how well trafficked it is. And so that’s one piece of metadata that you might consider when you’re looking at a website. But as you point out, Alexa was also archiving the web. And when you say it was doing that every single day, help me understand. Does it take a snapshot of the internet every single day? Does it sample it and say, I’m going to do every part of the internet once every week or month? How does that work?

Brewster Kahle: Let’s take the Internet Archive, this post-Alexa internet. What we do is, we have many different crawlers that basically go through the web – each one has different mandates. There are about 3,000 crawlers that run on any particular day. There are about 900 organizations now – libraries, archives, museums – that work with the Internet Archive, where they go and state particular mandates to these crawlers, that they say they want this particular subject area, they want this particular language, they want this particular whole country domain. They want it this deep, they want it this often. And a total of over a billion URLs every day get archived by the Internet Archive, to just try to keep up with what’s going on out there. Then we index it to make it available in lots of different ways, including the Wayback Machine.

Noshir Contractor: So tell us a little bit about how the Wayback Machine sits in some ways on top of the Internet Archive.

Brewster Kahle: We found that the average life of a webpage is about 100 days before it’s either changed or deleted. So we basically needed to try to keep up with that and then make all the out-of-print web pages available to people. So the way that the Wayback Machine works is completely simple. Fundamentally, it’s a line in a file for every URL we have. And it’s sorted based on the URL and the date. And every time somebody wants to look up a URL, we go and binary search this, well, multi-terabyte file to be able to find the most relevant page for that user. Every GIF, every JPG, every JavaScript file is indexed in this way. And by running it in a parallel computer, much like the Connection Machine, we’re able to go and pull these out thousands of times per second, for the millions of users that use the Internet Archive’s resources every day.
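
[Editor’s note: The lookup Brewster describes, a binary search over an index sorted by URL and then by date, can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the Internet Archive’s actual code: the in-memory list stands in for the multi-terabyte file, and the entry layout, the sample data, and the name closest_capture are all hypothetical.]

```python
from bisect import bisect_left
from datetime import datetime

# Hypothetical index: one (url, timestamp, location) entry per capture,
# kept sorted by URL and then by capture date. The field layout and the
# sample entries are assumptions made for this illustration.
INDEX = sorted([
    ("example.com/", "19970101000000", "store-a:100"),
    ("example.com/", "20050615120000", "store-b:220"),
    ("example.com/", "20200301080000", "store-c:310"),
    ("example.com/logo.gif", "20050615120001", "store-b:221"),
    ("example.org/", "20100101000000", "store-d:400"),
])


def parse(timestamp):
    """Turn a YYYYMMDDhhmmss string into a datetime for distance checks."""
    return datetime.strptime(timestamp, "%Y%m%d%H%M%S")


def closest_capture(index, url, wanted):
    """Return the capture of `url` dated nearest to `wanted`, or None."""
    # Binary search for the first entry that sorts at or after (url, wanted).
    pos = bisect_left(index, (url, wanted))
    # The nearest capture is either that entry or the one just before it,
    # provided the neighbor is still a capture of the same URL.
    candidates = [
        index[i]
        for i in (pos - 1, pos)
        if 0 <= i < len(index) and index[i][0] == url
    ]
    if not candidates:
        return None
    target = parse(wanted)
    return min(candidates, key=lambda entry: abs(parse(entry[1]) - target))


if __name__ == "__main__":
    print(closest_capture(INDEX, "example.com/", "20060101000000"))
```

[Because the entries are sorted, each lookup inspects only a logarithmic number of lines, which is what makes searching a very large index practical. Images and scripts are ordinary entries in the same index, as with the logo.gif line in the sample data.]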

Noshir Contractor: As you know, there have been several movements around the world, especially from the European Union, to legalize the right to be forgotten. And I imagine that the archive might make it difficult for people to have the right to be forgotten. What are you doing in the archive in terms of addressing this issue?

Brewster Kahle: Oh, yeah, a lot of the web was not really meant to be publicly available always. And so we take requests from people to remove things from the Wayback Machine, and those come in all the time from users, and you can write to info@archive.org and, you know, say what URLs or domain name, and then you have to try to prove that you own that so you can’t delete microsoft.com or something like that, and then it’s removed. And that seems to work pretty well.

Noshir Contractor: I was struck by a comment that you wrote that for the cost of 60 miles of highway, we can have a 10 million-book digital library available to a generation that is growing up reading on screen.

Brewster Kahle: You know, being brought up during the tail end of the hippie generation, right, so the utopian “let’s build a better world,” I took that all very seriously and being a technologist tried to figure out what could we do. We thought, let’s start with what became the World Wide Web. But then also, let’s do television, radio. So we’re trying to get good at those. But we’re now going backwards and digitizing books, music, video. And we really want an open library system as opposed to a commercial answer to the whole thing. There’s so much wonderful things that are just not being read, because they’re not that available, and people are going to read whatever it is they can get their hands on. And this next generation is going to learn from whatever they can get. And it’s, a lot of it’s crap. Misinformation can be rife and just published out the wazoo by anybody with some budget, because a lot of the good materials are locked up behind paywalls, are still in print, or just, they haven’t really moved into the bigger picture of the opportunity of the internet. And so we’re gonna want to put the best we have to offer within the hands of our children.

Noshir Contractor: So it sounds like while you began by trying to create an archive of the internet, you’re now moving more towards creating an archive on the internet.

Brewster Kahle: It’s a good point. Absolutely. We’ve got maybe five or six million books that have been digitized. And we’re starting to do periodicals. First going and digitizing these for the blind and dyslexic. Then we make it somewhat available, you know, to, for instance, machine learning researchers, but also through borrowing, interlibrary loan, controlled digital lending, those sorts of things. You shouldn’t have to be at Yale to be able to see some of these good works.

Noshir Contractor: At the end of the day, even what is digitized is being supported on some material resource, whether it’s a disk drive or something else. I recall reading that you were inspired by the Global Seed Vault idea of trying to keep one physical copy of perhaps every book. Now maybe it’s not a physical copy as in papyrus or paper, but a digital storage record. And talk a little bit about the fragility of all of these different media that we have, starting with paper, but including many of the servers that you have and how often you have to be careful to make sure that those servers don’t get obsolete or die.

Brewster Kahle: I mean, it’s such a problem. You see these beautiful pieces of papyrus from 5000 years ago, it’s great. But it seems like it’s getting shorter and shorter. So some of these new technologies like microfilm and microfiche, they were reported that they could last 500 years. And so we’re starting to collect the microfilm and microfiche not only to preserve the microfilm, microfiche, but to then also digitize it. So we’re moving forward, but we’re always keeping the physical materials. The Internet Archive works with other libraries that have these large physical archives to keep these.

Noshir Contractor: You have also argued that the value of digital archives is not just for historians, but also to help resolve common infrastructure complaints about the internet, such as adding reliability to 404 document not found. Tell us a little bit more about what you see as the value in that space.

Brewster Kahle: Yeah, at least let’s fix some of the bugs on the web. The 404 document not found is just bad engineering. We made a little extension that you can add to your browser such that if any of a number of errors come up, then we’ll probe the Internet Archive Wayback Machine and see if it’s got it. I think, also, the big opportunity is thinking at scale. My friend Jesse Ausubel put it: humanity got a long way with a microscope; what we need now is a macroscope, an ability to step back, understand the bigger trends. There’s a great interface on top of the Television Archive that’s just the transcripts that were taken by a fellow named Kalev, and he made GDELT, and you can go and do queries to find out terms, how much were they on one cable channel versus another over time, and you can start to see biases in these bubbles by stepping back and getting a bigger picture of what’s going on. I think that’s absolutely critical. People are very good at getting excited about some tweet or blog posts or Facebook something or other, some cable news, dramatic whatever. And it’s difficult to put it in context. If there would be a wish that I’d have for the next 10 years of web science and the like, is let’s build context into our online experience. So that’s not necessarily fact checking. It’s, what were the debates around it? What’s the information around it? It’s all the sorts of things that scientists in the academic publishing used to know about before the paywall sort of took over. This, this whole approach, I think we need to bring that to a much broader population.

Noshir Contractor: In many ways, what you’re talking about is a much more nuanced version of what we sometimes call metadata, that is, data about the data in this particular case.

Brewster Kahle: So Bill Dunn was one of my mentors. He did the electronic side of Dow Jones. He was the first purchaser of a Connection Machine to go and do full text search. And he had this saying back in the mid-80s that the metadata is more important than the data itself. And that’s what Google leveraged with their anchor text and PageRank. It’s what Alexa did by looking at user trails to be able to find: people that like this webpage, what web pages did they like even more? The importance of the internet is not the computers at the edges, it’s that we’re all connected.

Noshir Contractor: So you mentioned GDELT as being an example of a repository that has become incredibly helpful to scientists, including web scientists, to study ways in which information is flowing and how it impacts public opinion and so on. Given that you have built this incredible Internet Archive, and given that there are so many people who are using it to help understand society today and how it has been in the past, can you share with us some of the most interesting insights that you have learned from what you or others have found by using the Internet Archive as a way of studying society?

Brewster Kahle: Oh, I wish I had more time to go and study society. Mostly I’m just a librarian building the darn thing. We went and studied all of the political ads in the United States to understand the wash of money that’s going over the media system based on Citizens United and other decisions by the United States to allow corporations to pay for politicians. And it’s fascinating to see how much money and just the barrage of ads that you would get if you were in a battleground state. I mean, you couldn’t flip channels fast enough to not be seeing an ad at all times. So there are these things you can kind of see by stepping back. Let’s see. The World Wide Web made it so that you could take unpopular websites and they could become popular, which is a really good sign for going and having an ecosystem that’s alive. When you get too much either regulation or too many monopolies going and controlling things, a lot of that will slow down and stop. And I’m very excited about the decentralized web technologies. Let’s see another round or two of these to go and put people back in charge of some of these technologies rather than just these very large corporations that have started to take over whole media types. Let’s build open systems that lots of people can play. I like games with many winners.

Noshir Contractor: It sounds a bit like history repeating itself, because when the World Wide Web and WAIS and other technologies were spawning, it was also in response to corporations at the time, think of companies like AT&T, for example. The mantra was decentralized. Now we are saying we want a new wave of decentralized technologies. So was there a cycle where this decentralization gave way to a certain level of centralized authority that we now need to renew our efforts at decentralizing the web?

Brewster Kahle: We certainly need to renew our efforts. But the decentralization never needed to come. If you actually had government antitrust law that was actually, you know, used as much as it was before 1980 when things started to collapse in terms of antitrust, then I think we would have had an ecosystem without having to go through revolutions. And I’m hoping that we invent something better.

Noshir Contractor: Speaking of inventing something better, I was fascinated by a recent blog posting by you titled “Imagining the Internet: Explaining our Digital Transition.” My understanding here is that you have talked about the different metaphors that we have used to talk about the internet from the time it began. Tell us a little bit about what those metaphors are and how you see us trying to imagine the internet of the future.

Brewster Kahle: If we’re trying to, you know, look forward, I find looking backwards and seeing the trajectory we’re on to try to understand where we were going might be useful. So what I did is I went and tried to look at what was the metaphors that people had for the internet and tried to track that change over time, if you will. So the first one, I would say, would be the library. But then it moved and it started to become other things, like, just portrayals of a raw network. I would say then cyberspace was a term that people use. So it was far away. Then it started coming home towards being a frontier, the Electronic Frontier Foundation. It was a wild west and it had to be navigated. Then there was information superhighway. We moved to surfing. So now, it’s not just some place out there. But now you can experience it, you can ride on it, you can use it. Then I would say the next one was the Facebook, right? The idea of the Borg. Your cell phone was glued to your face. So now, where does it go from here? I would say, the thing we’re wrestling with around now is algorithms. If that’s where we are, then what happens next? And I would say machines are starting to not need us anymore. [The machines sort of detach.] In The Matrix in ‘99, Agent Smith has this terrific rant about people being a disease.

Noshir Contractor: That does paint a somewhat dark picture of where we are headed.

Brewster Kahle: I mean, you can’t see a movie these days without it being frickin’ dystopian. People are anxious. They are really worried about what’s going on. They are not feeling in control. I would like to make it so people have a feeling that they’ve got some level of control of what it is they’re reading, what it is they’re writing, where it’s going, their privacy, their sense of self, their friends. And we have done almost everything we can to strip that away from them. I do like the Alan Kay “don’t predict the future, go and invent it.” We as technologists should do a better job than just go to, “Hey, let’s go make a ton of money and be like a rich internet mogul.” Let’s leave a better environment for people to be the most they can be, they can be creative and feel safe and achieve and build and grow. That’s what our technologies and our internet should be for.

Noshir Contractor: And that’s a wonderful place to end this conversation. A very upbeat note, very inspiring. Brewster, thank you so much again for joining us and for all the work that you’ve done in helping us to understand the archive of the internet and to, as I said, to rewind the web and the Wayback Machine. I would certainly recommend folks take a look at the blog entry that Brewster has just been summarizing at brewster.kahle.org as well as play with the Wayback Machine if you haven’t. It’s a lot of fun and somewhat embarrassing to go back and see what kind of websites we created back in the ‘90s and also in the early part of the century. So thank you again, Brewster so much for joining us today.

Brewster Kahle: Thank you very much, Noshir.

Noshir Contractor: Untangling the Web is a production of the Web Science Trust. This episode was edited by Susanna Kemp. I am Noshir Contractor. You can find out more about our conversation, whether you are listening to us today or via the Wayback Machine decades from now, in the show notes. Thanks for listening.