Episode 18 Transcript | Untangling the Web

Matt Weber: The Internet Archive has about — last count — nine petabytes of archive data. I wouldn’t be able to begin to tell a student how to begin cracking open that repository; we don’t really have the tools for that yet. So in many ways, we’re still developing the technology to be able to look at some of these questions at scale.

Noshir Contractor: Welcome to this episode of Untangling the Web, a podcast of the Web Science Trust. I am Noshir Contractor and I will be your host today. On this podcast we bring thought leaders to explore how the web is shaping society and how society in turn is shaping the web.

My guest today is Matt Weber — you just heard him talking about some challenges and questions related to web archiving. Matt is a faculty member in the Department of Communication at the School of Communication and Information at Rutgers University. With more than a decade of experience researching information ecosystems, organizations and communities, Matt focuses on the use of large-scale web data to study processes of change. Some of his current areas of focus include work and algorithms and knowledge, public policy processes, production of media and the science of communication within these information ecosystems. In addition, Matt has been an active member of the web science community. He’s the program co-chair for the ACM 2021 Web Science Conference, and delivered a keynote at this year’s conference just earlier this week. Welcome, Matt.

Matt Weber: Thank you, Noshir. It’s a pleasure to be here.

Noshir Contractor: To get us started, take us back to when you first got interested in looking at the web as a changing process. What got you interested in this?

Matt Weber: I came into academia having been a marketing professional working in the media industry. I worked at an ad agency for a number of years, and then ended up working at the Chicago Tribune, Tribune Corp, and saw firsthand how badly companies were responding to shifts in communication technology and shifts in news media production. So come 2008, I found myself as a graduate student having the opportunity to head over to Oxford to take part in their summer doctoral program. And it happened that that year the Web Science Trust sponsored the summer doctoral program at the Oxford Internet Institute. And so that experience was my introduction to a lot of the core ideas of interdisciplinary research that are central to web science scholarship. And so from very early in my career, I started to see these intersections between the broader questions that I wanted to ask about the interplay between technology and information ecosystems, and some of the core questions that were being asked by scholars who are working in the area of web science.

Noshir Contractor: A lot of the work that you’ve done is based on the assumption that the web is not a fixed entity, and that in fact, it is not just growing, but also in many ways, losing a lot of what is on it. The web itself is extremely ephemeral. And you note in your work that in a study of 10 million web pages, researchers found that the average web page remains live for barely three years, and that a study of Twitter data focusing on major social events found that 11% of relevant tweets were not available after one year and 27% were not available after two years. How does this ephemeral nature of the web and Twitter affect our ability to understand society?

Matt Weber: So many of us who study topics related to the web look at single snapshots of instances that we record when we engage in our scholarship. And it’s rare that we look backwards in our research or that we have the opportunity to look at larger periods of time when we engage in web-based research. A lot of that has to do with the availability of data, a lot of that has to do with access to data.

This is a question that I confronted when I first started my research. I came into graduate school and I wanted to study how news media had been adapting to web-based technology. By that point, most newspapers already had web pages. I didn’t have a time machine; I didn’t have a way that I could go back in time and record what I wanted to see. So, going back to that opportunity I had to be at the summer doctoral program: it happened that the web science group had brought in a speaker from the Internet Archive. Well, I found my time machine.

In that moment, I found this resource that had, at that point, been archiving for 12 years all of the available web data they could get their hands on. And so I spent a lot of time as a graduate student building the foundations for researchers to open up archived web data and gain access to these larger swaths of data that previously had been inaccessible.

Now to answer your question about the ephemerality of the web and web-based technology, that’s really key. Having archives and having repositories allows us to examine change over time. And it allows us also to see what has been lost over time.

Noshir Contractor: What does it mean to create an Internet Archive? Is it taking a snapshot every day, every minute? How does that work?

Matt Weber: When we think of a term like internet archive, many of us, at first glance, think, “Oh, I’ll be able to go and look at the web page and just replay it across time.” And the reality is far different from that. The Internet Archive itself — archive.org — the initial example of what an internet archive is, was started with the idea that this would become the library of the world. It would become an online home for all web content, digitized music, digitized books; this would be a go-to resource for anyone looking for free information, free knowledge on the web. In many ways, that’s what the Internet Archive has become. But the concept of internet archiving is much broader than that. The Internet Archive is an example. But then look at other organizations like the Library of Congress, the British Library, countless national libraries across the globe, that have all created their own separate repositories to archive either their national web domains or specific subdomains of content that are of interest. The question within each of those domains is what is actually being preserved? Each librarian or each archivist who is working within the library has to make a decision about how often do we record webpages? How detailed are those records going to be? How much of the content are we going to preserve? How accurate is the replay going to be? And what we find is an incredibly wide variance across these libraries.

Noshir Contractor: I want to turn to this question: given that we have the data, and given that we now are able to roll back and essentially play, like a movie, how the web looked at different points in time, what are the kinds of questions that you think we can begin to ask now in ways that we weren’t able to do before?

Matt Weber: Before I answer the question about the kinds of research that we’re able to address with this type of web archive data, I want to point out that one of the fundamental problems we still have is that even with all of the advances in technology, we still are not very good about accessing this type of archival web data at scale. The Internet Archive has about — last count — nine petabytes of archive data. I wouldn’t be able to begin to tell a student how to begin cracking open that repository; we don’t really have the tools for that yet. So in many ways, we’re still developing the technology to be able to look at some of these questions at scale. There are a number of researchers right now, including myself, who are working to tackle these challenges. There are obviously a whole host of questions that we can start addressing. For me, I think one of the most fascinating things is to be able to look at different aspects of our information ecosystems, the environments within which we live and operate today, to understand how those ecosystems are evolving over time.

Noshir Contractor: Can you share some insights about what are the kinds of things we are able to learn that we were not able to do before we began in web science to study the internet archives?

Matt Weber: One area where I’ve been looking for the past five, six years is at the growth and demise of various aspects of our local news ecosystem. And by looking at scale and leveraging web archive data, we’re able to unpack specific findings such as the connection the percent of minority residents within a community has to the overall health of the information ecosystem in a local community. For instance, we see that as there is a greater and growing Hispanic population, there is a pullback on the part of corporate news organizations in terms of the amount of content that’s being provided to that community. We see more niche newspaper outlets coming in to fill the gap. And that’s a story that without being able to look back through the repositories that we’ve built up, we would never have been able to detect and to pull out.

Completely different topic, we have a repository built around social media data, and news media data tracking the events around both Superstorm Sandy and Hurricane Katrina. And you’re able to see how community partners were able to work together in very niche micro clusters to basically fill the gap of information in the early weeks, early days after each of these disasters, to create a nexus for information for the communities that were affected.

Noshir Contractor: As you look at these insights that we get from people studying past events on the web, are there ways in which the insights are helping us come up with actionable ways of doing things differently moving forward?

Matt Weber: When we talk about web archives, the term archive is maybe a misnomer in many ways, because it implies something that’s been archived, stored and put away. We’re talking about data that’s contemporary today, and then chronologically works backwards. And so much of the web archiving research that we’re talking about, is present to modern day, but allows us the ability to then look at the evolutionary path going backwards. And so again, I’ll come back to my own work, right now, looking at local news information ecosystems — we’re leveraging that work today to advocate for policy solutions in a number of states that are looking to create new models to support local media environments.

Noshir Contractor: You mentioned earlier some of the challenges that we face, that there are technological challenges associated with being able to navigate a study of the web. Yet, the combination of new tools and metadata formats has demonstrated that some of this analysis can be conducted more affordably. Tell us how hopeful you are about this trend towards making these data more accessible.

Matt Weber: So let’s start first with advances in computing. We’ve seen significant progress in our ability to work at scale on the web. And web science, being the interdisciplinary home that it is, is a perfect venue to be talking about this type of scholarship and this type of education. With regards to research, increasingly work that took a supercomputer, work that took a computing cluster, can today be run through Amazon Web Services on your laptop.

Add to that some of the more programmatic technological advancements. All of that goes to say that we can work with larger sets of data in a fashion that is much easier than it was even two or three years ago.

The challenge with web archive data is still in translating between, say, your library and the WARC file format. There are groups out there that are making a lot of advancements in this area. And I think in the next three or four years, we’re going to see even more gains that are going to make this research much more commonplace.

Noshir Contractor: That sounds exciting. We’ve been celebrating all the incredible insights we can get from looking at the Internet Archives. But do you also have examples of concerns about limitations, things that may be lost in a biased or systematic fashion, that might then in some ways limit the confidence we have in our inferences based on looking at the Internet Archives?

Matt Weber: That’s a fantastic question. I think it’s a fundamental problem with web archiving that hasn’t been fully addressed. The process of archiving works very much like a network: you start with a few central nodes, and you archive out from those central nodes. You pick your starting point and say who’s linked to these nodes, and then who’s linked to those nodes, and you continue to crawl on. And so this sets a dominant hierarchy for what is going to be archived, what is going to be stored. If you are a niche community on the web — if you are, say, a group of minority-serving newspapers in Newark, New Jersey that has a very small web presence, but a very strong impact in your community — if you’re not connected to the main network of media organizations, and the main network of information websites in the state, most web archiving platforms will miss you, will simply skip over you as if you never existed. And so when the researchers then go back to use this archive, to leverage this archive for their own work, if they’re not aware of these gaps that may exist in the archive, the presumption will be that those websites never existed, that those communities never had access to information that was being provided, even though there was a very robust community there. And I say this, I use a Newark example, because that’s exactly what happened in my own work.
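The crawl process described above can be sketched in a few lines of code. This is only a toy illustration, not how any particular archive actually crawls: the site names, the link graph, the `crawl` function and the depth limit are all hypothetical, and the sketch assumes the crawler discovers new sites only by following outgoing links from pages it has already visited.

```python
from collections import deque


def crawl(link_graph, seeds, max_depth):
    """Breadth-first crawl: return the set of sites reachable from the seeds."""
    archived = set(seeds)
    queue = deque((seed, 0) for seed in seeds)
    while queue:
        site, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not follow links beyond the depth limit
        for neighbor in link_graph.get(site, []):
            if neighbor not in archived:
                archived.add(neighbor)
                queue.append((neighbor, depth + 1))
    return archived


# Hypothetical toy web: each key links out to the sites in its list.
link_graph = {
    "state-daily.example": ["regional-news.example", "wire-service.example"],
    "regional-news.example": ["state-daily.example", "city-blog.example"],
    "wire-service.example": [],
    "city-blog.example": [],
    # This site links to the mainstream press, but nothing links back to it.
    "newark-community-paper.example": ["state-daily.example"],
}

archived = crawl(link_graph, seeds=["state-daily.example"], max_depth=3)
missed = sorted(set(link_graph) - archived)
print(missed)
```

In this toy graph the community paper exists and even links to the mainstream sites, but because no crawled page links to it, it never enters the archive. A researcher looking only at `archived` would have no sign it was ever there.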

Noshir Contractor: And I imagine that everything that you’ve just described, if it is a problem in New Jersey, I can only imagine how much more of an issue that is, in other parts of the world, in the global south, which have a much weaker digital footprint in some ways. To what extent do you see that as a limitation in terms of our ability to make inferences?

Matt Weber: You mentioned the global south. We can travel around the globe and pick your example; let’s go, say, into India and look at news provision in India. A lot of that is either A. still happening through printed paper or B. happening on technological platforms that skip what we know to be the mainstream web. So a lot of news dissemination happens via apps like WhatsApp, which are increasingly community-based platforms for spreading news and information through a community. And all of that is overlooked by this traditional web archiving type of technology.

As web scientists, as researchers studying in this domain, we have to be increasingly attentive to multi-method research that allows us to more accurately represent the gaps that may exist in dominant modes of data collection. Unless you engage with the communities and talk to people living in these communities to understand how they’re getting access to information, how they’re getting access to news, you wouldn’t understand where those gaps were. And so more and more today, when we have greater access to data at scale, we simultaneously need to be leveraging partnerships with other scholars, partnerships within communities, to better identify where the gaps exist in the data that we’re relying on for our research.

Noshir Contractor: How concerned are you that, as we move towards platforms like WhatsApp, for example, platforms which are often in some way, shape or form walled gardens, archives may not necessarily be tapping into what is happening within these sorts of private spaces?

Matt Weber: We were deeply concerned about walled gardens a decade ago when newspaper companies started putting up paywalls and limiting access to certain types of content unless you were a paying subscriber. At a very different scale, and a very different level, we’re having a similar conversation today. When we talk about walled gardens, we’re talking about information that we can’t access. Now, some of that today is happening because of increased concerns on the part of consumers around privacy. Part of the shift to messaging and information dissemination on platforms like WhatsApp comes from an increased desire on the part of consumers to have privacy in the information that we’re sharing. This creates a lot of challenges for us as scholars, in terms of the information that we study and the information that we hope to have access to in order to better understand the social lived world that we engage in, day-to-day. There are no ready answers for this. But the lessons that we’ve learned from the past two decades of research, examining web data, living in the world of web science, have prepared us to better tackle these questions going forward. Hand-in-hand with that, we talked about Twitter as a great example of a company that’s opened up data, and we’re seeing pressure on other companies like Facebook to make some of their data more available. And so we are seeing some forward progress in terms of opening up other platforms.

Noshir Contractor: You brought up the issue of privacy amongst individuals as being a major driver of these moves to platforms that provide walls within which people can discuss, which brings me to one of our closing points here. Article 17 of the GDPR — the General Data Protection Regulation — talks about the right to erasure, or the right to be forgotten. To what extent do you see this right to be forgotten, and giving people the ability to go back in time and delete some of the information that has been aired about them, as a serious concern in terms of being able to study the archives?

Matt Weber: The joke has always been something along the lines of: Be careful what you say on the web, because once you say it, it’s impossible to delete. Web archives are the embodiment of that challenge. Once content has been stored into a web archive and preserved in some fashion or other, it’s technically very hard to go back in and scrub every mention from every archive. We as researchers have to be very careful about what we share in terms of personally identifying information when we go back and use web archives in our scholarship.

And on the other hand, the archivists themselves also have an obligation to better understand how we can work across web archives to adhere to standards like what GDPR has set forward with regards to the right to be forgotten. I expect that that type of legislation will continue to grow over the next decade. And we will have to find other ways to make sure that even going back in time, you have the right to have your information removed from these types of repositories.

Noshir Contractor: Given how much you have been thinking deeply about web science and the web over the course of your academic career already, what do you see as some of the most challenging issues that you or others within the web science community need to be placing more of an emphasis on than we currently do?

Matt Weber: In this conversation alone, we’ve hit on a number of the key themes that are pressing issues for the web science community at large, but also for the group of scholars who are thinking about the data that we have available to address these questions. And I would include web archiving in that set of data. We need to do a better job of thinking about how we address privacy in the data, privacy rights in the data that we are accessing. We need to do a much better job of making sure that a diverse set of populations are accurately represented in the data that we’re using. I think on both of those fronts, there are decades’ worth of unanswered questions that I know many of us are working to address. Those are critical areas right now.

Noshir Contractor: Wonderful. Well, again, I want to thank you so much, Matt, for taking time to talk with us. You’ve done so much in helping us recognize the importance of looking at the web as an ephemeral, changing, dynamic process, and telling us about how we can learn so much about society by not just looking at a snapshot of the web at one point in time, but by essentially rewinding and playing back how the web has constantly been changing over the last few decades here. And I thank you again, for both your research and your engagement with the web science community. As I mentioned, you’re the program co-chair for the ACM 2021 Web Science Conference. And we all look forward to listening to the keynote that you will be delivering on the 21st of June. So thanks again, Matt. And we look forward to learning more about your research in the years ahead.

Matt Weber: Thank you Noshir, I really appreciate the conversation.

Noshir Contractor: Untangling the Web is a production of the Web Science Trust. This episode was edited by Molly Lubbers. I am Noshir Contractor. You can find out more about our conversation today in the show notes. Thanks for listening.