Fil Menczer: Astroturf is alive and well, unfortunately, and it’s getting more sophisticated and harder to detect. And so in some sense, it’s job security. There’s no shortage of research challenges, you know, even 10 years later to try to identify this kind of manipulation.
Noshir Contractor: A decade ago, Fil Menczer was studying digital astroturfing right as it was ramping up online, and he’s continued with that work. But that’s not his entire breadth of research. Fil is a distinguished professor of Informatics and Computer Science at the Indiana University School of Informatics, Computing and Engineering. He’s also the Director of OSoMe — not just the word awesome, but the Observatory on Social Media. Shortened, it becomes OSoMe, pronounced “awesome.”
His research spans web science, computational social science, network science, and data science. He focuses on analyzing and modeling the spread of information and misinformation in social networks, and detecting and countering the manipulation of social media. Besides all his professional activities and accomplishments, Fil has been an early fan of the web science movement, and in fact organized the Web Science Conference in 2014. Welcome, Fil.
Fil Menczer: Thank you very much for having me, Nosh.
Noshir Contractor: Let me start with something that I know you spend a lot of time thinking about and are uniquely positioned to help kick us off with here. How do you think social media can be manipulated for the spread of information?
Fil Menczer: Essentially, you know, social media are platforms that let people communicate and share their opinions and their thoughts. And everybody also has a responsibility in spreading other people’s opinions that they agree with. So in some sense, we’re all editors, but we don’t all have, you know, the ethics and the experience and the skills of journalists. So we’re vulnerable to being misinformed, and we’re vulnerable to spreading misinformation ourselves.
On top of that, platforms have all kinds of mechanisms that they use, for various reasons, often very good reasons. For example, trying to figure out what’s interesting and making recommendations about who to follow or who to friend, or what to pay attention to. And our research shows that all of these mechanisms have some unintended consequences. So for example, showing people how many people have liked a video makes them more likely to look at it. And that’s something that can be gamed. Or recommending a friend of a friend might accelerate the formation of echo chambers, where you are exposed to less diverse points of view, and perhaps even more vulnerable to, you know, being manipulated.
And then on top of all of that, platforms generally have APIs — application programming interfaces — which are ways in which one can write code and programs to interface with these platforms. On the one hand, this is wonderful because it allows us to collect data and do research. It also allows different people to come up with new applications, new ways to use this data. And those are good applications. But at the same time, it also allows bad actors to manipulate the platform by creating fake personas, by impersonating people, by creating the appearance that many, many people are sharing your opinion, or angry or happy or supporting an idea or attacking a candidate, when in fact this is all the work of maybe one single entity. And so people can be tricked, because our natural, you know, cognitive and social biases lead us to trust things that come from our friends or to pay attention to things that look like they’re getting a lot of attention. And those things can be gamed.
So it is really easy, actually, to create social bots; that’s the term that we came up with several years ago to identify these inauthentic accounts. And then those accounts can be used to game and manipulate and also to amplify the spread of misinformation. We’ve shown that in our work as well. So it’s not a simple answer. There are very complex interactions between different algorithmic biases and social and cognitive biases that play together in creating this ecosystem of information, which unfortunately is vulnerable in many ways.
Noshir Contractor: You were one of the first people, if I remember, talking about astroturfing on social media. Can you tell us a little bit more about how far we’ve come, both in the growth of astroturfing and in how we can combat astroturfing today?
Fil Menczer: Yeah, in fact, it was 2010 when we started collecting data from Twitter on a large scale, and actually there is a connection to web science there. I don’t know if you know the story. But at the Web Science Conference in 2010, there was an article by (Panagiotis) Takis Metaxas and Eni Mustafaraj on a bunch of fake accounts that had attempted to manipulate a special election that was happening in Massachusetts at that time to replace Kennedy, who had died. And they found that the night before the election, a bunch of fake accounts pushed some misinformation about the Democratic candidate. And that generated a lot of traffic, even though Twitter took down those accounts very, very quickly, because they were doing typical things that spammers do. And despite this, on the day of the election, if you searched the name of the candidate on Google, you would find this fake news, because those social bots had been successful in creating a viral cascade. And then Google picked up that signal in their search engine.
So that was a fascinating paper; it actually got the best paper award at Web Science. And so as I was watching it, and talking with, you know, with Takis and Eni afterwards, I was thinking, you know, is this an isolated incident? We need to get more data and see if this is, in fact, just the tip of the iceberg. And that’s where we started this whole study of manipulation of social media, and astroturf, which is like fake grassroots campaigns. And what we found, in fact, is that it was very widespread. When you looked systematically at everything that was being shared on Twitter about the elections, and that was a midterm election year, there were thousands and thousands of memes and links to fake news. And that’s the year that we found the first instances of fake news websites, and we found bots that were coordinating to support a candidate, to amplify it and make it trend, and bots that were spreading fake news — real fake news, like completely manufactured, made-up attacks against candidates, and then targeting journalists trying to get it to go viral. And that’s when we realized this was a system that was extremely vulnerable.
And our first tools to detect this were based on looking at the structure of the network, of the diffusion networks. And that gave us some good signals, so we could build very simple machine learning algorithms to try and detect these kinds of astroturf. And over the next 10 years, you know, that has continued, and now we’re, you know, looking at individual accounts that may be inauthentic, as well as coordination that can happen even without automation: they may not be bots, but they may be a bunch of accounts that are run by people but that impersonate other people. So even though it looks like it’s 1,000 independent voices that are pushing a particular message or conspiracy theory, it’s really one, you know, entity that’s controlling all those accounts, even though maybe they are using software, maybe they’re not. So astroturf is alive and well, unfortunately, and it’s getting more sophisticated and harder to detect. And so in some sense, it’s job security. There’s no shortage of research challenges, you know, even 10 years later, to try to identify this kind of manipulation.
Noshir Contractor: Could you talk a little bit about the network by which these messages spread? Is there a way to tell whether a message was astroturfed? In other words, artificially made to spread as compared to one that was truly organic and truly grassroots, rather than artificially grassroots?
Fil Menczer: Those were the early days, when a lot of these kinds of manipulation were easier to spot than they are today. It was easy to detect some of these manipulations, but of course, there might be other astroturf and social bots and malware and manipulation that we did not catch, you know, so we only know about the things that we did find. But among those, at that time, our intuition was that, like I said, the structure of the network could provide useful cues. So what we did is we built this diffusion network, where a node is an account, and a link between two accounts identifies either a retweet — at that time quoted tweets didn’t exist yet — but it could also be a mention or a reply.
And so now we have this network with different kinds of edges. And we can look at things like, you know, how influential a node is by looking at how many times it is retweeted. So we could look at the distribution of hubs or popularity or influence among the nodes and extract statistical features. For example, you know, the skewness of the distribution of the degree, or of the strength, which is the weighted degree of these nodes. We could also look at the community structure: was the network fragmented into many different groups, or was it like one big connected component? And also, you know, was the idea, whether it was a link to a fake news site or a hashtag or whatever, injected by many independent people?
And also, we could look at the distribution of the weights on the edges, right? So for example, if you have two accounts that retweet each other thousands of times, you know, that would show up as a very heterogeneous distribution of edge weights. And that was a very strong signal. The very, very first two bots that we discovered, we found in that way: there was this edge between two accounts that had a weight of 10,000, and we thought there was a bug in the code. And eventually we realized, no, no, this is no mistake, these two accounts were retweeting each other 10,000 times in the last week. And so we looked at them. And then we realized, “Oh, my gosh, these are obviously bots.” They were two accounts that were just automatically posting and reposting things at very, very high volume. Now today, two accounts that did that would be immediately detected and suspended by Twitter. So you have to be a little bit more sophisticated in order to evade detection. But at that time, simple signals like those were sufficient.
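To make those network signals a bit more concrete, here is a minimal sketch in Python of how one might build such a diffusion network and extract the kinds of features Fil describes: degree and strength (weighted degree) skewness, extreme edge weights, and how fragmented the network is. The tweet field names are hypothetical placeholders, and this is an illustration of the general idea rather than the actual research code.

```python
# Minimal sketch of a diffusion network and a few statistical features, assuming
# hypothetical tweet fields "user", "retweeted_user", "mentioned_users".
import networkx as nx
from scipy.stats import skew

def build_diffusion_network(tweets):
    """tweets: list of dicts with the hypothetical fields noted above."""
    g = nx.DiGraph()
    for t in tweets:
        src = t["user"]
        targets = []
        if t.get("retweeted_user"):                    # edge for a retweet
            targets.append(t["retweeted_user"])
        targets.extend(t.get("mentioned_users", []))   # edges for mentions/replies
        for dst in targets:
            w = g.get_edge_data(src, dst, {}).get("weight", 0)
            g.add_edge(src, dst, weight=w + 1)         # accumulate interaction counts
    return g

def network_features(g):
    degrees = [d for _, d in g.degree()]
    strengths = [s for _, s in g.degree(weight="weight")]   # weighted degree
    weights = [d["weight"] for _, _, d in g.edges(data=True)]
    return {
        "degree_skewness": float(skew(degrees)),     # hub-dominated vs. flat
        "strength_skewness": float(skew(strengths)),
        "max_edge_weight": max(weights, default=0),  # e.g., two accounts retweeting
                                                     # each other thousands of times
        "n_components": nx.number_weakly_connected_components(g),
    }
```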
And these days, our bot detection algorithms use much more sophisticated methods that look at over 1,000 different features, which characterize not only the structure of the diffusion network but also characteristics of the accounts, of the profiles of their friends, and of the content that they generate. We do speech analysis on the content, we look at sentiment analysis, we look at temporal patterns: for example, not just how frequently they tweet, but also whether they do it in a bursty way, like humans, or in a regular way that looks more automated. So there are lots and lots of different signals that we try to pick up to try and infer whether there is some, you know, automation.
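One illustrative temporal feature in the spirit of what Fil mentions is a burstiness score over an account's inter-tweet times. The sketch below uses the standard burstiness coefficient of Goh and Barabási; it is an example of the general idea, not necessarily the exact feature used in Botometer.

```python
# Burstiness coefficient B = (sigma - mu) / (sigma + mu) over inter-post gaps.
# Values near -1 suggest regular, scheduler-like posting; near 0, random (Poisson)
# timing; positive values, human-like bursts.
import numpy as np

def burstiness(timestamps):
    """timestamps: POSIX times (seconds) of one account's posts."""
    gaps = np.diff(np.sort(np.asarray(timestamps, dtype=float)))
    if len(gaps) < 2:
        return 0.0
    mu, sigma = gaps.mean(), gaps.std()
    return 0.0 if (mu + sigma) == 0 else float((sigma - mu) / (sigma + mu))
```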
Noshir Contractor: Well, one of the things that you’ve described is that there’s a constant cat-and-mouse game between your ability to detect structural signatures and signals, and the efforts of those who are trying to evade your detection and will continue to improve their techniques. In the context of bots, are we at a stage where you find that bots are being created to create new bots?
Fil Menczer: (Laughs) That’s a very interesting question. Meta bots. We haven’t seen evidence of that. However, what we do find evidence of is sets of accounts that are all very, very similar to each other. So for example, they all have, you know, a pattern in their name, like a common first name, followed by an underscore, followed by a common last name, followed by a sequence of digits. Also, they might all have the same description. So a lot of times what we find suspicious is not the behavior of a single account. You have to look not at the pattern of an individual account, but at the pattern of a group of accounts. And then you might say, each one of these accounts looks perfectly reasonable; it looks like, maybe, you know, a person who posts about politics, maybe supporting this candidate or that candidate. But now if you look at 10 of them, and you see that they are tweeting at the same time, or they’re tweeting exactly the same sequences of hashtags, or they’re all retweeting one account that they’re trying to support and amplify, then that’s where you say, “Well, what is the probability that by chance you would have this kind of behavior by many independent accounts?” And if that probability is very low, you know, let’s say 10 to the minus four, then you say, “Okay, this is suspiciously similar behavior; probably there is coordination.”
Other examples are accounts that post the same images in sequence, or very similar images. So there are lots of ways that we’re looking at to identify this kind of coordination. And that coordination, in some sense, comes because under the surface there is probably an entity that is using software to automatically control all these different accounts. Even if the messages are coming from a person, like there is a human that says, you know, go red or vote blue, that human is now doing that on 100 accounts or 1,000 accounts. And so we can spot the pattern of similarity that gives it away, in some sense. And so those are ways in which, you know, the arms race that you were talking about is really happening. Not only are individual, you know, bots becoming more sophisticated, but also humans are mixing with software to create accounts that are more difficult to detect by looking at individual accounts. And of course, looking at large groups of accounts is computationally much, much more challenging, and so it requires a lot more work and sophistication and also more, you know, computational power. So it is tough to catch all of this abuse.
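A toy version of that coordination check: compute how similar each pair of accounts is in what they post and when, and flag pairs whose overlap would be very unlikely for independent users. The sketch below uses Jaccard similarity over (hashtag, hour) events as a simplified stand-in for the richer measures mentioned above (co-retweets, identical images, synchronized timing); the account names, fields, and threshold are illustrative.

```python
# Flag pairs of accounts whose behavior is "too similar to be independent."
from itertools import combinations

def coordination_pairs(events, threshold=0.8):
    """events: dict mapping account name -> set of (hashtag, hour_bucket) tuples."""
    flagged = []
    for a, b in combinations(events, 2):
        union = events[a] | events[b]
        if not union:
            continue
        sim = len(events[a] & events[b]) / len(union)   # Jaccard similarity
        if sim >= threshold:
            flagged.append((a, b, sim))
    return flagged

# Example: two accounts pushing the same hashtags in the same hours get flagged.
events = {
    "acct_alpha": {("#voteX", 10), ("#rigged", 10), ("#voteX", 11)},
    "acct_beta": {("#voteX", 10), ("#rigged", 10), ("#voteX", 11)},
    "acct_gamma": {("#weather", 9)},
}
print(coordination_pairs(events))
```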
Noshir Contractor: It almost seems like we need a Turing test to detect whether it’s a human being or a bot that you’re dealing with on social media.
Fil Menczer: That’s a very interesting observation. In fact, the key of the Turing test is that, you know, you were talking through an interface with either a human or a computer, and so the only thing that you could see was, you know, whatever they were saying and how. And in some sense, social media have made that easy for anyone, because all you see is the presence on social media; you have no way of knowing who’s behind that identity. Even platforms, you know, have access to some additional signals, like, you know, maybe phone numbers, IP addresses, but even platforms cannot really know for sure, sometimes, who’s behind an account. Typically, they can see whether they are violating some terms of service, and whether there is coordination, but nobody knows who’s behind it.
And so this means that there is plausible deniability. If this campaign is trying to promote a particular candidate, that candidate can claim, perhaps correctly, that they had nothing to do with it. And there is no way of proving who’s behind it. Very, very rarely, and only through extensive work by intelligence services, can we say, “Oh, for sure, there was that particular state actor behind this activity.” In the majority of cases, at best we can detect them and maybe alert people, or perhaps remove them if they are manipulating the public or abusing, you know, the rules, but we cannot really say, “Oh, that actor is behind it.” And so that actor is free to just start over, maybe tweak the algorithm and do it again. So yeah, it is as hard as the Turing test.
Noshir Contractor: One of the things that you are really well known for is being the director of the Observatory on Social Media, which is abbreviated to OSoMe and pronounced “awesome.” As someone who appreciates the creation of clever acronyms, I’m truly impressed with “OSoMe.” And I wanted to ask you a little bit about what went behind that. There are many people who, for decades, have talked about creating some kind of an observatory to study the web. You have done it, and you’ve done it successfully. And you now host several tools that you’ve created that go beyond your own research and actually get used not only by others within the research community but routinely by journalists, and so on and so forth. Talk a little bit about how you made this happen. And what were the lessons you’ve learned from it?
Fil Menczer: Yeah, so OSoMe. It is a cute acronym. I can’t take credit for it; I think it was our Associate Director for Technology who came up with this idea. The idea of using the word observatory in it actually came from web science, because the web science community, with you leading it, among others, was really sort of pushing this idea of collecting data on a large scale from the web, to get to a deeper understanding of some of the, you know, social impact and social phenomena of society, in some sense. Our behaviors: how are they affected by the information that we see? How is data about our online behavior telling us something about social action, about norms, about behaviors, and about vulnerabilities, which was the part that interested me in particular? And so that’s where the idea of the observatory came from; it came from the web science community.
So, as you say, in addition to doing lots of research, we also like to develop tools. And that’s because, you know, in web science, we think that it is important to go beyond just research and to actually do work that can impact society in a good way. And for me, you know, based on our skills, one of the things that we could do to help a little bit was to take the tools that we build for research and push them a little bit further, to the point that they can be used by a broader audience. So they’re not only useful to write a paper, although that’s important too, but they can also be used, like you said, by journalists, by investigative reporters, by civil society organizations, and also by common citizens to gain an appreciation for whether they are vulnerable, whether they are talking with another human being, whether they’re being manipulated.
So, for example, our most popular tool is called Botometer. And it started from our research on bot detection. And then we had a demo for a grant that we had, and as part of that demo, we thought, okay, let’s just, you know, put it on a little website so that people could use it. And then we realized, wow, this could be useful to other people. And so eventually, it became a public tool that is now called Botometer. And it is used a lot. We serve between 500,000 and 600,000 queries per day.
Noshir Contractor: Wow.
Fil Menczer: It’s very, very popular. Now, obviously, it’s not perfect, because, just like any machine learning algorithm, it makes mistakes. And there are a lot of research challenges that we’re pursuing. Some of my students are working in their dissertations on how to improve these tools, how to make them better able to recognize suspicious behaviors that are different from those in the training data, and also how to combine supervised learning and unsupervised learning, as I was saying earlier, to detect these coordinated manipulation campaigns that may not necessarily use automation. So there are a lot of research challenges in building these tools. But we try to also bring the results of those out into things that can help other people.
So Botometer is our best example. But we have several others.
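For readers who want to try Botometer programmatically, there is a botometer-python package (github.com/IUNetSci/botometer-python). The sketch below follows that package's documented usage pattern, but the credentials are placeholders and the exact parameters and response fields depend on the API version in use, so treat it as an approximate recipe rather than a definitive one.

```python
# Hedged sketch of querying Botometer via the botometer-python package.
# Credentials below are placeholders; consult the package docs for current details.
import botometer

rapidapi_key = "YOUR_RAPIDAPI_KEY"              # placeholder
twitter_app_auth = {
    "consumer_key": "YOUR_CONSUMER_KEY",        # placeholders
    "consumer_secret": "YOUR_CONSUMER_SECRET",
}

bom = botometer.Botometer(
    wait_on_ratelimit=True,
    rapidapi_key=rapidapi_key,
    **twitter_app_auth,
)

# Score a single account by handle; the result contains bot-likeness scores.
result = bom.check_account("@some_account")
print(result)
```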
Our latest tool that we’re kind of excited about is called BotSlayer. And it is a way to let people, even without technical skills, set up an infrastructure in the cloud where they can, with just a few clicks, track all of the tweets that match some query as they happen and look at them in real time on their screen, and also have all of the entities that we extract from these tweets. An entity can be a link to a news article, it could be a hashtag, a username, a phrase. And then for each of these entities, they can see: how many people are sharing it? How many unique people are sharing it? Are bots more likely to share this particular entity than something else? And also, is there coordination among these accounts? Are they all retweeting the same, you know, the same set of users, and so on? So, in some sense, we’re taking a very complex tool that we’ve been developing for our research and putting it at the fingertips of other researchers, journalists, and nonprofit organizations. We have hundreds of organizations around the world that are licensing this, and we hope soon to have the next version, which will be a little bit better and more robust, and make it available so that people can use it to study COVID-19, to study, you know, the current protests around the Black Lives Matter movement, and so on.
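The core bookkeeping behind that kind of dashboard can be sketched in a few lines: as tweets matching a query arrive, tally each entity along with how many tweets and how many unique accounts shared it. The field names below are hypothetical, and the real BotSlayer layers bot scores and coordination metrics on top of counts like these.

```python
# Simplified entity tracker in the spirit of a BotSlayer-style dashboard.
from collections import defaultdict

class EntityTracker:
    def __init__(self):
        self.tweet_counts = defaultdict(int)
        self.unique_users = defaultdict(set)

    def update(self, tweet):
        """tweet: dict with hypothetical fields 'user', 'hashtags', 'urls', 'mentions'."""
        user = tweet["user"]
        entities = (
            tweet.get("hashtags", []) + tweet.get("urls", []) + tweet.get("mentions", [])
        )
        for e in entities:
            self.tweet_counts[e] += 1
            self.unique_users[e].add(user)

    def most_amplified(self, n=10):
        # Entities shared in many tweets by few unique accounts are candidates
        # for inauthentic amplification and deserve a closer look.
        ranked = sorted(
            ((e, c, len(self.unique_users[e])) for e, c in self.tweet_counts.items()),
            key=lambda x: x[1] / max(x[2], 1),
            reverse=True,
        )
        return ranked[:n]
```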
So those are some of the things that we have out there; there are a few more. We really think that creating tools, you know, and making them openly and freely available to the community is an important part of the mission of the observatory.
Noshir Contractor: And your group is so good at it. And it’s really making a major contribution to academia, but also to society at large. And so thank you, again, for all the work that you’re doing on that front. One might get the impression, listening to this conversation, that bots are always evil, especially when you have apps called BotSlayer, for example. Now, is that true? And if not, then how can you distinguish between a good bot and a bad bot?
Fil Menczer: That’s a very good question. And absolutely, it’s not true that all bots are bad. You’re absolutely right. In fact, many, many bots are very useful, and we all use them, right? If you, for example, follow the feed of, I don’t know, The Wall Street Journal, or the New York Times, or your favorite news source, that’s a bot, right? It’s an account that automatically posts things that it can extract from an RSS feed or, you know, some other source. And then there are some bots that are funny and interesting and entertaining, and others that are kind of trivial. So there is a huge range of behaviors, but many of them are perfectly innocuous or even helpful. And our research is focusing on detecting automation, because when that automation is not revealed, then the bots can be used to manipulate. Now, if a bot says, “I’m a bot that tells the time every hour, like Big Ben,” there is nothing wrong with that; it’s not trying to mislead anyone, right? And if it says, “I am, you know, the New York Times, and I post the news every five minutes,” there’s nothing misleading about that. People know who they are following. But if a bot says, “I am Nosh Contractor, and I’m a professor at Northwestern, and here’s why you should really believe that if you want to, you know, be cured of COVID, you should drink Clorox.” I mean, that’s an inauthentic account that is impersonating a person and making it look like that person is saying something which, in this case, in this example, is false, and in fact dangerous, very dangerous. And this is done a lot.
So we hope that the focus on detection of automated accounts, and not only automated accounts but also coordinated accounts, like I was saying earlier, can be useful in spotting this kind of abuse. That’s the one that we are worried about. Obviously, we’re not worried about benign bots.
But sometimes the same technology that lets you detect one also lets you detect the other. So we train our machine learning algorithms with whatever bots we can find out there, either because they tell us themselves that they are automated, or because some human experts have looked at them carefully and concluded that they are automated, or perhaps because, you know, Twitter has taken them down, so we know that they were inauthentic. And so we use those datasets of labeled accounts to train our algorithms. And the hope is that, obviously, they’re not used to do anything against, you know, benign bots, but they could hopefully be used to alert people about the malicious ones.
Noshir Contractor: Thank you for clarifying that difference, because I think it’s an important distinction to recognize and appreciate that bots are not intrinsically nefarious and that we interact with them all the time.
Fil Menczer: But there is also, like, everything in between, right? There are accounts that for a while are doing good things, and then they are turned, either because, you know, they are hacked or because people let applications post on their behalf. And so you’ll have accounts that are partly automated, partly manually controlled. So it’s a very complex ecosystem where you find all sorts of complex behaviors, and it’s really hard to make sense of it.
Noshir Contractor: In closing, I want to ask you about something you have already referenced. We live right now in an age of reckoning when it comes to social upheaval, in addition to the pandemic. And I want to know if you could share some of your opinions about how things would have been different, for better or for worse, if we were experiencing this without the web?
Fil Menczer: Oh, without the web? Oh, my gosh, that’s a really, really interesting question, and tough as well. (Sighs). Well, I’m an optimist and a technologist. So I would say that overall, probably the better outweighs the worse. But certainly, we have plenty of examples of both, right? In some sense, the web has, you know, enabled some amazing advances, whether it’s in, you know, sustaining social movements for the advancement of humankind, or creating public awareness about huge planetary issues and challenges, with global warming and, you know, pandemics and racism, and creating awareness of these issues. Imagine the economic harm of the current pandemic, as bad as it is, and it is terrible. Imagine how much worse it would be if we didn’t have communication technology, so that we could still teach remotely, as badly as we do it. But at least, you know, it’s something that we could do remotely, and, you know, let alone conferences and teleconferencing and so on. Just the capability of being able to connect with each other, even at a distance, you know, the world would be much worse off if the web didn’t exist to enable those kinds of interactions.
So there is a lot of good in it. There is good in it in allowing, you know, minorities or groups that have less power to put their message out there. So in some sense, the democratization of information: that was this utopia of the early days of the web that we all bought into, certainly I did. To some extent, it has happened, and so the world is all the better for it. But for the same reasons why the web can be used for all these good things, it can also be used for all sorts of bad things. This has been true of every technology in history. And it’s also true of the web. And it’s true of social media. And it’s true even of Zoom, the latest thing that we now realize can be abused. So, of course, technology can be abused, in all the ways that we have been talking about.
And our research is really focused on those kinds of abuses and manipulation, whether it is spreading, you know, misinformation, or suppressing the vote, which for me is one of the huge challenges ahead of us and one of the reasons that motivates our desire to detect manipulation, because we see that that’s one of the main applications. You’re probably not going to change people’s opinion: if you were going to vote for one candidate, you’re not going to vote for the other candidate. But you might lead somebody to decide not to vote, if you convince them that that candidate is really not that much better than the other. And I think that this probably has happened in the past and will continue to happen. And then we haven’t even yet seen large-scale consequences of new technologies that are just now becoming mature, like deepfakes. And I think that those possibly could pose big challenges in the next few months. So, as for anything, there are good things and bad things, and certainly the world would be very, very different without the web, for better or for worse, I would say more for worse. Overall, I’m still an optimist, and I think that we can make things better.
Noshir Contractor: Well, thank you again, Fil, for talking with us and giving us some really interesting insights about the role of bots and the ways in which you and your team have helped contribute to detecting the nefarious bots and helping make the world a better place as a result, especially with not only your own research but also the tools that you’ve made available. So thank you again very much.
Fil Menczer: Thank you so much for having me.
Noshir Contractor: Untangling the Web is a production of the Web Science Trust. This episode was edited by Molly Lubbers. I am Noshir Contractor. You can find out more about our conversation today in the show notes. Thanks for listening.