
~ Archive for Twitter ~

Two rumors about the downing of a Russian warplane by Turkey


News of a Turkish airplane shooting down a Russian one over the Turkish-Syrian border has dominated the news and social media lately. We investigated the rumor within hours of its appearance (24 Nov. 2015), and you can see the results of the analysis here:

This was not the first time a rumor of this kind emerged. About a month and a half ago (10 Oct. 2015) an identical rumor had emerged. We had investigated that rumor too and you can see the results of our analysis here:

Russian jet downing rumors

As you can see, based on the crowd’s reaction to the rumors, TwitterTrails was able to determine that the October rumor was false while the November one was true. The false rumor did not spread much and drew many skeptical tweets questioning its validity. The true rumor, on the other hand, spread much more widely and, in terms of skepticism, was undisputed.

Our understanding of the way the “wisdom of the crowd” works is that, when unbiased, emotionally cool observers see a rumor that seems suspicious, they usually react in one of two ways: They either do not retweet it, reducing its spread, or they may respond questioning the validity of the rumor, resulting in higher skepticism.

This is something we see often in the stories we investigate on TwitterTrails.

When plotting the true and false rumors (after they have been verified through journalists’ work), the following image emerges:

It is not a 100% separation, but one can see that the false rumors (marked by red triangles) show low spread and high skepticism, while the true ones show high spread and low skepticism. The picture is of course muddled in the lower corner. A rumor that does not attract much attention has not had the opportunity to benefit from the “wisdom of the crowd”, and thus its validity cannot be determined by our system.


Note: This posting originally appeared on our TwitterTrails blog.

False rumors do not propagate like True ones


On Twitter, claims that receive higher skepticism and lower propagation scores are more likely to be false.
On the other hand, claims that receive lower skepticism and higher propagation scores are more likely to be true.

The above is a conjecture we wrote in a recent paper entitled Investigating Rumor Propagation with TwitterTrails (currently under review). Feel free to take a look if you want to know more details about our system, but we will write here some of its highlights.

As you may know if you have read our TwitterTrails blog before, we are developing a Web service that, starting from a tweet, a hashtag or a set of keywords related to a story propagating on Twitter, investigates the story and automatically answers some of the basic questions about it. If you are not familiar with it, you may want to take a look at some of the posts. Or, it can wait until you read this one.

Recently we deployed a site containing the growing collection of stories and rumors that we investigate. Its front end looks like this:


This is the “condensed view”, which allocates one line per story, 20 stories per page. There are over 120 stories collected at this point. Clicking on a title brings you to the investigation page with lots of details and visualizations about the story's propagation, its originator, how it burst, who supports it and who refutes it.

Note that on the right side of the condensed view we automatically compute two metrics:

  • The propagation level of a story. This is a logarithmic scale of the h-index of a tweet collection that currently has five levels: Extensive, High, Moderate, Low and Insignificant.
  • The skepticism level of a story. This is the ratio of tweets with negated propagation over tweets with no negated propagation. It has four levels: Undisputed, Hesitant, Dubious and Extremely doubtful.

The initial quote at the top of this post refers to these metrics.
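To make the two metrics concrete, here is a minimal sketch of how they could be computed. It is not the TwitterTrails implementation: the definition of the h-index over retweet counts, the logarithm base, and the skepticism cutoffs are all assumptions chosen for illustration; only the level names come from the description above.

```python
PROPAGATION_LEVELS = ["Insignificant", "Low", "Moderate", "High", "Extensive"]
SKEPTICISM_LEVELS = ["Undisputed", "Hesitant", "Dubious", "Extremely doubtful"]


def h_index(retweet_counts):
    """Largest h such that at least h tweets were retweeted >= h times each.

    Assumption: the h-index of a tweet collection is taken over retweet counts.
    """
    ranked = sorted(retweet_counts, reverse=True)
    h = 0
    for rank, count in enumerate(ranked, start=1):
        if count >= rank:
            h = rank
        else:
            break
    return h


def propagation_level(retweet_counts, base=4):
    """Map the h-index onto five levels using a logarithmic scale.

    The base (4) is a hypothetical choice, not the value used by the service.
    """
    h = h_index(retweet_counts)
    bucket = 0
    # Integer logarithm: count how many times h can be divided by the base.
    while h >= base:
        h //= base
        bucket += 1
    return PROPAGATION_LEVELS[min(bucket, len(PROPAGATION_LEVELS) - 1)]


def skepticism_ratio(negating, non_negating):
    """Ratio of tweets that negate the claim over tweets that do not."""
    if non_negating == 0:
        return float("inf") if negating else 0.0
    return negating / non_negating


def skepticism_level(ratio, cutoffs=(0.05, 0.25, 1.0)):
    """Map the ratio onto four levels; the cutoffs are hypothetical."""
    for level, cutoff in zip(SKEPTICISM_LEVELS, cutoffs):
        if ratio < cutoff:
            return level
    return SKEPTICISM_LEVELS[-1]
```

With cutoffs of this kind, a story with few doubting tweets and a large h-index would land in the "high propagation, low skepticism" corner that the conjecture associates with true claims.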

There is also a more detailed “main view” of TwitterTrails:


In the main view there are additional tools to select stories, based on time of collection, particular tags, levels of propagation and skepticism or keywords.

A few weeks ago we gave a presentation of TwitterTrails at the Computation and Journalism 2014 symposium at Columbia University in NYC. There is a video of our presentation that you can view if interested. In this presentation we noted that false rumors have a different pattern of propagation on Twitter than true rumors. Below is a graph that shows that difference.


The graph displays propagation levels vs skepticism levels, and the data points are colored depending on whether a rumor was true (blue), false (red) or something else (green) that cannot be categorized as true or false (e.g., reference to an event or a tweet collection based on a hashtag). The vast majority of the false rumors show insignificant to low propagation while at the same time their level of skepticism ranges from dubious to extremely doubtful.

This is remarkable, but it may not be too surprising. As we write in the paper, “Intuitively, this conjecture can be explained as an example of the power of crowd sourcing. Since ancient times philosophers have argued that people will not willingly do bad unless they are guided by irrational impulses, such as anger, fear, confusion or hatred. Therefore, the more people see some false information, the more likely it is that they will either raise an objection or simply decide not to repeat it further.

We make the conjecture specific for Twitter because it may not hold for every social network. In particular, we rely on the user interface for promoting an objection to the same level as the false claim. Twitter’s interface does that; both the claim and its negation will get the same amount of real estate in a user’s Twitter client. On the other hand, this is not true for Facebook, where a claim gets much greater exposure than a comment, while a comment may be hidden quickly due to follow-up comments. So, on Facebook most people may miss an objection to a claim.”

Take a look and tell us what you think!
We would also be happy to run an investigation for you, if interested.

(This is a copy of a blog post on the site.)


Looking beyond “Big Data” analysis to discover those who make a difference


In an earlier post (Trusting Anonymous Twitter Users) I wrote about how ordinary citizens in Mexico are using Twitter to stay informed about areas of immediate risk in their cities. In our social media research we saw some anonymous Twitter accounts begin to amass large numbers of followers as they gained repute as trusted sources in the dissemination of information related to shootings, explosions and areas of danger in some Mexican cities. If you are not familiar with this earlier blog post you may want to take a look at it since I am about to describe the rest of the story as we discovered and recently published it (The Rise and the Fall of a Citizen Reporter) at the WebScience 2013 conference.

The limits of “Big Data”

While the data and the narrative we presented in the paper “Hiding in Plain Sight: A Tale of Trust and Mistrust Inside a Community of Citizen Reporters” were very interesting, my co-authors and I had the feeling that we had not discovered the full story. For one thing, who really was @GodFather, the person behind the pseudonym we had created for the prominent account in our data? Was it a real person? What if it was merely one of the successful tweet-bots that researchers have launched in the past? Or, maybe it was some guy tweeting from Scotland posing as a young woman living in Mexico. Importantly, what about the accusation that she was not really interested in the well-being of her community, but was instead working for the Zetas, the criminal drug cartel that has been accused of some of the more heinous crimes of the Mexican drug war? Was there any truth to it?

Furthermore, there were several events that we had discovered but had not written about in the paper or the blog post. Looking at the aggregate data, my co-author Eni Mustafaraj and I discovered some important developments in the lives of these citizen reporters: Shortly before the accusation against @GodFather appeared, the city had seen a lot of violence and the authorities had failed to act quickly. @GodFather had tried to organize an informant movement of “eagles” (aguilas) on Twitter to report on the actions of the “hawks” (halcones). Hawks is the name given to low-level cartel associates working on street corners using cellphones to communicate with their bosses. These hawks are seen as important actors informing cartels about the movement of the Mexican Army and Navy so they can escape after an attack. Therefore, another distinct possibility was that @GodFather was accused because she was becoming annoying to a specific cartel.

Events in the timeline of @GodFather’s activity in our data indicated a reduction of her activity in early April, 2011. The activity of those mentioning and retweeting her also shows a similar pattern.


What was really happening? Was @GodFather one of the prominent citizen reporters informing the people about areas they should avoid on any given day? Was she a traitor working for the Zetas? Or perhaps a fake account? Why was she attacked, and why did she subsequently stop tweeting? Was she still tweeting from another account name or had she disappeared from the community?


Separating Retweets from Mentions

Another interesting data visualization separates retweets of @GodFather’s messages (in blue) from mentions of her name (in red). While in the first half of the graph her tweets (in green) seem to be echoed by the community, in the second half things change. At that time people are mainly talking about her, not echoing what she tweets.
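A separation of this kind can be sketched in a few lines. The snippet below is only an illustration, not the code behind the graph: it assumes tweets are available as (day, text) pairs and that retweets follow the textual "RT @user" or "via @user" convention; the function names are hypothetical.

```python
import re
from collections import Counter


def classify(tweet_text, account):
    """Return "retweet", "mention", or None for a tweet relative to an account.

    Assumption: a retweet is marked textually as "RT @account" or "via @account";
    any other occurrence of "@account" counts as a mention.
    """
    text = tweet_text.lower()
    # The lookahead keeps "@GodFather" from matching "@GodFather2".
    handle = re.escape("@" + account.lower()) + r"(?![a-z0-9_])"
    if not re.search(handle, text):
        return None
    if re.search(r"(?:^|\s)(?:rt|via)\s+" + handle, text):
        return "retweet"
    return "mention"


def daily_counts(tweets, account):
    """Count retweets and mentions per day from (day, text) pairs."""
    counts = {}
    for day, text in tweets:
        kind = classify(text, account)
        if kind is not None:
            counts.setdefault(day, Counter())[kind] += 1
    return counts
```

Plotting the two per-day series side by side gives the kind of picture described above: days on which retweets dominate, followed by days on which mentions dominate.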


Using a Berkman talk to make the connection

Though we wanted to find out more, our big data analysis was not helping much. We needed verification on the ground. But we could not contact @GodFather directly (we figured that, “Hi, I am a researcher from the US and would like to verify your identity…” would not take us far). We knew that her account had been compromised in the past, so she had every reason to hide her identity. Moreover, there existed several accounts with similar-sounding names, some of them clearly belonging to trolls attacking her, and we did not want to end up talking to them by accident!

How could we uncover the truth? The Berkman Center and a measure of good luck helped us make a breakthrough. In July, 2012, the Berkman Center asked my co-author Andrés Monroy-Hernández and me to give a talk (“Narcotweets: Reporting on the Mexican Drug War using Social Media”) on our earlier work. I knew that Berkman talks are advertised, attended and tweeted widely online. Though not very likely, it was possible that some “tuiteros” from Mexico would follow our talk live. If I told them what we had discovered, even using pseudonyms, members of the citizen reporter community would certainly recognize the real identities to which the pseudonyms referred, and perhaps they would be willing to talk to us.

Indeed, by the end of the talk (available for viewing), Mariel Garcia, a Berkman intern from Mexico who was tweeting about the talk, showed me a couple of tuiteros' accounts that had shown active interest in the talk. They were offering to answer any questions I might have. Of course I jumped on the opportunity; a few hours and many direct messages later I had established a connection with one of the prominent citizen reporters of the community.

From that citizen reporter Eni and I learned that we had missed an important point in the data analysis. One of the reasons that @GodFather had stopped tweeting was that her anonymity had been compromised in late July, 2011. One of the trolls that had been attacking her throughout the year revealed her real name, her street address, and her picture. Now that we knew where to look, we went back to the data and found the relevant tweets. Her pictures had been deleted on the Web but we were able to look through archives and locate several of them. Now that we knew a lot about Melissa Lotzer, the pseudonym used by the owner of the @GodFather account, all we needed was a way to contact her. We wanted to interview her about her motives and the threats she had received.

For reasons that will soon become apparent, we can reveal some details about the community we were studying. Our community of Twitter users is located in Monterrey, Mexico, and they have been using the tag #MTYfollow to stay informed about dangerous situations in their city. The prominent citizen reporter account, @trackMTY (aka @GodFather), was owned by a young woman who, like many such reporters, spent many hours a day informing and being informed by her sources. Melissa Lotzer (not her real name, but the one with which she is known in the community) became an active citizen reporter in March 2010, shortly after the #MTYfollow tag was adopted by the community. The drug war had hit the town of Comales, in the neighboring Tamaulipas region, where a drug cartel was reportedly holding some citizens hostage. Melissa and some of her friends formed a Facebook group, Mexico Nueva Revolucion, and sent an open letter to President Calderon begging him to send the Army to free Comales. Following the discussion on various blogs, we see that Melissa and the MNR group received credit for their initiative.

But not everyone in the community was happy with these developments; Melissa’s accounts were attacked several times by trolls. But by early 2011, her reputation in the community was strong enough that Twitter shut down some of the trolling accounts after the outcry of the community. Her later initiative to organize the aguilas movement, however, was not as successful. While more than 80 aguilasMTY accounts were created within 2 days (!) ready to support her cause, many of her old friends did not follow her in this movement. Renewed troll attacks and troll collaboration with an editor of the famous Blog del Narco proved to be too strong for Melissa’s reputation to withstand.


Some of the aguilasMTY accounts that were created within a couple of days in late March 2011 at the call of trackMTY

We connected with Melissa and established a trusted two-way connection. We were able to verify her identity not only from the pointers of other citizen reporters, but also because we could go back and verify her claims through our tweet corpus. You can read more about our interviews with her in the later sections of the paper The Rise and the Fall of a Citizen Reporter, and you can find our slides from the WebScience 2013 talk online.

Communities of Citizen Reporters.

In recognizing Melissa we recognize the thousands of other citizen reporters who spend long hours daily informing their fellow citizens about important and dangerous events unfolding in their cities and neighborhoods. Like most of the citizen reporters involved in supporting the communities of Monterrey, Saltillo, Reynosa, Veracruz and elsewhere, she is an idealist who wants to help others. Her experience has made her stronger despite the risk to which she has been exposed. Even after all her experience she would choose to do it all over again because, as she says:

I’m completely sure that trackmty was the reason why many people started using twitter. I receive comments daily by followers that are opening a twitter account to a family member just to follow me […] They tell me: please take care of my mom, she will be reading your tweets, she will not be reporting cases because she doesn’t know how to use a blackberry. Many similar cases like that happen every day.

Voice of Melissa Lotzer (@trackMTY)


PS. We also found out more about the identity of one of Melissa’s trolls: a young clerk at a local police station, inspired by WWF characters and with a hobby of posting photographs of prostitutes and gays on his blog.




Trusting Anonymous Twitter Users


Can we trust anonymous Twitter users? Before writing this paper with my colleagues Eni Mustafaraj, Samantha Finn and Andrés Monroy-Hernández, I would have thought that it was not possible. But this is the theme of the paper that Andrés is presenting this week at ICWSM 2012:

Hiding in Plain Sight: A Tale of Trust and Mistrust inside a Community of Citizen Reporters

Below is a brief description of our findings. (It may look a bit impersonal because it is extracted from the contents of a poster we created, but you will get the idea.)

The contributions of this paper can be described as follows:

  1. To the best of our knowledge, this paper presents the first analysis of the practices of a community of Twitter citizen reporters in a life-threatening environment over an extended period of time (10 months).
  2. We discover that in this community, anonymity and trustworthiness coexist. Because these citizens live in a city troubled by the narco-wars that have plagued Mexico since 2006, it is a great example of a community where anonymity of active participants is crucial, while lack of anonymity may be fatal.
  3. We describe a series of network and content based features that allow us to understand the nature of this community, as well as discover conflicts or changes in behavior.


The large volume of user-generated content on the Social Web puts a high burden on the participants to evaluate the accuracy and quality of content.
We usually rely on known reputed news sources (NPR, NYT, BBC, Der Spiegel, etc.) to evaluate them. However, not every country has a free press or is willing or able to allow the international press to move freely. In some countries, like Mexico, journalists have been killed by organized crime or put under pressure by the authorities to stop reporting on certain events.

In the era of the social Web, more citizens are reporting on newsworthy issues, gaining a reputation as citizen reporters.
However, not everywhere in the world is there a right to and protection of free speech. In countries where the traditional media cannot report the truth, anonymity becomes a necessity for citizens who want to exercise their right of free-speech in the service of their community.

Is it possible for anonymous individuals to become influential and gain the trust of a community? Here, we discuss the case of a community of citizen reporters that use Twitter to communicate, located in a Mexican city plagued by the drug cartels fighting for control of territory.

Our analysis shows that the most influential individuals inside the community were anonymous accounts. Neither the Mexican authorities, nor the drug cartels were happy about the real-time citizen reporting of crime or anti-crime operations in an open social network such as Twitter, and we discovered external pressures to this community and its influential players to change their reporting behavior.


When we read news, we usually choose our information sources based on the reputation of the media organization. We trust the news organizations, therefore, we expect that their reporting is credible, though in the past there have been breaches of such trust, and all media organizations have an embedded bias that affects what they choose to report.

Social media platforms specializing in organizing humanitarian response to disasters, such as Ushahidi, rely on people on the ground to report on situations that need immediate attention. Anyone can be a reporter.

However, this poses a new problem: how do we assess the credibility of citizen reporting?
Citizen reporting lacks the inherent structures that help us evaluate credibility as we do with traditional media reporting. But sometimes, citizen reporting might be the only source of information we might have.
How can we use technology to help us verify the credibility of such reports?


To address this question we look at a particular community of citizen reporters gathered around Twitter accounts in a Mexican city plagued by drug-related violence.

Twitter has a unique feature that facilitates on-the-fly creation of communities: the hyperlinked hashtags. While previous research has shown that the majority of Twitter hashtags have a very short half-life span (Romero, Meeder, and Kleinberg 2011), in this paper we analyze the practices of a community of citizens that have been using the same hashtag since March 2010 to report events of danger happening in their city.

We refer to the community with the obfuscated hashtag #ABC_city, which is a substitute for the hashtag present in the tweets of our corpus. We will also substitute the exact text of important tweets with a translation from Spanish to English, so that searching online or with the Twitter API will not lead to unique results.


Through research we discovered the birth of the community defined by the hashtag #ABC_city: The following tweet, mentioning #ABC_city for the first time, was the inaugural one, on March 19, 2010, by a not-particularly-active member:

#XYZ_city #ABC I propose #ABC_city to inform about news and important events in our city.

Then, this user reused the new hashtag many times in the following days together with #old_ABC hashtag and others, in order to spread its use:

@userA shootings are being reported in [address] (good source) #ABC #old_ABC #ABC_city #XYZ_city

On May 11, 2010, the same user who created the hashtag tweeted the following:

@Spammer101 Stop spamming #ABC_city. It’s only about important events that might affect our society.

Between May and November 2010 the usage of the hashtag is sparse, with the old hashtags being used more often. An increase in the adoption of #ABC_city starts on November 4th, only a week before the starting period of the #ABC_city dataset.


We used a basic dataset and a supplemental collection informed by our initial set of data.

The original dataset consists of 258,734 tweets written by 29,671 unique Twitter users, covering 286 days in the time interval November 2010 – August 2011.
In November 2010 we provided a set of keywords related to Mexico events to the archival service. The collection was later divided into separate datasets according to the presence of certain hashtags.

To supplement our limited original dataset, we performed a series of additional data collections in September, 2011. In particular, we collected all social relations for the users in the current dataset, as well as their account information.
We collected all tweets for accounts created since 2009 with fewer than 3,200 tweets, in order to discover the history of the (anonymized) hashtag #ABC_city that defines the community we are studying.
We also made use of the dataset described in (O’Connor et al. 2010) to locate tweets archived in 2009.


While we would prefer to give further details on the collected data and use them freely in this paper, on ethical grounds, we will protect this community under anonymity, due to potential risk that our research can pose now or in the future. To exemplify the seriousness of the situation, we provide one example out of the many documented in the press of what the lack of anonymity can lead to.
On September 27, 2011, the Mexican authorities found the decapitated body of a woman in the town of Nuevo Laredo (near the Texas border) with a message apparently left by her executioners, which starts this way:

“OK, Nuevo Laredo en Vivo and social networking sites, I’m The Laredo Girl, and I’m here because of my reports, and yours, …”

Laredo Girl was the pseudonym used by the woman to participate in a local social network that enabled citizens to report criminal activities.

THE ACCOUNT @GodFather                          

Followee Relations

Out of 29,671 unique users in the corpus, we were able to collect followee information for 24,973 accounts that were active and public in September 2011 (84% of all users in the corpus). There are more than 8.5 million followee links, with an average of 336 followees per user and a median of 162 followees. The total number of unique followees is almost 1.7 million.

Ranking the followees based on the number of relations inside this #ABC_city community serves as an indicator of the attention that this community as a whole pays to other Twitter users. We inspected the top 100 accounts to understand the nature of their popularity. The top account was Mexico’s president, Felipe Calderon, followed by the TV news program of the city, and an anonymous citizen reporter to whom we will refer as @GodFather. Four journalists, the city’s newspaper, a famous Mexican poet, and a comic’s character made up the rest of top ten. Almost half of the accounts in the top 100 are entertainers of Mexican fame, with only a few international superstars such as Shakira or Lady Gaga in the mix. This statistic confirms the widespread perception that a large part of the Twitter appeal derives from its use by celebrities, though it also indicates that each community is interested in its own celebrities. 25 of top 100 most followed accounts belong to local and national journalists and media organizations, compared to 10 for politicians at the state and federal level. In fact, the governor of the state in which ABC city is located (Mexico is a federation of 31 states) ranks at the 45th position in the followees list, one place behind the account of Barack Obama.

To understand the appeal to the community of the top 100 ranked accounts, we inspected their Twitter profiles. The top-ranked anonymous account, @GodFather, has 9,079 followers inside the community, or 36% of all active members. This amounts to 16% of his total audience of 57,127 followers. @GodFather is an anonymous citizen who has written the largest number of tweets in the corpus (6,675), which make up 25% of all his statuses (26,340).

A mutual-follow relation in Twitter (the friendship) is significant because it enables the involved accounts to send direct messages to one another. Direct messages offer some privacy to users, though if an account is hacked messages are compromised (unless a user has the habit of deleting them). Communication through direct messages is not visible to researchers or the public and cannot be quantified. However, it is possible to quantify the extent to which such strong ties exist inside the community by discovering mutual links in the sets of followers and followees. As shown below, on average, 40% of user relations are reciprocated.

The normal-like histogram of reciprocal link distribution of friendship relations (mutual links) in the network of the #ABC_city corpus.
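As a rough illustration of how reciprocity can be quantified from collected followee lists, consider the sketch below. It assumes the data is held as a mapping from each user to the set of accounts that user follows, and it defines a user's reciprocity as the share of their followee links that are followed back; both the data layout and this exact definition are assumptions, not necessarily those used for the histogram.

```python
def reciprocity(followees):
    """Per-user fraction of followee links that are reciprocated.

    followees: dict mapping a user to the set of accounts that user follows.
    Users who follow nobody are skipped, since the fraction is undefined.
    """
    result = {}
    for user, outgoing in followees.items():
        if not outgoing:
            continue
        # A link user -> other is mutual when other also follows user.
        mutual = sum(1 for other in outgoing if user in followees.get(other, set()))
        result[user] = mutual / len(outgoing)
    return result


def mean_reciprocity(followees):
    """Average of the per-user fractions (0.0 if no user follows anyone)."""
    values = list(reciprocity(followees).values())
    return sum(values) / len(values) if values else 0.0
```

A histogram of the per-user values returned by `reciprocity` is the kind of distribution the figure summarizes.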

The next figure shows the graph of all members with more than 75 friendship links, which only reinforces the conclusion that this is a tightly connected community of users. (We limited the number of nodes for computational reasons.)

The graph of all members with more than 75 friendship links. Coloring is produced automatically by the Gephi modularity algorithm that finds communities in a network using the Louvain algorithm.


Past research has shown that retweeting is indicative of agreement between the original sender and the retweeter (e.g., (Metaxas and Mustafaraj 2010; Conover et al. 2011)). Over time, retweets are effectively providing information about a community of social media users that are in agreement on specific issues. Otherwise, the chance of a community member retweeting a message of an opposing political community is under 5%.

Since retweets involve a relation between two users, the original sender and the retweeting user, we can create a network of such relations for all retweets in the corpus. This retweet graph is shown below.

The retweet graph reveals a large component that is actively involved in retweeting, with smaller star-like components at the fringes. Closer examination reveals that the stars at the fringes were occasional retweeters of famous users (e.g., entertainers) and could easily be identified and excluded from our analysis. The nodes have been drawn in size relative to their in-degree, that is to the degree that their messages had been retweeted, revealing a small number of accounts that rose to prominence in the community.
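The construction described above can be sketched without any graph library. The snippet assumes retweets are available as (retweeter, original author) pairs and measures a node's in-degree as the number of distinct users who retweeted it; that choice, and the helper names, are assumptions made for illustration.

```python
from collections import defaultdict


def build_retweet_graph(retweets):
    """Map each original author to the set of users who retweeted them."""
    incoming = defaultdict(set)
    for retweeter, author in retweets:
        if retweeter != author:  # ignore self-retweets
            incoming[author].add(retweeter)
    return incoming


def in_degrees(incoming):
    """In-degree of each author: number of distinct retweeters."""
    return {author: len(retweeters) for author, retweeters in incoming.items()}


def components(incoming):
    """Connected components (ignoring edge direction), largest first."""
    adjacency = defaultdict(set)
    for author, retweeters in incoming.items():
        for retweeter in retweeters:
            adjacency[author].add(retweeter)
            adjacency[retweeter].add(author)
    seen, found = set(), []
    for start in adjacency:
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            component.add(node)
            stack.extend(adjacency[node] - seen)
        found.append(component)
    return sorted(found, key=len, reverse=True)
```

In a layout of this kind, the first component returned corresponds to the large, actively retweeting core, while small components centered on a single high in-degree node correspond to the star-like shapes at the fringes.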

Zooming in inside this graph reveals the most influential nodes in the community, which we identified as the anonymous citizen reporters. The biggest node belongs to @GodFather.

A closer look at the core of the community reveals 13 nodes that have a larger share of their messages retweeted. The spatial proximity of these nodes determined by a force-directed algorithm indicates that they were also retweeting each other (as opposed to the nodes in the periphery of the retweet graph). The biggest node belongs to @GodFather.


Tweeting activity of three groups of users with different tweeting patterns overlaid with the frequency of appearance for the word “balacera” (shooting). All three groups have an increase in activity, matching the ups of the balacera distribution. There is only one discrepancy, in April-May 2011, related to an event explained in the next section.


Daily distribution of tweets for the anonymous account @GodFather and its daily mentions in tweets by other members of the community. In April 2011, he was accused by newly created anonymous accounts of working for a criminal organization. After that event, he decreased his involvement in the community and at the end of July stopped tweeting altogether.


At a time when social networking platforms such as Facebook and Google+ are pushing to force users to assume their real-life identities on the Web, we think that it is important to provide examples of communities of citizens for which maintaining their anonymity inside such networks is essential. But being anonymous makes one more susceptible to denigration attacks from other anonymous accounts, leaving the other members of the community with the dilemma of whom to trust.

Inside a community, even anonymous individuals can establish recognizable identities that they can sustain over time. Such anonymous individuals can become trustworthy if their efforts to serve the interests of the community remain constant over time.

Replication, Verification and Availability for Big Data



The next step in the evolution of Social Computing Research: Formal acceptance of credit worthiness by the community of Replication, Verification, and Availability of Big Data.

In his response to my posting on Research Replication in Social Computing, Dr. Bernardo Huberman pointed to his letter to Nature on a related issue: verification of results. Here I expand on it to include a proposal that I have heard others mention recently.

I totally agree, of course, that “Science is unique in that peer review, publication and replication are essential to its progress.” This is what I also propose above. And he focuses on the need for having accessible data so that people can verify claims. For those who may not have access to his letter, I reproduce the central paragraph here:

“More importantly, we need to recognize that these results will only be meaningful if they are universal, in the sense that many other data sets reveal the same behavior. This actually uncovers a deeper problem. If another set of data does not validate results obtained with private data, how do we know if it is because they are not universal or the authors made a mistake? Moreover, as many practitioners of social network research are starting to discover, many of the results are becoming part of a “cabinet de curiosites” devoid of much generality and hard to falsify.”

Let me add something further that I heard mentioned by Noshir Contractor and Steffen Staab at the WebScience Track during the WWW2012 conference, and that I think will complement the overall proposal: People who make their data available to others should get credit for that. After all, in Science a lot of time is spent collecting and cleaning data, and those who do that and make their data available to other researchers for verification, meta-analyses and the study of other research questions should be rewarded for their contributions.

I believe the time is right to introduce formal credit for replication of results on comparable data sets, verification on the same data set, and for making data accessible to others for further analysis and meta-analysis. I plan to use much of my group’s research time on these issues this summer and publish our findings afterwards.

Research Replication in Social Computing


On the need for Research Replication in Social Computing

A call for replicating Social Computing results as a necessary step in the maturity of the field.

The field of Social Computing is a rather new one but it is one of the more active in Computer Science in the last few years. Many new conferences have been created to host the research efforts of computer scientists, social scientists, physicists, statisticians and many other researchers and practitioners. The excitement generated by the opportunities that opened through the relative ease of retrieving large amounts of data has led many young researchers to dive in and uncover the modes of social interactions.

At the risk of oversimplifying, one could say that the research papers we produce follow the general pattern of observational sciences:

  • We collect data that arguably can capture the phenomenon we want to study,
  • we may apply some sophisticated statistical tools, test a hypothesis applying machine learning tools, and
  • analyze the results.

Our conclusions sometimes do not just state the phenomenon we just observed, but they expand from the specific findings to claim possible projections that go beyond the observed.

One of the reasons that this approach seems familiar is that it resembles the one used in Experimental Computer Science. There, we measure the characteristics of the systems or algorithms we have built, and study their performance experimentally when exact analysis is not easy or even possible. This is a tried and true approach since, in the systems we build, we take great effort to avoid any behavior that is outside of the specifications. In the artificial worlds we create, we try to control all of their aspects, and this process has produced amazing technological results.

On the other hand, this approach may be inappropriate or incomplete compared to those used in Experimental Natural Sciences. Physicists, Biologists and Chemists would start with this approach to make initial sense of the data they are collecting, but this is just the beginning of the process. Replication of their research is normally needed to verify the validity of the original experiments. Sometimes the research results would not be validated; nevertheless, even in this case the replication process would provide insight into the workings of natural phenomena. Nature mostly repeats its phenomena consistently, and one may have to account for all the parameters that affect them. Sometimes this is not easy, and replication offers the best guarantee that the research findings are valid.

As we mentioned, Social Computing is now being done by researchers coming from many disciplines, but it is different from both Computer Sciences and Natural Sciences. Though it has the potential of also becoming an experimental science, so far it is mostly an observational Science. This, it turns out, is a very important distinction. Society is different than Nature in several important ways. Its basic building blocks are people, not atoms, or chemical compounds or molecules. The complexity of their interactions is not easily tractable, to the degree that one may not be able to even enumerate all the factors that affect them. Moreover, people (and even social “bots” released in Social Media) do not behave consistently over time and under different conditions.

The closest relative to Social Computing is not Computer Science, we would argue, but Medical Science, where Natural Sciences phenomena are influenced by Social conditions. In both Medical and Natural Sciences, replication of results is considered an irreplaceable component of scientific progress. Any lab can make discoveries, but these discoveries are not considered valid until they have been independently replicated by other labs. Not surprisingly, replicating research findings is considered a critical publishing action, and researchers are getting credit for doing just that.

In Computer Science, replication has not been considered important and worth any credit, unless it reveals crucial flaws in the original research. It is unlikely, for example, that replicating Dijkstra’s Shortest Paths algorithm would contribute to the development of our discipline, and so it makes sense not to give credit to its replication. On the other hand, the inability to replicate Hopcroft and Tarjan’s tri-connected component algorithm was a significant development, and Gutwenger and Mutzel, who discovered the problem and corrected it, did receive credit for it.

We acknowledge the need for replicating Social Computing research results, as a way of establishing the patterns that Social Media data are revealing under all meaningful conditions. We believe that such research replication will give credibility to the field. Failing to do that, we may end up collecting a large number of conflicting results that could discredit the whole field.


Political retweets do mean endorsement


There are not too many results that Social Media research has discovered in the last few years that are as accurately reproducible as the title of this blog: Political RTs do mean endorsement. I have written a few things about the related research in my “Three Social Theorems” blog post a few weeks ago (this being the first theorem), and it was the theme of my talk during the “Truthiness in Digital Media” symposium at the Berkman Center.

This does not mean that every political RT is an endorsement. (If that were the case I could break it with a retweet right now). But it means that, when people retweet, that is, when they broadcast unedited to their own followers a tweet they received, most of the time they have read it, and thought that it is worth spreading. They practically endorse it.

If we realize the above, then we should not be very surprised by the spreading of the false news about Gov. Nikki R. Haley that the New York Times is reporting today in “A Lie Races Across Twitter Before the Truth Can Boot Up”. While the reporter Jeremy Peters is impressed by the speed of the false news, detailing the path that it took (very good journalistic work, indeed), his most important point, I believe, is the one he makes in the second paragraph:

[…] it left news organizations facing a new round of questions about accountability and standards in the fast and loose “retweets do not imply endorsement” ethos of today’s political journalism.

Interestingly, it is mainly a few journalists that feel the need to explicitly mention in their personal profile description a disclaimer to the effect “My Retweets do not mean agreement”. In fact, out of more than 83,000 profile descriptions that my colleague Prof. Eni Mustafaraj and I have in our database of election-related tweets, we found only 53 that mention this disclaimer. 31 of them belong to journalists.
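To put those counts in proportion, here is a small arithmetic sketch. It uses only the three numbers quoted above; since the database holds "more than 83,000" profiles, the first percentage should be read as an upper bound, and the variable names are mine.

```python
# Counts quoted in the paragraph above.
profiles_in_database = 83_000   # "more than 83,000", so treat this as a lower bound
profiles_with_disclaimer = 53
disclaimers_by_journalists = 31

# Share of all profiles that carry the disclaimer (an upper bound, see above).
disclaimer_share = profiles_with_disclaimer / profiles_in_database * 100

# Share of the disclaimers that belong to journalists.
journalist_share = disclaimers_by_journalists / profiles_with_disclaimer * 100

print(f"Profiles with the disclaimer: at most {disclaimer_share:.3f}%")
print(f"Disclaimers written by journalists: {journalist_share:.1f}%")
```

In other words, the disclaimer is extremely rare overall, and more than half of the few people who use it are journalists.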

Should we expect more such lies to race across social media in the remaining months before the elections? Probably yes.
Should we expect journalists to be much more cautious the next time they retweet something from a source they do not trust? Certainly yes.

But the good news is that lies, in general, have shorter, more questioned lives in Social Media. See Social Theorem 2 for the supporting research on this one.
Does it mean that no lie will ever be spread? Of course not.
But it means that most of the time they will be caught, especially as more people are aware of the RT Theorem and care about the truth.

Do all people care about the truth? Of course not. Take for example, Mr. Smith, the originator of the false blog post.



Three Social Theorems


Dear Readers,

Below are my annotated notes from a talk I gave at Berkman’s Truthiness in Digital Media Symposium a few weeks ago. I introduced the concept of Social Theorems, as a way of formulating the findings of the research that has been happening in the last few years in the study of Social Media. It is my impression that, while we publish a lot of papers, write a lot of blogs and the journalists report often on this work, we have trouble communicating our findings clearly. I believe that we need both to clarify our findings (thus the Social Theorems), and to repeat experiments so that we know we have enough evidence on what we really find. I am working on a longer version of this blog and your feedback is welcome!

P. Takis Metaxas

With the development of the Social Web and the availability of data that are produced by humans, Scientists and Mathematicians have become interested in studying issues traditionally interesting mainly to Social Scientists.

What we have also discovered is that Society is very different than Nature.

What do I mean by that? Natural phenomena are amenable to understanding using the scientific method and mathematical tools because they can be reproduced consistently every time. In the so-called STEM disciplines, we discover natural laws and mathematical theorems and keep building on our understanding of Nature. We can create hypotheses, design experiments and study their results, with the expectation that, when we repeat the experiments, the results will be substantially the same.

But when it comes to Social phenomena, we are far less clear about what tools and methods to use. We certainly use the ones we have used in Science, but they do not seem to produce the same concrete understanding that we enjoy with Nature. Humans may not always behave in the same, predictable ways and thus our experiments may not be easily reproducible.

What have we learned so far about Social phenomena from studying the data we collect in the Social Web? Below are three Social Theorems I have encountered in the research areas I am studying. I call them “Social Theorems” because, unlike mathematical Theorems, they are not expected to apply consistently in every situation; they apply most of the time and when enough attention has been paid by enough people. Proving Social Theorems involves providing enough evidence of their validity, along with a description of their exceptions (situations in which they do not apply). It is also important to have a theory, an explanation, of why they are true. Disproving them involves showing that a significant number of counter-examples exists. It is not enough to have a single counter-example to disprove a Social Theorem, as people are able to create one just for fun. One has to show that at least a significant minority of all cases related to a Social Theorem are counter-examples.

SoThm 1. Re-tweets (unedited) about political issues indicate agreement, reveal communities of likely minded people.

SoThm 2. Given enough time and people’s attention, lies have short questioned lives.

SoThm 3. People with open minds and critical thinking abilities are better at figuring out truth than those without. (Technology can help in the process.)

So, what evidence do we have so far about the validity of these Social Theorems? Since this is simply a blog, I will try to outline the evidence with a couple of examples. I am currently working on a longer version of this blog, and your feedback is greatly appreciated.

Evidence for SoThm1.

There are a couple of papers that present evidence that “Re-tweets (unedited) about political issues indicate agreement, reveal communities of likely minded people.” The first is the From Obscurity to Prominence in Minutes: Political Speech and Real-Time Search paper that I co-authored with Eni Mustafaraj and presented at the WebScience 2010 conference. When we looked at the most active 200 Twitter users who were tweeting about the 2010 MA Special Senatorial election (those who sent at least 100 tweets in the week before the elections), we found that their re-tweets were revealing their political affiliations. First, we completely characterized them into liberals and conservatives based on their profiles and their tweets. Then we looked at how they were retweeting. In fact, 99% of the conservatives were only retweeting other conservatives’ messages and 96% of the liberals those of other liberals.

Then we looked at the retweeting patterns of the 1000 most active accounts (those who sent at least 30 tweets in the week before the elections) and we discovered the graph below:

As you may have guessed, the liberals and conservatives are re-tweeting mostly the messages of their own folk. And it makes sense: The act of re-tweeting has the effect of spreading a message to your own followers. If a liberal or conservative re-tweets (=repeats a message without modification), he/she wants this message to spread. In a politically charged climate, e.g., before some important elections, he/she will not be willing to spread a message that he/she disagrees with.
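For readers who want to try this on their own data, here is a minimal sketch of one way to quantify the pattern: the share of each group's retweets that repeat a message from the same group. The account names, labels and retweet pairs are invented for illustration, and this per-retweet share is a simplification, not necessarily the exact measure used in the paper.

```python
from collections import defaultdict

# Hypothetical hand-labeled accounts (invented for illustration).
labels = {
    "ann": "liberal",
    "bob": "liberal",
    "cat": "conservative",
    "dan": "conservative",
}

# Hypothetical (retweeter, original_author) pairs, also invented.
retweets = [
    ("ann", "bob"), ("ann", "bob"), ("bob", "ann"), ("bob", "cat"),
    ("cat", "dan"), ("dan", "cat"), ("dan", "cat"),
]

def within_group_share(labels, retweets):
    """Per group, the fraction of retweets whose original author is in the retweeter's group."""
    same = defaultdict(int)
    total = defaultdict(int)
    for retweeter, author in retweets:
        group = labels[retweeter]
        total[group] += 1
        if labels[author] == group:
            same[group] += 1
    return {group: same[group] / total[group] for group in total}

shares = within_group_share(labels, retweets)
for group, share in shares.items():
    print(f"{group}: {share:.0%} of retweets stay within the group")
```

A share close to 100% for both groups is what a polarized retweeting pattern would look like under this simplified measure.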

The second piece of evidence comes from the paper “Political Polarization on Twitter” by Conover et al. presented at the 2011 ICWSM conference. The retweeting pattern, shown below, also indicates a highly polarized environment.

In both cases, the pattern of user behavior does not apply 100% of the time, but it does apply most of the time. That is what makes this a Social Theorem.

Evidence for SoThm2.

The “Given enough time and people’s attention, lies have short questioned lives” Social Theorem describes a more interesting phenomenon because people tend to worry that lies somehow are much more powerful than truths. This worry stems mostly from our wish that no lie ever wins out, though we each know several lies that have survived. (For example, one could claim that there are several major religions in existence today that are propagating major lies.)

In our networked world, things are better, the evidence indicates. The next table comes from the “Twitter Under Crisis: Can we trust what we RT?” paper by Mendoza et al., presented at the SOMA2010 Meeting. The authors examine some of the false and true rumors circulated after the Chilean earthquake in 2010. What they found is that rumors about confirmed truths had very few “denies” and were not questioned much during their propagation. On the other hand, those about confirmed false rumors were both questioned a lot and were denied much more often (see the last two columns enclosed in red rectangles). Why does this make sense? Large crowds are not easily fooled, as the research on crowdsourcing has indicated.
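To make the comparison concrete, here is a rough sketch of how one could compute the share of skeptical tweets (denials plus questions) for each rumor. The counts are invented for illustration and are NOT the figures from the Mendoza et al. table; the category names and the `skepticism` helper are my own assumptions.

```python
# Hypothetical tweet counts per rumor, invented for illustration only.
rumors = {
    "confirmed_true_example": {"affirms": 90, "denies": 2, "questions": 8},
    "confirmed_false_example": {"affirms": 40, "denies": 35, "questions": 25},
}

def skepticism(counts):
    """Fraction of a rumor's classified tweets that deny or question it."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return (counts["denies"] + counts["questions"]) / total

scores = {name: skepticism(counts) for name, counts in rumors.items()}
for name, score in scores.items():
    print(f"{name}: {score:.0%} of tweets are skeptical")
```

With real data, a false rumor would be expected to show a noticeably higher skeptical share than a true one, which is the pattern the table reports.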

Again, these findings do not claim that no lies will ever propagate, but that they will be confronted, questioned, and denied by others as they propagate. By comparison, truths will have a very different experience in their propagation.

The next piece of evidence comes from the London Riots in August 2011. At the time, members of the UK government accused Twitter of spreading rumors and suggested it should be restricted in crises like these. The team that collected and studied the rumor propagation on Twitter found that this was not the case: False rumors were, again, short-lived and often questioned during the riots. In a great interactive tool, the Guardian shows in detail the propagation of 7 such false rumors. I am reproducing below an image of one of them; the interested reader should take a closer look at the Guardian link.



During the Truthiness symposium, another case was presented, one that supposedly shows the flip side of this social theorem: That “misinformation has longer life, further spread on Twitter than accompanying corrections”. I copy the graph that supposedly shows that, for reference.

Does this mean that the Social Theorem is wrong? Recall that a Social Theorem cannot be refuted by a single counter-example, but only by demonstrating that at least a significant minority of counter-examples exists.

Further, the above example may not be as bad as it looks initially. First, note that the graph shows that the false spreading had a short life: it did not last more than a few hours. Moreover, note that the false rumor’s spreading was curbed as soon as the correction came out (see the red vertical line just before 7:30PM). This indicates that the correction probably had a significant effect in curbing the false information, which might otherwise have continued to spread at the same rate as it did before.


Evidence for SoThm3.

I must admit that “People with open minds and critical thinking abilities are better at figuring out truth than those without” is a Social Theorem that I would like to be correct, and I believe it to be correct, but I am not sure exactly how to measure it. It makes sense: After all, our educational system since the Enlightenment has been based on it. But how exactly do you create controlled experiments to prove or disprove it?

Here, Dear Reader, I ask for your suggestions.



Misinformation and Propaganda in Cyberspace


Dear Readers,

The following is a blog that I wrote recently for a conference on “Truthiness in Digital Media” that is organized by the Berkman Center in March. It summarizes some of the research findings that have shaped my approach to the serious challenges that misinformation propagation poses in Cyberspace.

Do you have examples of misinformation or propaganda that you have seen on the Web or on Social Media? I would love to hear from you.

Takis Metaxas


Misinformation and Propaganda in Cyberspace

Since the early days of the discipline, Computer Scientists have always been interested in developing environments that exhibit well-understood and predictable behavior. If a computer system were to behave unpredictably, then we would look into the specifications and we would, in theory, be able to detect what went wrong, fix it, and move on. To this end, the World Wide Web, created by Tim Berners-Lee, was not expected to evolve into a system with unpredictable behavior. After all, the creation of WWW was enabled by three simple ideas: the introduction of the URL, a globally addressable system of files, the HTTP, a very simple communication protocol that allowed a computer to request and receive a file from another computer, and the HTML, a document-description language to simplify the development of documents that are easily readable by non-experts. Why, then, in a few years did we start to see the development of technical papers that included terms such as “propaganda” and “trust“?

Soon after its creation the Web began to grow exponentially because anyone could add to it. Anyone could be an author, without any guarantee of quality. The exponential growth of the Web necessitated the development of search engines (SEs) that gave us the opportunity to locate information fast. They grew so successful that they became the main providers of answers to any question one may have. It does not matter that several million documents may all contain the keywords we include in our query: a good search engine will give us all the important ones in its top-10 results. We have developed a deep trust in these search results because we have so often found them to be valuable, or, when they are not, we might not notice it.

As SEs became popular and successful, Web spammers appeared. These are entities (people, organizations, businesses) who realized that they could exploit the trust that Web users placed in search engines. They would game the search engines, manipulating the quality and relevance metrics so as to force their own content into the ever-important top-10 of a relevant search. The Search Engines noticed this and a battle with the web spammers ensued: For every good idea that search engines introduced to better index and retrieve web documents, the spammers would come up with a trick to exploit the new situation. When the SEs introduced keyword frequency for ranking, the spammers came up with keyword stuffing (lots of repeating keywords to give the impression of high relevance); for web site popularity, they responded with link farms (lots of interlinked sites belonging to the same spammer); in response to the descriptive nature of anchor text they detonated Google bombs (use irrelevant keywords as anchor text to target a web site); and for the famous PageRank, they introduced mutual admiration societies (collaborating spammers exchanging links to increase everyone’s PageRank). In fact, one can describe the evolution of search results ranking technology as a response to Web spamming tricks. And since for each spamming technique there is a corresponding propagandistic one, they became the propagandists of cyberspace.
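The first of these tricks is easy to demonstrate. The sketch below is a toy illustration, assuming a deliberately naive ranking function that scores a page by the fraction of its words that match the query keyword. The two page texts and the `keyword_frequency` helper are invented for illustration; no real search engine ranks pages this simply.

```python
def keyword_frequency(text, keyword):
    """Naive relevance score: fraction of the words in the page that equal the keyword."""
    words = text.lower().split()
    if not words:
        return 0.0
    return words.count(keyword.lower()) / len(words)

# Two hypothetical pages (invented for illustration).
pages = {
    "honest": "our bakery sells fresh bread and pastries every morning",
    "stuffed": "bread bread bread cheap bread best bread buy bread",
}

# Rank the pages for the query "bread", highest score first.
ranked = sorted(pages, key=lambda name: keyword_frequency(pages[name], "bread"), reverse=True)
print(ranked)
```

The stuffed page wins even though it says nothing useful, which is why keyword frequency alone could not survive as a ranking signal.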

Around 2004, the first elements of misinformation around elections started to appear, and political advisers recognized that, even though the Web was not a major component of electoral campaigns at the time, it would soon become one. If they could just “persuade” search engines to rank positive articles about their candidates highly, along with negative articles about their opponents, they could convince a few casual Web users that their message was more valid and get their votes. Elections in the US, after all, often depend on a small number of closely contested races.

Search Engines have certainly tried hard to limit the success of spammers, who are seen as exploiting this technology to achieve their goals. Search results were adjusted to be less easily spammable, even if this meant that some results were hand-picked rather than algorithmically produced. In fact, during the 2008 and the 2010 elections, searching the Web for electoral candidates would yield results that contained official entries first: The candidate’s campaign sites, the office sites, and Wikipedia entries topped the results, well above even well-respected news organizations. The embarrassment of being gamed and of the infamous “miserable failure” Google bomb would not be tolerated.

Around the same time we saw the development of the Social Web, networks that allow people to connect, exchange ideas, air opinions, and keep up with their friends. The Social Web created opportunities for spreading not only political (and other) messages, but also misinformation through spamming. In our research we have seen several examples of propagation of politically-motivated misinformation. During the important 2010 Special Senatorial election in MA, spammers used Twitter in order to create a Google bomb that would bring their own messages to the third position of the top-10 results by frequently repeating the same tweet. They also created the first Twitter bomb targeting individuals interested in the MASEN elections with misinformation about one of the candidates, and created a pre-fab Tweet factory imitating a grass-roots campaign, attacking news organizations and reporters (a technique known as “astroturfing”).

Like propaganda in society, spam will stay with us in cyberspace. And as we increasingly look to the Web for information, it is important that we are able to detect misinformation. Arguably, now is the most important time for such detection, since we do not currently have a system of trusted editors in cyberspace like that which has served us well in the past (newspapers, publishers, institutions). What can we do?

* Retweeting reveals communities of likely-minded people: There are 2 larger groups that naturally arise when one considers the retweeting patterns of those tweeting during the 2010 MA special election. Sampling reveals that the smaller contains liberals and the larger conservatives. The larger one appears to consist of 3 different subgroups.

Some promising research in social media has shown potential in using technology to detect astroturfing. In particular, the following rules hold true most (though not all) of the time:

  1. The credibility of the information you receive is related to the trust you have towards the original sender and to those who retweeted it.
  2. Not only do Twitter friends (those that you follow) reveal a similarly-minded community, their retweeting patterns make these communities stronger and more visible.
  3. While both truths and lies propagate in cyberspace, lies have shorter life-spans and are questioned more often.

While we might prefer an automatic way of detecting misinformation with the use of algorithms, this will not happen. Citizens of cyberspace must become educated about how to detect misinformation, be provided with tools that will help them question and verify information, and draw on the strengths of crowd sourcing through their own groups of trusted editors. This Berkman conference will help us push in this direction.


Election time, and the predicting is easy…


As I am sure you have heard, the Iowa caucus results are in. Several journalists are reporting on the elections along with claims of “predictions” that social media are supposedly making. And the day after the Iowa caucus, they are wondering whether Twitter predicted correctly or not. And they look for advice to the “professionals”, such as Globalpoint, Sociagility, Socialbackers and other impressive sounding companies.

Shepard Fairey meets Angry Birds: Poster of our 2011 ICWSM submission "Limits of Electoral Predictions using Twitter"

Well, Twitter did not get it right. That is not surprising to my co-authors and me. Yet, they try to find a silver lining, by claiming smaller predictions such as “anticipating Santorum’s excellent performance than the national polls accomplished.” Of course, the fact that Twitter missed the mismatches with the other 5 candidates is ignored. Why can’t they see that?

A few years ago I had created a questionnaire to help my students sharpen their critical thinking skills. One question that the vast majority got right was the following: “Is Microsoft the most creative tech company?” If one were to do a Web search on this question, the first hit (the “I feel lucky” button) would be Microsoft’s own Web page, because it had as title “Microsoft is the most creative tech company.” My students realized that Microsoft may not be providing an unbiased answer to this question, and ignored it.

It is exactly this critical thinking principle that journalists obsessed with election predictions are getting wrong: The companies I mentioned above (Globalpoint, Sociagility, Socialbackers) are all in the business of making money by promising magical abilities in their own predictions and metrics. One should not take their claims at face value because they have a financial conflict of interest in giving misleading answers (e.g. “Comparing our study data with polling data from respected independent US political polling firm Public Policy Polling, we discovered a strong, positive correlation between social media performance and voting intention in the Iowa caucus.” Note that even after the elections they talk about intentions, not results.)

That’s not the only example violating this basic critical thinking principle I saw today. Earlier, I had received a tweet that “Americans more susceptible to online scams than believed, study finds”. The article reports that older, rich, highly educated men from the Midwest, politically affiliated with the Green Party, are far less susceptible to scams than young, poor, high school dropout women from the Southwest who support Independents. If you read the “study” findings, you will be even more confused about the quality of this study. A closer look reveals that the “study” was done by PC Tools, a company selling “online security and system utility software.” Apparently, neither the vagueness of the “survey” nor the financial conflict of interest of the surveying company raised any flags for the reporter.

In the Web era, information is finding us, not the other way around. Being able to think critically will be crucial.


