Replication, Verification and Availability for Big Data


The next step in the evolution of Social Computing research: formal acceptance by the community that Replication, Verification, and Availability of Big Data are worthy of credit.

In his response to my posting on Research Replication in Social Computing, Dr. Bernardo Huberman pointed to his letter to Nature on a related issue: verification of results. Here I expand on that to include a proposal that I have heard others mention recently.

I totally agree, of course, that “Science is unique in that peer review, publication and replication are essential to its progress.” This is what I also propose above. He focuses on the need for accessible data so that people can verify claims. For those who may not have access to his letter, I reproduce the central paragraph here:

“More importantly, we need to recognize that these results will only be meaningful if they are universal, in the sense that many other data sets reveal the same behavior. This actually uncovers a deeper problem. If another set of data does not validate results obtained with private data, how do we know if it is because they are not universal or the authors made a mistake? Moreover, as many practitioners of social network research are starting to discover, many of the results are becoming part of a “cabinet de curiosites” devoid of much generality and hard to falsify.”

Let me add something further, which I heard mentioned by Noshir Contractor and Steffen Staab at the WebScience Track during the WWW2012 conference and which I think will complement the overall proposal: people who make their data available to others should get credit for that. After all, in Science a lot of time is spent collecting and cleaning data, and those who do that and make their data available to other researchers for verification, meta-analyses and the study of other research questions should be rewarded for their contributions.

I believe the time is right to introduce formal credit for replication of results on comparable data sets, verification on the same data set, and making data accessible to others for further analysis and meta-analysis. I plan to use much of my group’s research time on these issues this summer and publish our findings afterwards.

