{"id":2445,"date":"2022-07-25T16:33:53","date_gmt":"2022-07-25T20:33:53","guid":{"rendered":"http:\/\/blogs.harvard.edu\/pamphlet\/?p=2445"},"modified":"2022-07-25T16:33:53","modified_gmt":"2022-07-25T20:33:53","slug":"moderating-principles","status":"publish","type":"post","link":"https:\/\/archive.blogs.harvard.edu\/pamphlet\/2022\/07\/25\/moderating-principles\/","title":{"rendered":"Moderating principles"},"content":{"rendered":"<p>Some time around April 1994, I founded the Computation and Language E-Print Archive, the first preprint repository for a subfield of computer science. It was hosted on Paul Ginsparg\u2019s <a href=\"https:\/\/arxiv.org\">arXiv<\/a> platform, which at the time had been hosting only physics papers, built out from the original arXiv repository for high-energy physics theory, hep-th. The repository, cmp-lg (as it was then called), was superseded in 1999 by an open-access preprint repository for all of computer science, the Computing Research Repository (CoRR), which covered a broad range of subject areas, including computation and language. The CoRR organizing committee also decided to host CoRR on arXiv. I switched over to moderating for the CoRR repository from cmp-lg, and have continued to do so for the last \u2013 oh my god \u2013 22 years.<a id=\"fnref:1\" class=\"footnote\" title=\"see footnote\" href=\"1\">[1]<\/a><\/p>\n<p>Articles in the arXiv are classified with a single primary <a href=\"https:\/\/arxiv.org\/archive\/cs\"><em>subject class<\/em><\/a>, and may have other subject classes as secondary. The switchover folded cmp-lg into the arXiv as articles tagged with the cs.CL (computation and language) subject class. I thus became the moderator for cs.CL.<\/p>\n<p>A preprint repository like the arXiv is not a journal. There is no peer review applied to articles. There is essentially no quality control. That is not the role of a preprint repository. The role of a preprint repository is open distribution, not vetting. 
Nonetheless, <em>some<\/em> kind of control is needed in making sure that, at the very least, the documents being submitted are in fact scholarly articles and are appropriately tagged as to subfield, and that need has expanded with the dramatic increase in submissions to CoRR over the years. The primary duty of a moderator is to perform this vetting and triage: verifying that a submission possesses the minimum standards for being characterized as a scholarly article, and that it falls within the purview of, say, cs.CL, as a primary or secondary subject class.<\/p>\n<p>I am (along with the other arXiv moderators) thus regularly in the position of having to make decisions as to whether a document is a scholarly article or not. To a large extent, <a href=\"https:\/\/en.wikipedia.org\/wiki\/I_know_it_when_I_see_it\">Justice Potter Stewart\u2019s approach<\/a> works reasonably well for scholarly articles: you know them when you see them. But over time, as more marginal cases come up, I\u2019ve felt that tracking my thinking on the matter would be useful for maintaining consistency in my own practice. And now that I\u2019ve done that for a while, I thought it might be useful to share my approach more broadly. That is the goal of this post.<\/p>\n<p>The following thus constitutes (some of) the de facto policies that I use in making decisions as the moderator for the cs.CL collection in the CoRR part of the arXiv repository. I emphasize that these are <em>my<\/em> policies, not those of CoRR or the moderators of other CoRR subjects. (The arXiv folks themselves provide a more general <a href=\"https:\/\/arxiv.org\/mod\/guidelines\">guide for arXiv moderators<\/a>.)<!--more--><\/p>\n<h1>What\u2019s a scholarly article?<\/h1>\n<p>To qualify for inclusion, a submission must constitute a scholarly article. For this purpose, scholarly articles fall into three classes:<\/p>\n<dl>\n<dt>Analytic<\/dt>\n<dd>The submission presents a <em>specific question that it then answers<\/em>. 
This might take the form of presenting a proposition and then proving it, or presenting an alternative method for a task and then demonstrating that it does or does not improve over some other method, or defining a new task and presenting a method for carrying it out. It ought to present enough of a clue about the methods such that at least the beginning of an attempt at replication could be made. On the other hand, the reported result need not be novel, or interesting, or even correct. Determining whether a result is novel, interesting, or correct is the point of reviewing, which, remember, we are not in the business of.<\/dd>\n<dt>Synthetic<\/dt>\n<dd>The submission <em>synthesizes in a systematic manner<\/em> other scholarly articles. (Yes, this definition is intentionally recursive, with analytic articles forming the base case.) Review articles fall within this class. The article should make at least an attempt at both systematicity (providing and using a well-formed taxonomy, for instance) and exhaustiveness.<\/dd>\n<dt>Dataset<\/dt>\n<dd>Occasionally, we get papers that describe a new <em>publicly available language-related dataset<\/em> of interest, with perhaps only rudimentary analysis of the data. These can be appropriate if the data is made openly available, and the availability is clearly featured in the article.<\/dd>\n<\/dl>\n<p>Papers that don\u2019t arguably fall into one of these three classes I will typically reject for inclusion in cs.CL. In particular, I frequently see articles whose putative thesis is (as far as I can tell) \u201cI did a thing\u201d. Typically, the paper reports that \u201cI built a piece of software that does <em>X<\/em>.\u201d A report of that sort does not suffice as constituting a scholarly article. 
A special case of this is the putative dataset paper that reports \u201cI built a corpus\u201d but provides no access to that corpus.<\/p>\n<h1>What\u2019s a CL (computation and language) article?<\/h1>\n<p>Computational linguistics (CL) is the scholarly field studying human (natural) language using the tools and techniques of computer science, and is allied with the engineering field of natural-language processing (NLP), the building or improving of useful artifacts that manipulate natural language. Articles appropriate for a cs.CL tag, then, should have something to do with natural language. Articles about sound processing of speech may be appropriate if the techniques rely on the fact that the acoustic signal is spoken language. Similarly for image processing.<\/p>\n<p>I tend to apply an extremely broad interpretation of \u201chaving something to do with language\u201d, especially for cross-listing as a cs.CL secondary subject class, on the theory that errors of omission are worse than errors of commission for subscribers to the cs.CL notifications.<\/p>\n<p>In addition, articles need to involve some computer science. CoRR is, after all, a computer science repository. Formal analysis of language that merely uses computers for the analysis does not by itself qualify a paper as involving computer science. After all, these days, all scientific research involves computers. I\u2019ve rejected some excellent papers in what is essentially formal linguistics because they made no nontrivial use of computer science ideas. Similarly, we get papers that apply well-known and standard NLP methods to standard NLP problems in some application area, such as analyzing health records or student writing, where the (potential) contribution is to health or education. Absent <em>some<\/em> involvement with computer science, these articles are best thought of as falling within the application field. 
Such articles deserve distribution in <a href=\"http:\/\/v2.sherpa.ac.uk\/opendoar\/\">a preprint repository<\/a>, just not CoRR. Health application papers might use medRxiv, education papers EdArXiv. Sadly, there are many fields of scholarship without good, long-term, well-maintained repositories, but CoRR can\u2019t be the solution to <em>that<\/em> problem. (The open-field <a href=\"https:\/\/zenodo.org\">Zenodo repository<\/a> should serve for most such papers.)<\/p>\n<h1>Special cases<\/h1>\n<p>There are tricky cases that arise with sufficient frequency that they are worth considering explicitly.<\/p>\n<h2>Research proposals<\/h2>\n<p>We often see articles (typically very short ones, two or three pages) sketching an idea for dealing with some problem or other, but without presenting any results. These may even be published in a workshop or conference proceedings, and perhaps the ideas may be novel and worth following up on. Nonetheless, without substantive results they don\u2019t qualify as scholarly articles. The authors could of course perform the follow-up and then submit an article describing the results (whether positive or negative). Of course, more fully worked out proposals with useful details might be acceptable.<\/p>\n<h2>Course projects<\/h2>\n<p>Submissions describing course projects or other schoolwork (undergraduate or master\u2019s theses, for example) may be allowable for inclusion if they are otherwise structured as scholarly articles and meet the criteria above. However, they may require extra scrutiny to verify the criteria described above.<\/p>\n<h2>Predatory journals<\/h2>\n<p>Publication in a <a href=\"https:\/\/en.wikipedia.org\/wiki\/Predatory_publishing\">predatory journal<\/a> (a faux journal charging for article processing while providing no or only perfunctory reviewing and other publishing services) is, by arXiv policy, not grounds for rejecting an article. 
However, such submissions often carry copyright notices, which arXiv disallows, and the submissions also often fail the criteria for being a scholarly article above.<\/p>\n<h2>Hyper-interdisciplinary articles<\/h2>\n<p>We sometimes get articles that are what might be charitably referred to as \u201chyper-interdisciplinary\u201d, in bringing together a CL topic with some other distant topic, a theory of language meaning based on a cylindrical algebraic reconstruction of quantum gravity, say. Such an article is difficult if not impossible to determine the status of: is it a scholarly article or a crankish manifesto? These articles are typically rejected. I simply do not have the time or inclination to attempt to reconstruct an understanding of the article to determine whether it is appropriate for cs.CL.<\/p>\n<p>Perhaps in doing so, I am eliminating from the repository the one true breakthrough in cracking the nut of the hardest intellectual problem of human cognition \u2013 how language works. Perhaps, but the author might have managed to express it in terms that a computer scientist specializing in natural language can understand. In any case, nothing prevents the author from submitting the article to a journal for peer review and making his<a id=\"fnref:2\" class=\"footnote\" title=\"see footnote\" href=\"2\">[2]<\/a> case.<\/p>\n<div class=\"footnotes\">\n<hr \/>\n<ol>\n<li id=\"fn:1\">Thanks to Yonatan Belinkov, Karina Halevy, and Joe Halpern for their helpful comments on earlier drafts of this post. <a class=\"reversefootnote\" title=\"return to article\" href=\"1\">\u00a0\u21a9<\/a><\/li>\n<li id=\"fn:2\">These submissions are always solo authored by men. <a class=\"reversefootnote\" title=\"return to article\" href=\"2\">\u00a0\u21a9<\/a><\/li>\n<\/ol>\n<\/div>\n","protected":false},"excerpt":{"rendered":"<p>Some time around April 1994, I founded the Computation and Language E-Print Archive, the first preprint repository for a subfield of computer science. 
It was hosted on Paul Ginsparg\u2019s arXiv platform, which at the time had been hosting only physics papers, built out from the original arXiv repository for high-energy physics theory, hep-th. The repository, [&hellip;]<\/p>\n","protected":false},"author":2110,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_jetpack_newsletter_access":"","_jetpack_dont_email_post_to_subs":false,"_jetpack_newsletter_tier_id":0,"_jetpack_memberships_contains_paywalled_content":false,"_jetpack_memberships_contains_paid_content":false,"footnotes":"","jetpack_publicize_message":"","jetpack_publicize_feature_enabled":true,"jetpack_social_post_already_shared":true,"jetpack_social_options":{"image_generator_settings":{"template":"highway","default_image_id":0,"font":"","enabled":false},"version":2},"jetpack_post_was_ever_published":false},"categories":[6028,380,6027,618,68,1903],"tags":[],"class_list":["post-2445","post","type-post","status-publish","format-standard","hentry","category-computational-linguistics","category-computer-science","category-linguistics","category-open-access","category-scholarly-communication","category-writing"],"jetpack_publicize_connections":[],"jetpack_featured_media_url":"","jetpack_sharing_enabled":true,"jetpack_shortlink":"https:\/\/wp.me\/p5pLfN-Dr","jetpack-related-posts":[{"id":729,"url":"https:\/\/archive.blogs.harvard.edu\/pamphlet\/2011\/03\/12\/the-importance-of-dark-deposit\/","url_meta":{"origin":2445,"position":0},"title":"The importance of dark deposit","author":"Stuart Shieber","date":"Saturday, March 12, 2011","format":false,"excerpt":"Hubble's Dark Matter Map from flickr user NASA Goddard Photo and Video, used by permission The Harvard repository, DASH, comprises several thousand articles in all fields of scholarship. 
These articles are stored and advertised through an item page providing metadata \u2014 such as title, author, citation, abstract, and link to\u2026","rel":"","context":"In &quot;open access&quot;","block_context":{"text":"open access","link":"https:\/\/archive.blogs.harvard.edu\/pamphlet\/category\/scholarly-communication\/open-access\/"},"img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":471,"url":"https:\/\/archive.blogs.harvard.edu\/pamphlet\/2010\/06\/09\/a-proposal-to-simplify-the-university-of-north-texas-open-access-policy\/","url_meta":{"origin":2445,"position":1},"title":"A proposal to simplify the University of North Texas open-access policy","author":"Stuart Shieber","date":"Wednesday, June 9, 2010","format":false,"excerpt":"\"In High Places\", statue by Gerald Balciar, University of North Texas - Denton campus, installed 1990. Image via Wikipedia. The University of North Texas is engaged in a laudable process of designing an open-access policy for their community. 
Draft language for their policy is now available at their site on\u2026","rel":"","context":"In &quot;open access&quot;","block_context":{"text":"open access","link":"https:\/\/archive.blogs.harvard.edu\/pamphlet\/category\/scholarly-communication\/open-access\/"},"img":{"alt_text":"","src":"http:\/\/upload.wikimedia.org\/wikipedia\/en\/thumb\/c\/c4\/UNT_Eagle_statue.jpg\/300px-UNT_Eagle_statue.jpg","width":350,"height":200},"classes":[]},{"id":712,"url":"https:\/\/archive.blogs.harvard.edu\/pamphlet\/2011\/02\/14\/dissertation-distribution-online-my-comments-at-the-aha\/","url_meta":{"origin":2445,"position":2},"title":"Dissertation distribution online: my comments at the AHA","author":"Stuart Shieber","date":"Monday, February 14, 2011","format":false,"excerpt":"I spoke at a panel last month at the annual meeting of the\u00a0American Historical Association devoted to the question of electronic dissertations and intellectual property rights entitled \"When Universities Put Dissertations on the Internet: New Practice; New Problem?\" My co-panelists included Edward Fox, professor of computer science at Virginia Tech\u2026","rel":"","context":"In &quot;open access&quot;","block_context":{"text":"open access","link":"https:\/\/archive.blogs.harvard.edu\/pamphlet\/category\/scholarly-communication\/open-access\/"},"img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":588,"url":"https:\/\/archive.blogs.harvard.edu\/pamphlet\/2010\/09\/07\/for-publishers-using-pmc-to-kill-multiple-birds-with-one-stone\/","url_meta":{"origin":2445,"position":3},"title":"For publishers, using PMC to kill multiple birds with one stone","author":"Stuart Shieber","date":"Tuesday, September 7, 2010","format":false,"excerpt":"Here's a clever way for a journal to efficiently and cost-effectively provide open access to its articles (at least in the life sciences): Use PubMed Central as the journal's article repository. 
This expedient has all kinds of advantages: You have to allow for PMC distribution anyway, in fields where much\u2026","rel":"","context":"In &quot;open access&quot;","block_context":{"text":"open access","link":"https:\/\/archive.blogs.harvard.edu\/pamphlet\/category\/scholarly-communication\/open-access\/"},"img":{"alt_text":"PubMed Central logo","src":"https:\/\/i0.wp.com\/www.ncbi.nlm.nih.gov\/corehtml\/pmc\/pmcgifs\/pmclogo.gif?resize=350%2C200","width":350,"height":200},"classes":[]},{"id":1089,"url":"https:\/\/archive.blogs.harvard.edu\/pamphlet\/2011\/12\/02\/clarifying-the-harvard-policies-a-response\/","url_meta":{"origin":2445,"position":4},"title":"Clarifying the Harvard policies: a response","author":"Stuart Shieber","date":"Friday, December 2, 2011","format":false,"excerpt":"My friend and ex-colleague Matt Welsh has an interesting post supporting the Research Without Walls pledge, in which he talks about the Harvard open-access policies. He says: Another way to fight back is for your home institution to require all of your work be made open.\u00a0Harvard was one of the\u2026","rel":"","context":"In &quot;open access&quot;","block_context":{"text":"open access","link":"https:\/\/archive.blogs.harvard.edu\/pamphlet\/category\/scholarly-communication\/open-access\/"},"img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":56,"url":"https:\/\/archive.blogs.harvard.edu\/pamphlet\/2009\/05\/27\/some-background-on-open-access\/","url_meta":{"origin":2445,"position":5},"title":"Some background on open access","author":"Stuart Shieber","date":"Wednesday, May 27, 2009","format":false,"excerpt":"I assume that readers of the open access discussions on this blog are familiar with the state of play in the area, but just in case, here's some background. 
Peter Suber defines open access in his A Very Brief Introduction to Open Access as follows: \"Open-access (OA) literature is digital,\u2026","rel":"","context":"In &quot;meta&quot;","block_context":{"text":"meta","link":"https:\/\/archive.blogs.harvard.edu\/pamphlet\/category\/meta\/"},"img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]}],"_links":{"self":[{"href":"https:\/\/archive.blogs.harvard.edu\/pamphlet\/wp-json\/wp\/v2\/posts\/2445","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/archive.blogs.harvard.edu\/pamphlet\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/archive.blogs.harvard.edu\/pamphlet\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/archive.blogs.harvard.edu\/pamphlet\/wp-json\/wp\/v2\/users\/2110"}],"replies":[{"embeddable":true,"href":"https:\/\/archive.blogs.harvard.edu\/pamphlet\/wp-json\/wp\/v2\/comments?post=2445"}],"version-history":[{"count":5,"href":"https:\/\/archive.blogs.harvard.edu\/pamphlet\/wp-json\/wp\/v2\/posts\/2445\/revisions"}],"predecessor-version":[{"id":2450,"href":"https:\/\/archive.blogs.harvard.edu\/pamphlet\/wp-json\/wp\/v2\/posts\/2445\/revisions\/2450"}],"wp:attachment":[{"href":"https:\/\/archive.blogs.harvard.edu\/pamphlet\/wp-json\/wp\/v2\/media?parent=2445"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/archive.blogs.harvard.edu\/pamphlet\/wp-json\/wp\/v2\/categories?post=2445"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/archive.blogs.harvard.edu\/pamphlet\/wp-json\/wp\/v2\/tags?post=2445"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}