The Latest “Crisis” – Is the Research Literature Overrun with ChatGPT- and LLM-generated Articles?


Elsevier has been under the spotlight this month for publishing a paper that contains a clearly ChatGPT-written portion of its introduction. The first sentence of the paper’s Introduction reads, “Certainly, here is a possible introduction for your topic:…” To date, the article remains unchanged and unretracted. A second paper, containing the phrase “I’m very sorry, but I don’t have access to real-time information or patient-specific data, as I am an AI language model,” was subsequently found, and it similarly remains unchanged. This has led to a spate of amateur bibliometricians scanning the literature for other common AI-generated phrases, with some alarming results. But it’s worth digging a little deeper into those results to get a sense of whether this is indeed a widespread problem and, for the papers that did make it through to publication, where in the workflow the errors crept in.

Several of the investigations into AI pollution of the literature that I’ve seen employ Google Scholar for data collection (the link above, and another here). But when you start looking at the Google Scholar search results, you notice that a lot of what’s listed, at least on the first few pages, consists of preprints, items on ResearchGate, book chapters, or often something posted to a website you’ve never heard of with a Russian domain URL. The problem here is that Google Scholar is deliberately a largely non-gated index. It scans the internet for things that look like research papers (does it have an Abstract? does it have References?), rather than limiting results to a carefully curated list of reputable publications. Basically, it grabs anything that looks “scholarly”. This is a feature, not a bug, and one of the important values Google Scholar offers is that it reaches beyond the more limiting (and often English-language and Global North-biased) inclusion criteria of indexes like the Web of Science.
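Google’s actual inclusion logic isn’t public, but the kind of surface-level test described above is easy to picture. The Python sketch below is purely illustrative (the heading patterns and function name are my own invention, not Google’s code); it shows how little it takes for a page to “look scholarly” by such criteria:

```python
import re

# Illustrative only: Google Scholar's real inclusion rules are not public.
# A document "looks scholarly" here if it carries the surface markers
# mentioned above: an Abstract heading and a References section.
SECTION_HINTS = [
    r"^\s*abstract\b",
    r"^\s*(references|bibliography)\b",
]

def looks_scholarly(full_text: str) -> bool:
    """Return True if every expected section heading appears in the text."""
    lines = full_text.lower().splitlines()
    return all(
        any(re.match(pattern, line) for line in lines)
        for pattern in SECTION_HINTS
    )

# A spam page that pastes in an "Abstract" heading and a "References"
# list passes this test; nothing here checks for peer review.
print(looks_scholarly("Abstract\nWe study X.\nReferences\n[1] ..."))  # True
```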

But what happens when one runs similar searches on a more curated database, one limited to what most would consider a more accurate picture of the reputable scholarly literature? Here I’ve chosen Dimensions, an inter-linked research information system provided by Digital Science, as its content inclusion is broader than the Web of Science’s, but not as open-ended as Google Scholar’s. With the caveat that all bibliometric indexes lag, and take some time to bring in the most recently published articles (the two Elsevier papers mentioned above are dated March and June of 2024 and so aren’t yet indexed, as far as I can tell), my results are perhaps less worrying. All searches below were limited to research articles (no preprints, book chapters, or meeting abstracts) published after November 2022, when ChatGPT was publicly released.

A search for “Certainly, here is” brings up a total of ten articles published over that time period. Of those ten, eight are about ChatGPT, so the inclusion of the phrase is likely not suspect. A search for “as of my last knowledge update” gives a total of six articles, again with four of them focused on ChatGPT itself. A search for “I don’t have access to real-time data” brings up only three articles, all of which cover ChatGPT or AI. That leaves four suspect articles in total. Over this same period, Dimensions lists nearly 5.7M research and review articles published, which puts the rate at which these three phrases slipped through into publication at roughly 0.00007%.

Retraction Watch has a larger list of 77 items (as of this writing), using a more comprehensive set of criteria to spot problematic, likely AI-generated text; it includes journal articles from Elsevier, Springer Nature, MDPI, PLOS, Frontiers, Wiley, IEEE, and Sage. Again, this list needs further sorting, as it also includes five book chapters, eleven preprints, and at least sixteen conference proceedings pieces. Removing those 32 items leaves 45 journal articles, which, against the same 5.7M-article baseline, suggests a failure rate of roughly 0.0008%.
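To make the arithmetic behind both estimates explicit, here is a short Python sketch (the counts are the ones quoted above; the 5.7M denominator is the Dimensions total for the period):

```python
# Counts quoted above, from Dimensions and the Retraction Watch list.
TOTAL_ARTICLES = 5_700_000  # research + review articles since Nov 2022

# Dimensions phrase searches: (total hits, hits that are *about* ChatGPT/AI)
phrase_hits = {
    "Certainly, here is": (10, 8),
    "as of my last knowledge update": (6, 4),
    "I don't have access to real-time data": (3, 3),
}
suspect = sum(hits - about_ai for hits, about_ai in phrase_hits.values())
print(f"Dimensions: {suspect} suspect articles "
      f"-> {100 * suspect / TOTAL_ARTICLES:.5f}%")    # 4 -> 0.00007%

# Retraction Watch list: 77 items, minus the non-journal-article types.
rw_total = 77
rw_excluded = 5 + 11 + 16  # book chapters + preprints + proceedings
rw_articles = rw_total - rw_excluded
print(f"Retraction Watch: {rw_articles} journal articles "
      f"-> {100 * rw_articles / TOTAL_ARTICLES:.4f}%")  # 45 -> 0.0008%
```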

While some might argue that this does not constitute a “crisis”, it is likely that such errors will continue to rise, and frankly, there’s not really any excuse for allowing even a single paper with such an obvious tell to make it through to publication. While this has led several observers to question the peer review process at the journals where these failures occurred, it’s worth considering other points in the publication workflow where such errors might happen. As Lisa Hinchliffe recently pointed out, it’s possible these sections are being added at the revision stage or even post-acceptance. Peer reviewers and editors looking at a revision may only be checking the specific sections where they requested changes, and may miss other additions an author has put into the new version of the article. Angela Cochran wrote about how this has been exploited by unscrupulous authors adding in hundreds of citations in order to juice their own metrics. It is also possible that the LLM-generated language was added at the page-proof stage (whether deliberately or not). Most journals outsource typesetting to third-party vendors, and how carefully a journal scrutinizes the final, typeset version of a paper varies widely. As always, time spent by human editorial staff is the most expensive part of the publishing process, so many journals assume their vendors have done their jobs, and don’t go over each paper with a fine-toothed comb unless a problem is raised.

Two other important conclusions can be drawn from this uproar. The first is that despite preprints having been around for decades, those both within and adjacent to the research community clearly do not understand their nature and why they differ from the peer-reviewed literature, so more educational effort is needed. It should not be surprising to anyone that there are a lot of rough early drafts of papers or unpublishable manuscripts on SSRN (founded in 1994) or arXiv (launched in 1991). We’ve heard a lot of concern about journalists not recognizing that preprints aren’t peer reviewed, but maybe there’s as big a problem much closer to home. The second conclusion is that there seems to be a perception that appearing in Google Scholar search results offers some assurance of credibility or validation. This is absolutely not the case, and perhaps the fault here lies in the lack of differentiation between the profile service offered by Google Scholar, which is personally curated by individuals, and its search results, which are far less discriminating.

Going forward, I would hope that at the journals where this small number of papers slipped through, an audit is underway to better understand where the language was introduced and how it managed to get all the way to publication. Automated checks should be able to weed out common AI boilerplate like this, but they likely need to be run at multiple points in the publication process, rather than just on initial submissions. While the systems in place seem to be performing pretty well overall, there’s no room for complacency, and research integrity vigilance will only become more and more demanding.
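As a sketch of what such a multi-stage check might look like, here is a minimal Python example; the phrase list is a hypothetical starter set, the stage names are my own, and any production screen would need a far larger, regularly updated pattern library:

```python
import re

# Hypothetical starter list of telltale LLM boilerplate; a real screen
# would need a much larger, regularly updated set of patterns.
AI_TELL_PATTERNS = [
    r"certainly, here is a possible",
    r"as of my last knowledge update",
    r"i don'?t have access to real-?time (?:data|information)",
    r"as an ai language model",
]

def flag_ai_boilerplate(text: str) -> list[str]:
    """Return any telltale phrases found in a manuscript's text."""
    lowered = text.lower()
    return [p for p in AI_TELL_PATTERNS if re.search(p, lowered)]

def screen_all_stages(versions: dict[str, str]) -> None:
    """Run the same check at every workflow stage, not just submission."""
    for stage, text in versions.items():
        hits = flag_ai_boilerplate(text)
        if hits:
            print(f"[{stage}] flagged: {hits}")

# Example: text added at revision is caught even though the original
# submission was clean, which is the failure mode discussed above.
screen_all_stages({
    "submission": "We study retraction rates in oncology journals.",
    "revision": "Certainly, here is a possible introduction for your topic:",
    "page proofs": "Final typeset text.",
})
```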


