The words that reveal generative AI text

Until now, even AI companies have struggled to come up with tools that can reliably detect when text has been generated using a large language model. Now, a group of researchers has established a new method for estimating LLM usage across a wide range of scientific writing by measuring which “excess words” began appearing much more frequently during the LLM era (i.e. 2023 and 2024). The results “suggest that at least 10 percent of the 2024 abstracts were processed with LLMs,” according to the researchers.

In a preprint paper posted earlier this month, four researchers from Germany’s University of Tübingen and Northwestern University said they were inspired by studies that measured the impact of the Covid-19 pandemic by looking at excess deaths compared to the recent past. Taking a similar look at “excess word usage” after LLM writing tools became widely available in late 2022, the researchers found that “the appearance of LLMs led to an abrupt increase in the frequency of certain style words” that was “unprecedented in both quality and quantity.”

Dive in

To measure these vocabulary changes, researchers analyzed 14 million abstracts of articles published on PubMed between 2010 and 2024, tracking the relative frequency of each word as it appeared each year. They then compared the expected frequency of these words (based on the pre-2023 trend line) to the actual frequency of these words in the 2023 and 2024 abstracts, when LLMs were widely used.
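
The comparison described above can be sketched in a few lines of Python. This is only an illustration, not the researchers' code: it assumes the pre-2023 trend is an ordinary least-squares straight line, and the function names (`linear_fit`, `excess_usage`) and the frequencies in `toy` are made up for the example. It reports the excess two ways, as a difference and as a ratio between observed and expected frequency.

```python
def linear_fit(xs, ys):
    """Ordinary least-squares line through (xs, ys); returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def excess_usage(freq_by_year, target_year, cutoff_year=2023):
    """Compare a word's observed frequency with its extrapolated pre-LLM trend.

    freq_by_year maps year -> fraction of abstracts containing the word.
    Years before cutoff_year form the baseline (an assumption of this sketch).
    Returns (expected, gap, ratio).
    """
    baseline_years = [y for y in sorted(freq_by_year) if y < cutoff_year]
    baseline_freqs = [freq_by_year[y] for y in baseline_years]
    slope, intercept = linear_fit(baseline_years, baseline_freqs)

    # A frequency cannot be negative, so clamp the extrapolation at zero.
    expected = max(slope * target_year + intercept, 0.0)
    observed = freq_by_year[target_year]

    gap = observed - expected  # absolute excess
    ratio = observed / expected if expected > 0 else float("inf")  # relative excess
    return expected, gap, ratio


# Hypothetical numbers for a word that was rare and flat before 2023.
toy = {2019: 0.0010, 2020: 0.0010, 2021: 0.0010, 2022: 0.0010,
       2023: 0.0040, 2024: 0.0250}
expected, gap, ratio = excess_usage(toy, 2024)
print(f"expected={expected:.4f} gap={gap:.4f} ratio={ratio:.1f}")
```

The ratio is the more telling number for words that were rare before 2023, while the gap is more informative for words that were already common, since a common word can gain several percentage points without its ratio moving far from 1.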

The results revealed a number of words that were extremely rare in pre-2023 scientific abstracts and that suddenly surged in popularity after LLMs were introduced. The word “delves,” for example, appears in 25 times as many 2024 papers as the pre-LLM trend would predict; words like “showcasing” and “underscores” also increased ninefold. Other, previously common words became significantly more common in post-LLM abstracts: the frequency of “potential” increased by 4.1 percentage points, “findings” by 2.7 percentage points, and “crucial” by 2.6 percentage points, for example.

Of course, these types of changes in word usage could occur independently of LLM usage: the natural evolution of language means that words sometimes come in and out of fashion. However, the researchers found that in the pre-LLM era, such massive and sudden year-over-year increases were only seen for words related to major global health events: “ebola” in 2015; “zika” in 2017; and words like “coronavirus,” “lockdown,” and “pandemic” for the period 2020 to 2022.

In the post-LLM period, though, the researchers found hundreds of words whose scientific usage increased suddenly and sharply and which had no common link to world events. In fact, while the excess words during the Covid pandemic were predominantly nouns, the researchers found that the words with a post-LLM frequency bump were predominantly “style words” like verbs, adjectives, and adverbs (a small sample: “across, additionally, comprehensive, crucial, enhancing, exhibited, insights, notably, particularly, within”).

This is not a completely new finding: the increased prevalence of “delve” in scientific papers has, for example, been widely noted in the recent past. But previous studies typically relied on comparisons with “ground truth” human writing samples or on lists of predefined LLM markers obtained from outside the study. Here, the set of pre-2023 abstracts acts as its own effective control group to show how vocabulary choice has changed overall in the post-LLM era.

An intricate interplay

With hundreds of “marker words” identified as much more common in the post-LLM era, the telltale signs of LLM use can sometimes be easy to spot. Consider this example of an abstract line cited by the researchers, who highlighted the marker words in it: “A comprehensive grasp of the intricate interplay between […] and […] is pivotal for effective therapeutic strategies.”

After performing some statistical measurements of how often marker words appear in individual papers, the researchers estimate that at least 10 percent of post-2022 papers in the PubMed corpus were written with at least some LLM assistance. The number could be even higher, the researchers say, because their set could be missing LLM-assisted abstracts that don’t include any of the marker words they identified.
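
The article does not spell out which statistic produces that floor, so the sketch below is an assumption rather than the researchers' procedure. It illustrates one simple way a lower bound of this kind can arise: if a certain share of abstracts would be expected to contain at least one marker word without LLMs, and a larger share actually does, the difference is a floor on how many abstracts were changed. The function names, the marker list, and the toy abstracts are all hypothetical.

```python
import re


def fraction_with_any_marker(abstracts, markers):
    """Share of abstracts that contain at least one marker word."""
    marker_set = {m.lower() for m in markers}
    hits = 0
    for text in abstracts:
        # Crude tokenizer for the sketch: runs of ASCII letters only.
        words = set(re.findall(r"[a-z]+", text.lower()))
        if words & marker_set:
            hits += 1
    return hits / len(abstracts)


def lower_bound_llm_share(observed_fraction, expected_fraction):
    """Excess share of marker-containing abstracts over the pre-LLM expectation.

    It is a floor, not an estimate: LLM-assisted abstracts that use none of
    the marker words are invisible to this measure.
    """
    return max(observed_fraction - expected_fraction, 0.0)


# Made-up mini corpus; real input would be a full year of abstracts.
toy_abstracts = [
    "This finding is crucial for therapy.",
    "We measured blood pressure in 40 patients.",
    "Notably, the effect persisted after one year.",
    "No significant difference was observed.",
]
observed = fraction_with_any_marker(toy_abstracts, ["crucial", "notably"])
# expected_fraction would come from the pre-2023 trend; 0.25 is invented here.
floor = lower_bound_llm_share(observed, expected_fraction=0.25)
print(f"observed={observed:.2f} lower bound={floor:.2f}")
```

Because abstracts that avoid every marker word contribute nothing to the excess, a measure like this can only undercount, which matches the researchers' caveat that the true share could be higher.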
