Detection and anti-detection of texts generated by GPT-3 with RoBERTa, an experiment
There has been a lot of talk about generative models, ChatGPT, and GPT-3 these days. One of the first questions raised was whether machine-generated texts can be detected automatically.
As we have all heard, models such as GPT-3 do not answer queries the way we would: they predict, one at a time, the words with which to compose an output, given a ‘context window’ of 2048 tokens. In short, they ‘look’ at the previous 2048 tokens and, one word at a time, choose the word with the highest probability of appearing in that context. This probability is learned by training the model on a large corpus of documents (300 billion tokens).
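To make this concrete, here is a minimal sketch of next-token prediction. GPT-3 itself is not openly available, so the sketch uses GPT-2, a smaller model of the same family, via the Hugging Face transformers library; the context sentence is just an example.

```python
# Illustrative only: GPT-2 (same decoder family as GPT-3) is used to show how a
# language model ranks candidate next tokens given a context.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

context = "Man's best friend is the"
inputs = tokenizer(context, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits          # shape: (1, seq_len, vocab_size)

# Probability distribution over the vocabulary for the next token.
next_token_probs = torch.softmax(logits[0, -1], dim=-1)
top = torch.topk(next_token_probs, k=5)
for prob, token_id in zip(top.values, top.indices):
    print(f"{tokenizer.decode(int(token_id)):>10s}  p={prob.item():.3f}")
```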
If we wanted to draw a ‘human’ parallel, ChatGPT’s answer is similar to that of a well-spoken student who, caught unprepared by a question, starts composing sentences that seem to make logical sense and thus manages to convince the bored professor of his preparation.
Even this parallel is, however, far from reality: behind every single word produced by tools like these are processes that transform the input text into vectors, apply mathematical operations to those vectors, and then transform the resulting vectors back into words.
This peculiarity opens the way to identifying patterns, invisible to the human eye, that allow other machines to recognize when a text was written by a system like GPT-3.
Machine-generated text detection: some questions
Many models for recognizing AI-generated texts are flourishing at the moment, and they typically rely on evaluating the perplexity of the text. Perplexity is a metric used to judge the performance of a language model.
We can define perplexity as the inverse of the probability of the test set, normalized by the number of words; more information can be found in this article.
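As a rough illustration of that definition (not the exact metric any particular detector implements), perplexity can be computed as the exponential of the average negative log-probability the model assigns to each token:

```python
# Minimal sketch: perplexity as the inverse probability of a text,
# normalized by its length (equivalently, exp of the average negative log-likelihood).
import math

def perplexity(token_probs):
    """token_probs: probability the model assigned to each observed token."""
    n = len(token_probs)
    avg_neg_log_likelihood = -sum(math.log(p) for p in token_probs) / n
    return math.exp(avg_neg_log_likelihood)

# A model that is very sure of every word sees low perplexity...
print(perplexity([0.9, 0.8, 0.95, 0.85]))   # ~1.15
# ...while a text it finds surprising yields high perplexity.
print(perplexity([0.1, 0.05, 0.2, 0.15]))   # ~9.0
```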
Another very interesting approach involves fine-tuning pre-trained transformer-based models (the same family of models as GPT-3) that specialize in classification (such as RoBERTa) to identify texts written by these models.
There is a lot of open-source literature on texts generated by GPT-2 and other models, and many of the models made available in this way can be re-trained on texts written with GPT-3.
Based on this information, I asked myself a few questions:
- are the classification models currently available (and usually trained on GPT-2) also effective on texts written with GPT-3 (in particular with text-davinci-003, commonly called GPT-3.5)?
- what tricks in setting up the OpenAI calls and in automatic post-processing of the texts can make them less identifiable by these systems (and thus what these systems should be trained on with additional fine-tuning)?
It is important to know that, when setting up an OpenAI API call, one parameter that can be set in addition to the prompt is the temperature. Intuitively, this parameter echoes the physical concept of temperature, the property that measures the state of agitation of matter: the higher the temperature, the more agitated and unpredictable the molecules. Similarly, a higher temperature in the model settings will lead the model to choose less probable words.
If we use temperature 0, the completion of the sentence ‘Man’s best friend is ….’ will most likely be [dog]. With temperature 1, the sentence could be completed with [ferret].
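For reference, this is roughly what a call with an explicit temperature looked like with the legacy OpenAI completions endpoint and text-davinci-003; the prompt wording and parameter values here are illustrative, not necessarily those used in the experiment.

```python
# Sketch of a completion call at a given temperature, using the legacy
# OpenAI Python client and the completions endpoint available at the time.
import openai

openai.api_key = "YOUR_API_KEY"

article_text = "..."  # the original article to be rewritten

def rewrite(text, temperature):
    response = openai.Completion.create(
        model="text-davinci-003",
        prompt=f"Rewrite the following article in a new form:\n\n{text}",
        temperature=temperature,   # 0 = most predictable wording, 1 = most surprising
        max_tokens=1024,
    )
    return response.choices[0].text.strip()

# One variant per temperature, as in the experiment setup described below.
variants = {t: rewrite(article_text, t) for t in (0, 0.25, 0.5, 0.75, 1)}
```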
If these detection models are based on perplexity, it is reasonable to assume that high temperatures will make the texts more difficult to identify.
The other check I wanted to make concerns small fixes applied to the text automatically: the simplest thing I could think of was to process the texts written with OpenAI and replace some words with synonyms.
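As an illustration of the idea (the experiment itself uses a dedicated WordNet-based library, linked in the process description below), a naive synonym-substitution pass could look like this with NLTK’s WordNet interface:

```python
# Minimal sketch of synonym substitution via WordNet, using NLTK.
# This is only an illustration of the idea, not the exact tool used in the experiment.
import random
import nltk
from nltk.corpus import wordnet

nltk.download("wordnet", quiet=True)
nltk.download("punkt", quiet=True)

def replace_with_synonyms(text, replace_prob=0.2, seed=42):
    random.seed(seed)
    out = []
    for word in nltk.word_tokenize(text):
        synsets = wordnet.synsets(word)
        if synsets and random.random() < replace_prob:
            # Pick a lemma from the first synset that differs from the original word.
            lemmas = [l.name().replace("_", " ") for l in synsets[0].lemmas()]
            candidates = [l for l in lemmas if l.lower() != word.lower()]
            if candidates:
                word = random.choice(candidates)
        out.append(word)
    return " ".join(out)

print(replace_with_synonyms("The dog is man's best friend."))
```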
What I’ve done
At this point we come to the setup of the experiment, which followed this process:
- Extracting a dataset of articles written by a person. I chose the dataset of Seth Godin’s blog articles, written between 2002 and 2020, available on Kaggle. I have always liked Seth Godin and wanted this to be a kind of tribute;
- Rewriting the titles and texts via the OpenAI API. I had the titles and texts of the articles rewritten in a new form, using 5 different temperatures (0, 0.25, 0.5, 0.75, 1);
- Variant generation with WordNet. Thanks to this library, I generated a variant of each article thus obtained by synonym substitution via WordNet. WordNet is a large lexical database of the English language: nouns, verbs, adjectives, and adverbs are grouped into sets of cognitive synonyms (synsets), each of which expresses a distinct concept;
- Classification and evaluation. At this point I classified the texts with this model, based on RoBERTa-large and trained on texts generated by GPT-2, then compared the classification results with the ground truth (a minimal sketch of this step follows this list).
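The sketch below shows what the classification step can look like, assuming the RoBERTa-large GPT-2 output detector published on the Hugging Face Hub (the post links the exact model used); label names depend on the detector’s configuration.

```python
# Minimal sketch of the classification step with a RoBERTa-large GPT-2 output
# detector (model id assumed here as "roberta-large-openai-detector").
from transformers import pipeline

detector = pipeline("text-classification", model="roberta-large-openai-detector")

texts = {
    "human": "...an original blog article...",
    "gpt3_rewrite": "...the same article rewritten via the OpenAI API...",
}

for name, text in texts.items():
    result = detector(text, truncation=True)[0]
    # Label names (e.g. "Real"/"Fake") come from the model's config.
    print(f"{name:>14s}: {result['label']} (score={result['score']:.3f})")
```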
Results
The model, against all initial expectations, performs extremely well on texts written with GPT-3 (text-davinci-003), despite having been trained on a corpus of texts generated with GPT-2.
As might have been expected, performance decreases significantly as the temperature increases and collapses when synonyms are used. Texts generated at temperatures 0 and 0.25, on which the model originally achieves an accuracy of over 90%, become difficult to recognize once WordNet synonym substitutions are applied.
Conclusions and next steps
The recognition of machine-generated texts is possible and manageable even with modest tools, and this opens the door to various implementation scenarios, although, in my opinion, the focus should be on the quality of the content, regardless of how it is generated.
With a view to training recognition models, it certainly becomes interesting to also evaluate training on texts that have undergone other automatic post-processing operations, beyond the simple prompt and temperature setup.
It will certainly be interesting to update this study in two ways:
- by enlarging the starting dataset, and going beyond Seth Godin’s blog;
- by fine-tuning the RoBERTa-based model through a dataset of articles written with GPT-3 and ChatGPT, and evaluating the performance after this fine-tuning.
There is still plenty of room for work!