How language models turned plagiarism into an art
Neural network language models (LMs) are capable of generating grammatical and coherent text. But the originality of the text that such models produce is suspect.
So are these LMs simply “stochastic parrots” regurgitating text, or have they actually learned how to create complicated structures that support sophisticated generalization?
Why is novelty important?
The novelty of a generated text tells us how different it is from the training set. Studying the novelty of LMs is important for two main reasons: first, models are intended to learn the training distribution, not just memorize the training set; second, models that simply copy the training data are more likely to disclose sensitive information or repeat hate speech.
In a recent publication, researchers from Johns Hopkins University, New York University, Microsoft Research and Facebook AI Research have developed a method for measuring the novelty of the text generated by LMs. The study looked at how well LMs repurposed language in novel ways.
Do language models plagiarize training data?
To assess the novelty of the generated text, the researchers introduced a suite of analyses (called RAVEN) that covers both the sequential and syntactic structure of the text. They then applied these analyses to a Transformer, Transformer-XL, an LSTM, and all four sizes of GPT-2.
According to their findings, all of these models demonstrated novelty in all aspects of structure. They generated novel n-grams, morphological combinations, and syntactic structures: 74% of the sentences generated by Transformer-XL had a syntactic structure that differed from the training sentences, and GPT-2 was able to coin novel words (including inflections and derivations).
That said, for smaller n-grams the models are still less novel than the baseline (the degree of duplication measured between a human-generated text and the training set). In addition, there is occasional evidence of large-scale copying: for example, GPT-2 sometimes duplicates training passages of more than 1,000 words.
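To make the n-gram comparison concrete, here is a minimal sketch of the kind of novelty check such analyses perform: the fraction of n-grams in a generated text that never appear in the training text. This is an illustrative simplification (whitespace tokenization, a single training string); the function names are my own, not RAVEN's.

```python
def ngrams(tokens, n):
    """Return all n-grams (as tuples) of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def novelty(generated, training, n):
    """Fraction of n-grams in `generated` that never occur in `training`."""
    train_set = set(ngrams(training.split(), n))
    gen = ngrams(generated.split(), n)
    if not gen:
        return 0.0
    return sum(1 for g in gen if g not in train_set) / len(gen)

training = "the cat sat on the mat"
generated = "the cat sat on the hat"
print(novelty(generated, training, 2))  # 0.2 — only ("the", "hat") is novel
```

With larger n, novelty rises quickly: almost every long word sequence a model emits is new, which is why the interesting signal lies in small n-grams and in rare long verbatim copies.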
All in all, one can conclude that neural language models do not simply plagiarize their training data but also use constructive processes to combine familiar parts in novel ways.
Threat to Academic Integrity?
Neural language models are so good at generating novel text that it has become difficult for statistical and traditional ML solutions to detect machine-obfuscated plagiarism.
AI typing assistants like OpenAI’s GPT-3 are shockingly easy to use. You can enter a headline and a few sentences related to the topic, and GPT-3 will start filling in the details automatically. The model produces plausible, virtually endless output and, most importantly, lets you interact with the “robot writer” to correct errors.
This capability comes from the ever-increasing size of the training data. For context, the entirety of Wikipedia (more than 6 million articles and 3.9 billion words) accounts for only 0.6% of the training input for GPT-3.
Studies show that an alarming number of students use online paraphrasing tools such as SpinBot and SpinnerChief to obfuscate plagiarized text. Such tools use AI to manipulate text (e.g. replacing words with their synonyms) to give the work a semblance of originality.
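The core trick behind such spinning tools can be illustrated in a few lines: substitute words with synonyms so that exact-match detectors no longer fire. This is a deliberately toy sketch with a hypothetical hand-written synonym table; real tools like SpinBot use large lexicons and learned models rather than a fixed dictionary.

```python
# Hypothetical synonym table for illustration only.
SYNONYMS = {"large": "big", "rapid": "quick", "show": "demonstrate"}

def spin(text):
    """Replace each word with a synonym when one is known, else keep it."""
    return " ".join(SYNONYMS.get(word, word) for word in text.split())

print(spin("large models show rapid progress"))
# -> "big models demonstrate quick progress"
```

Even this naive substitution breaks exact string matching, which is precisely why text-matching software struggles against paraphrased plagiarism.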
Using neural language models for paraphrasing is a recent trend, and so far not enough data has been collected to train plagiarism detection systems (PDS) on it. Today, most institutions use text-matching software to counteract plagiarism. These tools are effective at identifying duplicate text, but struggle to spot paraphrases, translations, and other elaborate forms of plagiarism.
Plagiarism detection systems
Plagiarism detection technology uses lexical, syntactic, semantic, and cross-language text analysis. Some methods focus on non-textual features, such as academic citation patterns and mathematical content, to detect plagiarism. Most current research focuses on quantifying the degree of similarity between two sentences in order to detect AI-assisted paraphrasing.
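A simple sentence-similarity measure shows both the idea and its limitation. The sketch below uses Jaccard similarity over token sets, one of the most basic lexical measures (modern detectors use embeddings instead); the example sentences are my own.

```python
def jaccard_similarity(a, b):
    """Jaccard similarity between the token sets of two sentences."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)

original = "The model produces plausible content"
paraphrase = "The system generates believable text"
print(jaccard_similarity(original, paraphrase))  # ~0.111
```

The two sentences mean nearly the same thing, yet share only one token out of nine, so the lexical score is close to zero. This is why detecting machine paraphrasing requires semantic rather than purely lexical comparison.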
According to a paper published by Bergische Universität Wuppertal in 2021, obtaining additional training data is the most promising way to improve the detection of machine-paraphrased text.