AI Cheating Detection

 

By Dr. Michael LaBossiere

When ChatGPT and its competitors became available to students, some warned of an AI apocalypse in education. This fear mirrored broader worries about the over-hyped dangers of AI. This is not to deny that AI presents challenges and dangers, but we need a realistic view of the threats and promises so that rational policies and practices can be implemented.

As a professor and the chair of the General Education Assessment Committee at Florida A&M University, I assess the work of my students, and I am involved with the broader task of assessing general education. In both cases a key challenge is determining how much of the work turned in by students is their own. After all, we want to know how our students are performing and not how AI or some unknown writer is performing.

While students have been cheating since the advent of education, it was feared AI would cause a cheating tsunami. This worry seemed sensible since AI makes cheating easy, free and harder to detect. Large language models allow “plagiarism on demand” by generating new text each time. With the development of software such as Turnitin, detecting traditional plagiarism became automated and fast. These tools also identify the sources used in plagiarism, providing professors with reliable evidence. But large language models defeat this method of detection, since they generate original text. Ironically, some faculty now see a 0% plagiarism score on Turnitin as a possible red flag. But has an AI cheating tsunami washed over education?

Determining how many students are cheating is like determining how many people are committing crimes: one only knows how many people have been caught and not how many people are doing it. Because of this, caution must be exercised when drawing a conclusion about the extent of cheating; otherwise, one runs the risk of falling victim to the fallacy of overconfident inference from unknown statistics.

In the case of AI cheating in education, one source of data is Turnitin’s AI detection software. Over the course of a year, the service checked 200 million assignments and flagged AI use in 1 in 10 assignments, while 3 in 100 were flagged as mostly AI. These results have remained stable, suggesting that AI cheating is neither a tsunami nor increasing. But this assumes that the AI detection software is accurate. Turnitin claims it has a false positive rate of 1%. Other AI detection services have also been evaluated, with the worst having an accuracy of 38% and the best claimed to have an accuracy of 90%.

But there are two major problems with the accuracy of existing AI detection software. The first, as the title of a recent paper notes, is that “GPT detectors are biased against non-native English writers.” As the authors noted, while AI detectors are nearly perfectly accurate in evaluating essays by U.S.-born eighth-graders, they misclassified 61.22% of TOEFL essays written by non-native English students. All seven of the tested detectors unanimously flagged 18 of the 91 TOEFL essays as AI-written, and 89 of the 91 essays (about 98%) were flagged by at least one detector.

The second is that AI detectors can be fooled. The current detectors usually work by evaluating perplexity as a metric. Perplexity is, roughly, a measure of how predictable a text is to a language model; it tends to be low for AI-generated text and for writing with limited lexical diversity and grammatical complexity. It can be raised in AI text by using simple prompt engineering. For example, a student could prompt ChatGPT to rewrite the text using more literary language. There is also a concern that the algorithms used in proprietary detection software will be kept secret, so it will be difficult to determine what biases and defects they might have.
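To make the detection figures cited above concrete, here is a rough back-of-the-envelope sketch in Python. The counts and rates come from the numbers quoted in this essay; the false positive estimate rests on a deliberately crude assumption of my own (that the claimed 1% rate applies to every assignment and that every assignment was written by a human), so it should be read as an illustrative upper bound and not as a finding. The variable names are purely illustrative.

```python
# Figures quoted in the essay.
TOTAL_ASSIGNMENTS = 200_000_000   # assignments checked over a year
TOEFL_ESSAYS = 91                 # TOEFL essays in the cited detector study
FLAGGED_BY_ALL_SEVEN = 18         # flagged by all seven detectors
FLAGGED_BY_AT_LEAST_ONE = 89      # flagged by one or more detectors

# 1 in 10 flagged for some AI use; 3 in 100 flagged as mostly AI.
flagged_any_ai = TOTAL_ASSIGNMENTS // 10
flagged_mostly_ai = TOTAL_ASSIGNMENTS * 3 // 100

# ASSUMPTION (not from the source): apply the claimed 1% false positive
# rate to all assignments as if every one were human-written. This gives
# a crude ceiling on wrongly flagged work, not an actual estimate.
false_positive_ceiling = TOTAL_ASSIGNMENTS // 100

# Share of TOEFL essays wrongly flagged, as percentages.
pct_all_seven = round(100 * FLAGGED_BY_ALL_SEVEN / TOEFL_ESSAYS, 1)
pct_at_least_one = round(100 * FLAGGED_BY_AT_LEAST_ONE / TOEFL_ESSAYS, 1)

print(f"Flagged for some AI use: {flagged_any_ai:,}")
print(f"Flagged as mostly AI:    {flagged_mostly_ai:,}")
print(f"False positive ceiling:  {false_positive_ceiling:,}")
print(f"TOEFL flagged by all 7:  {pct_all_seven}%")
print(f"TOEFL flagged by any:    {pct_at_least_one}%")
```

Even under these simplified assumptions, a 1% error rate at this scale could mean a very large number of students whose own work is flagged, which is why the accuracy question matters so much.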

Because of these problems, educators should be cautious when using such software to evaluate student work. This is especially true in cases in which a student is assigned a failing grade or even accused of academic misconduct because they are suspected of using AI. In the case of traditional cheating, a professor could have clear evidence in the form of copied text. In the case of AI detection, the professor only has the evaluation of software whose inner workings are most likely not available for examination and whose true accuracy remains unknown. Because of this, educational institutions need to develop rational guidelines for best practices when using AI detection software.

But the question remains as to how likely it is that students will engage in cheating now that ChatGPT and its ilk are readily available. Stanford scholars Victor Lee and Denise Pope have been studying cheating, and their surveys over the past 15 years showed that 60-70% of students admitted to cheating. In 2023 the percentage stayed about the same or decreased slightly, even when students were asked about using AI. While there is the concern that cheaters would lie about cheating, Pope and Lee use anonymous surveys and take care in designing the survey questions. While cheating remains a problem, AI has not increased it, and the feared tsunami seems to have died far offshore.

This does make sense in that cheating has always been relatively easy, and the decision to cheat is more a matter of moral and practical judgment than of the available technology. While technology can provide new means of cheating, a student must still be willing to cheat, and the percentage of students willing to do so seems to be relatively stable in the face of changing technology. That said, large language models are a new technology, and their long-term impact on cheating is something that needs to be determined. But, so far, the doomsayers’ predictions have not come true. Fairness requires acknowledging that this might be because educators took effective action to prevent this; it would be poor reasoning to fall for the prediction fallacy.

As a final point of discussion, it is worth considering that perhaps AI has not resulted in a surge in cheating because it is not a great tool for this. As Arvind Narayanan and Sayash Kapoor have argued, AI seems to be most useful at doing useless things. To be fair, assignments in higher education can be useless things of the type AI is good at doing. But if AI is being used to complete useless assignments, then this is a problem with the assignments (and the professors) and not AI.

In closing, while AI does not seem to have created the expected tsunami of cheating, schools and professors need to develop rational best practices for handling AI detection. There is also the concern that AI will get better at cheating or that as students grow up with AI, they will be more inclined to use it to cheat. And, of course, it is worth considering whether such use should be considered cheating or if it is time to retire some types of assignments and change our approach to education as, for example, we did when calculators were accepted.