CC BY 4.0 license | Open Access | Published by De Gruyter, October 6, 2023

The Effectiveness of Software Designed to Detect AI-Generated Writing: A Comparison of 16 AI Text Detectors

  • William H. Walters
From the journal Open Information Science

Abstract

This study evaluates the accuracy of 16 publicly available AI text detectors in discriminating between AI-generated and human-generated writing. The evaluated documents include 42 undergraduate essays generated by ChatGPT-3.5, 42 generated by ChatGPT-4, and 42 written by students in a first-year composition course without the use of AI. Each detector’s performance was assessed with regard to its overall accuracy, its accuracy with each type of document, its decisiveness (the relative number of uncertain responses), the number of false positives (human-generated papers designated as AI by the detector), and the number of false negatives (AI-generated papers designated as human). Three detectors – Copyleaks, TurnItIn, and Originality.ai – have high accuracy with all three sets of documents. Although most of the other 13 detectors can distinguish between GPT-3.5 papers and human-generated papers with reasonably high accuracy, they are generally ineffective at distinguishing between GPT-4 papers and those written by undergraduate students. Overall, the detectors that require registration and payment are only slightly more accurate than the others.

1 Introduction

1.1 Generative AI and AI Text Detectors

Despite the great potential of generative artificial intelligence, the use of AI raises problems in situations where performance goals are meant to signal progress toward learning goals – where the completion of a written paper, for instance, is valuable not as an end in itself but as a mechanism for helping students learn how to plan, complete, and edit their written work (Dweck, 1986). Many authors have expressed concern that students are submitting papers generated by ChatGPT and other AI tools as their own original work, thereby attaining the performance goal but bypassing the learning goal. This has implications for teaching, learning, and academic integrity (e.g., Lund et al., 2023; Marche, 2022). Moreover, students’ use of AI is widespread and likely to increase. In a recent survey of 1,000 US university students, 43% reported that they had used ChatGPT or a similar AI tool. Twenty-two percent of all respondents had used AI “to help complete [their] assignments or exams,” and 32% planned to use or continue using AI in their academic work (Welding, 2023). The problem may have become more serious since the release of ChatGPT-4 in March 2023 (OpenAI, 2023a,c).

AI text detectors provide qualitative or quantitative assessments of the likelihood that a particular document was AI generated. They can therefore help instructors determine whether students have used AI to complete their academic work. They can also help students determine whether a particular paper is likely to trigger allegations of academic misconduct. Many AI detectors work by breaking the text down into tokens (words or other common sequences of characters) and estimating, for each token, the probability that it would appear given the tokens that precede it. The texts most likely to be identified as AI generated are those with high predictability and low perplexity – those with relatively few of the random elements and idiosyncrasies that people tend to use in their writing and speech. Some AI text detectors employ other methods (Crothers, Japkowicz, & Viktor, 2023), but methods based on perplexity and related concepts are most often used by the detectors available to the general public.
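
The perplexity-based approach can be made concrete with a short sketch. The code below scores a text with a small public language model; the choice of GPT-2 (via the Hugging Face transformers library) and the flagging threshold are illustrative assumptions, since the detectors evaluated here do not disclose their internal models or cut-offs.

```python
# A minimal sketch of perplexity-based scoring, assuming the Hugging
# Face transformers library and GPT-2 as the scoring model. The actual
# models and thresholds used by commercial detectors are not disclosed.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def perplexity(text: str) -> float:
    """Mean per-token perplexity; low values indicate highly
    predictable text, which detectors tend to flag as AI generated."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        # With labels supplied, the model returns the mean
        # cross-entropy loss over the predicted tokens.
        loss = model(ids, labels=ids).loss
    return torch.exp(loss).item()

# Hypothetical decision rule, for illustration only.
if perplexity("Text of the essay to be checked ...") < 40.0:
    print("Low perplexity: possibly AI generated")
```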

1.2 Previous Evaluations of AI Text Detectors

Quite a few websites and blogs claim to evaluate the accuracy of various AI text detectors (e.g., Abdullahi, 2023; Andrews, 2023; Aw, 2023; Caulfield, 2023; Cemper, 2023; Compilatio.net, 2023; Demers, 2023; Deziel, 2023; Gewirtz, 2023; Ivanov, 2023; Singh, 2023; van Oijen, 2023; Wiggers, 2023; Winston.ai, 2023). Unfortunately, each has significant limitations or biases. Fourteen problems can be readily identified:

  1. The authors or their sponsors have a clear conflict of interest; they provide AI detection software or accept extensive advertising from providers.

  2. The assessment has a strong subjective component, often conflating accuracy with other factors such as convenience or ease of use.

  3. The assessment uses a number of different procedures that are not applied systematically to every detector.

  4. The tests are performed on just a small number of documents.

  5. The report does not specify how the documents were generated or acquired.

  6. AI-generated text is evaluated but human-generated text is not. The assessment can therefore detect false negatives but not false positives.

  7. The documents evaluated are not typical of those submitted by students undertaking academic work.

  8. The human-generated documents are taken from sources (such as websites) that are potentially available to AIs as sources of training documents.

  9. The human-generated documents are written by the investigators themselves. This introduces the potential for conscious or unconscious bias.

  10. The assessment does not consider a representative set of detectors. It may exclude the newest or most widely used detectors, or it may compare one effective detector with several ineffective ones.

  11. The assessment includes only those detectors that do not require registration or payment.

  12. The report does not specify which versions of the detectors were used, or how they were used (e.g., which test options were chosen, and whether the software evaluated entire documents or just portions of them).

  13. The report does not mention the specific responses provided by the software or how those responses were coded as AI generated, human generated, or uncertain.

  14. The results are presented inconsistently, with detailed results for some detectors or documents but not for others.

At least one website presents a more careful assessment (Gillham, 2023). Moreover, recent scholarly investigations have avoided most of the problems mentioned here. Since the release of GPT-3.5, more than a dozen studies have included evaluations of the English-language AI text detectors currently in general use (Aremu, 2023; Cingillioglu, 2023; Desaire, Chua, Isom, Jarosova, & Hua, 2023; Gao et al., 2023; Guo et al., 2023; Khalil & Er, 2023; Krishna, Song, Karpinska, Wieting, & Iyyer, 2023; Liang, Yuksekgonul, Mao, Wu, & Zou, 2023; Pegoraro, Kumari, Fereidooni, & Sadeghi, 2023; Perkins, Roe, Postma, McGaughran, & Hickerson, 2023; Wang, Liu, Xie, & Li, 2023; Weber-Wulff et al., 2023; Yan, Fauss, Hao, & Cui, 2023). Tables 1 and 2 summarize the results of the analyses most similar to this investigation. That is, the tables exclude evaluations of detectors not currently available to the public (e.g., Desaire et al., 2023; Guo et al., 2023; Yan et al., 2023), studies of texts created by nonnative writers of English (Liang et al., 2023), evaluations of computer code and related materials (Wang et al., 2023), analyses in which the AI-generated papers were modified before being submitted to the detectors (Anderson et al., 2023; Krishna et al., 2023; Sadasivan, Kumar, Balasubramanian, Wang, & Feizi, 2023; Weber-Wulff et al., 2023), and reports in which the detectors were not identified by name (Dalalah & Dalalah, 2023).

Table 1

Percentage of ChatGPT texts correctly identified as AI in previous studiesa

Detector Aremu, 2023 Cingillioglu, 2023 Desaire et al., 2023 Gao et al., 2023 Guo et al., 2023 Khalil & Er, 2023 Krishna et al., 2023 b Krishna et al., 2023 c Liang et al., 2023 d Liang et al., 2023 e Pegoraro et al., 2023 Perkins et al., 2023 Wang et al., 2023 c Wang et al., 2023 b Weber-Wulff et al., 2023 Weber-Wulff et al., 2023 f Yan et al., 2023
No. of documents 4 75 120 50 27k 50 31 145 7k 22 15k 25k 18 18 800
ChatGPT version 3.5 3 3.5 3.5 3.5 3.5 3.5 4 3.5 3.5 3.5 3.5 3
ChatGPT 92
Checker AI 13
Compilatio 89 92
Content at Scale Low 38 0 0
Copyleaks 97 23
Crossplag Low 58 37 89 89
DetectGPT 27 67 18 66 63 56 75
Draft and Goal 24
GLTR High 32
GPT-2/RoBERTa 92 High High 7 60 79 94 94 100
GPTZero High 96 7 100 14 27 44 17 78 86
Grover 43
Hello-SimpleAI 47
Hugging Face 11
OpenAI Low 96 30 41 58 41 32 99 74 50 61
Originality.ai 42 59 8
Perplexity 44
PlagiarismCheck 33 47
Quill.org 58 57
RankGen 1
RoBERTa-QA High 68 67
Sapling Low 74 68
TurnItIn 91 94 97
Winston AI 94 94
Writefull 22 28 53
Writer 7 23 17 44 53
ZeroGPT High 100 31 46 83 83

aIncludes only those analyses that evaluated unmodified ChatGPT output. bWikipedia-type articles. cResponses to short questions. dCollege admissions essays. eAbstracts of scientific papers. fHalf credit was assigned for responses that were neither clearly correct nor clearly incorrect.

Table 2

Percentage of human-generated texts correctly identified as human in previous studies

Detector Aremu, 2023 Cingillioglu, 2023 Desaire et al., 2023 Gao et al., 2023 Guo et al., 2023 Liang et al., 2023 a Pegoraro et al., 2023 Wang et al., 2023 b Wang et al., 2023 c Weber-Wulff et al., 2023 Weber-Wulff et al., 2023 d Yan et al., 2023
No. of documents 24 75 60 50 59k 88 6k 15k 25k 9 9 800
Checker AI 95
Compilatio 89 94
Content at Scale 100 80 100 100
Copyleaks 93 92
Crossplag 100 88 100 100
DetectGPT 80 94 65 100 100
Draft and Goal 91
GLTR High 98
GPT-2/RoBERTa 97 High High 96 6 11 100 100 100
GPTZero High 96 100 94 98 97 67 67
Grover 91
Hello-SimpleAI 98
Hugging Face 63
OpenAI High 97 91 92 37 39 100 100
Originality.ai 99 95
Perplexity 98
PlagiarismCheck 78 89
Quill.org 91
RoBERTa-QA High 95 65
Sapling High 95
TurnItIn 100 100
Winston AI 78 83
Writefull 99 100 100
Writer 95 96 93 100 100
ZeroGPT High 100 92 100 100

aEssays by middle school students. bResponses to short questions. cWikipedia-type articles. dHalf credit was assigned for responses that were neither clearly correct nor clearly incorrect.

Together, Tables 1 and 2 suggest that GPT-2/RoBERTa, TurnItIn, and ZeroGPT are the most consistently accurate detectors. Overall, however, the results for the 27 detectors are not consistent across the 29 analyses. There are at least three reasons for this. First, three different versions of ChatGPT were used to generate the AI documents. Most of the investigations used GPT-3.5, but at least two used GPT-3 and at least one used GPT-4. Second, the documents themselves are of various types. Seventeen analyses evaluated undergraduate essays or responses to short, straightforward questions, but the others used a variety of texts including abstracts of scientific papers (Gao et al., 2023; Liang et al., 2023), college admissions essays (Liang et al., 2023), essays by middle school students (Liang et al., 2023), examination papers (Yan et al., 2023), overview articles in scientific journals (Desaire et al., 2023), and Wikipedia-type articles (Krishna et al., 2023; Wang et al., 2023). Finally, each research team interpreted the detector output differently, adopting either rigorous or lenient standards for the identification of AI- and human-generated text. This at least partly explains why some detectors performed well in certain studies but not nearly as well in others.

2 Methods

This study evaluates the accuracy of 16 publicly available AI text detectors using three sets of documents: 42 undergraduate essays generated by ChatGPT-3.5, 42 generated by ChatGPT-4, and 42 written without the use of AI by students in a first-year composition course. Each detector’s performance was assessed with regard to its overall accuracy across all 126 documents, its accuracy when tested against each of the three sets of documents, its decisiveness (the relative number of uncertain responses), the number of false positives (human-generated papers designated as AI by the detector), and the number of false negatives (AI-generated papers designated as human by the detector). The analysis involved four steps:

  1. Prepare the three sets of documents.

  2. Select the 16 AI text detectors to include in the study.

  3. Use each detector to evaluate each of the 126 documents, coding the responses as AI, human, or uncertain.

  4. Evaluate the accuracy of each detector – its effectiveness in identifying AI-generated and human-generated text.

2.1 Preparing the 126 Documents

GPT-3.5 and GPT-4 were each used to generate 42 short papers (literature reviews) of the kind typically expected of students in first-year composition courses at US universities. The 42 paper topics cover the social sciences, the natural sciences, and the humanities (Appendix 1). A new chat/conversation was initiated for each paper topic, and each topic was embedded within a ChatGPT prompt of the type recommended by Atlas (2023). The same introductory text was used in each case: “I want you to act as an academic researcher. Your task is to write a paper of approximately 2000 words with parenthetical citations and a bibliography that includes at least 5 scholarly resources such as journal articles and scholarly books. The paper should respond to this question: ‘[paper topic].’” Because the ChatGPT response field is limited in length, the system’s initial response to each prompt was never a complete paper. An additional prompt of “Please continue” was used, sometimes more than once, to get ChatGPT to continue the text exactly where it had left off.[1] All the AI texts were generated in the first week of April 2023.
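
The study used the ChatGPT web interface, but the procedure can be expressed programmatically. The sketch below assumes the openai Python package (pre-1.0 API) and a cap of five continuation rounds; both are assumptions introduced here for illustration, not part of the study's method.

```python
# Sketch of a programmatic equivalent of the generation procedure. The
# study itself used the ChatGPT web interface, so the openai package
# (pre-1.0 API) and the five-round cap are assumptions.
import openai  # reads the OPENAI_API_KEY environment variable

PROMPT = (
    "I want you to act as an academic researcher. Your task is to write "
    "a paper of approximately 2000 words with parenthetical citations and "
    "a bibliography that includes at least 5 scholarly resources such as "
    "journal articles and scholarly books. The paper should respond to "
    "this question: '{topic}'"
)

def generate_paper(topic: str, model: str = "gpt-4") -> str:
    # A new conversation is started for each topic, as in the study.
    messages = [{"role": "user", "content": PROMPT.format(topic=topic)}]
    paper = ""
    for _ in range(5):  # cap on "Please continue" rounds (assumption)
        reply = openai.ChatCompletion.create(model=model, messages=messages)
        choice = reply.choices[0]
        paper += choice.message.content
        if choice.finish_reason != "length":  # response was not cut off
            break
        messages.append({"role": "assistant", "content": choice.message.content})
        messages.append({"role": "user", "content": "Please continue"})
    return paper
```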

The 42 human-generated documents were taken from a set of 178 papers submitted by Manhattan College English 110 (First Year Composition) students during the 2014–2015 academic year. The use of papers from 2014 to 2015, before the widespread availability of AI tools such as ChatGPT, ensures that these papers were created without the use of AI. Although the English 110 papers do not cover the exact same topics as the AI-generated papers, they are quite similar; they cover topics such as gun control, racism in the US education system, policy responses to climate change, robotic warfare, family structure in traditional folk tales, e-cigarettes and public health, the ethical implications of the death penalty, concussion in the National Hockey League, and 3D printing technology. Stratified random sampling was used to select a set of papers with the same broad subject representation as the ChatGPT documents: 25 papers in the social sciences, 9 in the natural sciences, and 8 in the humanities.
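
A minimal sketch of this sampling step appears below. The mapping `papers_by_subject` (from subject area to the candidate papers in that stratum) is hypothetical, and the fixed seed is included only for reproducibility; neither is documented in the study.

```python
# Sketch of the stratified random sampling step: 25 social science,
# 9 natural science, and 8 humanities papers are drawn from the pool
# of 178 submissions. `papers_by_subject` is a hypothetical mapping
# from subject area to the candidate papers in that stratum.
import random

QUOTAS = {"social sciences": 25, "natural sciences": 9, "humanities": 8}

def stratified_sample(papers_by_subject: dict, seed: int = 0) -> list:
    rng = random.Random(seed)  # fixed seed for reproducibility (assumption)
    sample = []
    for subject, quota in QUOTAS.items():
        sample.extend(rng.sample(papers_by_subject[subject], quota))
    return sample
```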

2.2 Selecting the 16 AI Text Detectors

Although dozens of AI text detectors are available online, just 10 appear on two or more of five recent “best AI text detector” lists (Abdullahi, 2023; Caulfield, 2023; Ganesh, 2023; Somoye, 2023; Wiggers, 2023): Content at Scale (2023), Copyleaks (2023), Crossplag (2023), GPT Radar (2023), GPTZero (2023), OpenAI (2023b),[2] Originality.ai (2023), Sapling (2023), Writer (2023), and ZeroGPT (2023). This study evaluates those 10 AI text detectors, along with TurnItIn and 5 others (Table 3).

Table 3

Characteristics of the 16 AI text detectors

Detector Payment Limits on use Input Min. length Max. length Longer docs.
Content at Scale Not required None Text box 4 wds. 25,000 chars. Truncates
ContentDetector.ai Not required None Text box 2 wds. ∼15,000 wds. Will not process
Copyleaksa Free: up to 45,000 wds. per day; thereafter: when billed monthly, $0.28 to $0.44 per thousand wds. Without registration: 6,250 wds. per day; with registration but without payment: 45,000 wds. per day; with registration and payment: depends on amount paid Free: text box; subscribers: text box or upload 150 chars. Free: 25,000 chars.; subscribers: 500,000 wds. Will not process
Crossplag Free, but registration is required for full functionality None Text box 2 wds. 3,000 wds. Truncates
Grammica Not required None Text box 2 wds. ∼380 wds. Truncates
GPT Radar Free: up to ∼2,500 wds. per day; thereafter: ∼$0.02 per 125 wds. Depends on amount paid Text box ∼75 wds. ∼1,400 wds. – lower than the stated limit Will not process
GPTZerob Classic: not required; Educator (more effective): $9.99 per month; Pro (most effective): $19.99 per month Classic: Limits not stated; Educator: 1 million wds. per month; Pro: 2 million wds. per month Text box or upload 250 chars. Classic: 5,000 chars.; Educator: 50,000 chars.; Pro: 50,000 chars. Text box: will not process; upload: truncates
IvyPanda Free, but registration is required None Text box 2 wds. 4,500 chars. Truncates
OpenAI Free, but registration is required None Text box 1,000 chars. ∼3,000 wds. Will not process
Originality.aic $0.01 per 100 wds. Depends on amount paid Text box 50 wds. 10,000 wds. Will not process
Sapling Free version has limited functionality; subscription: $25 per month, but the system may offer a free 1-month trial None Text box ∼150 chars. Free: ∼2,000 chars.; paid: ∼8,000 chars. Truncates
Scribbr Not required None Text box 25 wds. 500 wds. Will not process
SEO.ai Not required None Text box 2 wds. 5,000 chars. Truncates
TurnItIn Institutional subscription required None Upload 20 wds. 800 pages Will not process
Writer Not required None Text box 2 wds. 1,500 chars. Will not process
ZeroGPT Not required None Text box or upload 2 wds. 50,000 chars. Will not process

aFree interface: https://copyleaks.com/ai-content-detector; subscriber interface: https://app.copyleaks.com/dashboard/v1/account/new-scan. bThis study presents the Pro results; the Educator results are identical except that one GPT-3.5 paper classified as uncertain by Educator is classified as AI by Pro. cThis study uses detection model 1.4 rather than 1.1.

TurnItIn (2023) was added to the study due to its widespread availability at colleges and universities in the United States and elsewhere. Instructors at institutions with subscriptions to the TurnItIn plagiarism detector also have access to the AI text detector, unless their universities have chosen not to make it available.[3]

The five other AI text detectors included in the study – ContentDetector.ai (2023), Grammica (2023), IvyPanda (2023), Scribbr (2023), and SEO.ai (2023) – are promoted widely online, do not require registration or payment, and do not appear on any of the five “best detector” lists. Arguably, these detectors are typical of the tools students might use to conduct a quick check of their papers for evidence of AI involvement. A Google search for “free AI text detector” was conducted, and the first five detectors that met the criteria (and that worked reliably for the set of 126 documents) were included in the study. Some of them are clearly intended for students who want to use AI without getting caught, and the IvyPanda site includes advertisements for a paper-writing service (“Our experts can complete a task on any subject based on your instructions – without any AI! To ensure that your paper is 100% human-written and plagiarism-free, place an order here.”).

2.3 Evaluating the Documents and Coding the Responses

Each of the 126 documents was stripped of any introductory material (e.g., course and author information), tables, figures, and lists of works cited, then entered into each of the 16 AI text detectors in plain-text format. Documents longer than the maximum allowable length (Table 3) were truncated. The detector tests were conducted from June 25 through July 12, 2023.
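
Because some detectors count characters and others count words (Table 3), the truncation step can be sketched as follows; the unit handling shown here is an assumption made for illustration.

```python
# Sketch of the truncation step applied to documents that exceed a
# detector's maximum length. Limits are expressed in characters for
# some detectors and in words for others (see Table 3).
def truncate(text: str, limit: int, unit: str = "chars") -> str:
    if unit == "chars":
        return text[:limit]
    return " ".join(text.split()[:limit])

# For example, Writer accepts at most 1,500 characters per submission:
# prepared = truncate(document_text, 1500, "chars")
```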

As Appendix 2 reveals, each detector’s output is unique. The responses used by the detectors to characterize the documents vary in five important respects:

  1. whether they include descriptive text, numeric values, or both

  2. whether the wording of the text is formal or casual

  3. whether the assessments suggest a high degree of confidence (“this text is AI generated”) or greater ambiguity (“parts of the text may show evidence of AI involvement”)

  4. whether the numeric scores represent the proportion of the text that is AI generated, the detector’s level of confidence in the result, or something else

  5. whether there are just a few possible responses or many.

Each of the 2,016 responses was coded as AI generated, human generated, or uncertain. (AI generated indicates that a significant portion of the text – not necessarily all of it – is likely to be AI generated.) For responses that included both descriptive text and a numeric component, the descriptive text (e.g., “likely AI generated”) was regarded as definitive. For the strictly numeric results provided by Grammica, Originality.ai, Sapling, Scribbr, and TurnItIn, each response was categorized as AI, human, or uncertain based on three factors: the meaning of the numeric value, the natural breaks in the frequency distribution, and the general principle that roughly twice as many responses should be included in the AI category as in the human category (since 84 of the 126 documents were AI generated and 42 were human generated).
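
For a strictly numeric detector, this coding rule can be sketched as a pair of cut points. The thresholds below mirror the natural breaks observed for Scribbr in Appendix 2 (responses of 31% AI and above coded as AI, 7% and below as human); the cut points for the other detectors differed, so these values are illustrative rather than general.

```python
# Illustrative coding of a numeric "% AI" score into the three response
# categories. The cut points mirror the natural breaks observed for
# Scribbr (Appendix 2); other detectors required different thresholds.
def code_response(pct_ai: float, ai_cut: float = 31.0,
                  human_cut: float = 7.0) -> str:
    if pct_ai >= ai_cut:
        return "AI"
    if pct_ai <= human_cut:
        return "human"
    return "uncertain"
```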

Although just one individual coded the responses, the distinctions among the AI, uncertain, and human categories were generally quite clear. (Appendix 2 shows the responses generated by the AI text detectors and the number of times each response was given.) The only difficulty occurred with Sapling, for which the breaks in the frequency distribution were not always pronounced. Overall, the classifications used here are very similar to those adopted by Weber-Wulff et al. (2023).

3 Results and Discussion

3.1 Accuracy of the 16 AI Text Detectors

Two of the 16 detectors, Copyleaks and TurnItIn, correctly identified the AI- or human-generated status of all 126 documents, with no incorrect or uncertain responses. As noted in Section 2.2, however, it is possible that TurnItIn performs especially well with the human-generated papers used in this particular analysis. A third detector, Originality.ai, performed nearly as well, correctly assessing the status of all but two documents – human-generated papers that it could not classify with certainty (Table 4 and Figure 1).

Table 4

Percentage of documents for which each detector gave correct or incorrect responsesa

Detector | All papers: % correct, % incorrect, % uncertain | AI papers: % correct, % incorrect | GPT-3.5 papers: % correct, % incorrect | GPT-4 papers: % correct, % incorrect | Human papers: % correct, % incorrect
Copyleaksb 100 0 0 100 0 100 0 100 0 100 0
TurnItIn 100 0 0 100 0 100 0 100 0 100 0
Originality.aib 98 0 2 100 0 100 0 100 0 95 0
Scribbr 88 11 1 85 15 100 0 69 31 95 2
ZeroGPTb 87 1 12 92 0 100 0 83 0 79 2
Grammica 86 11 3 81 17 100 0 62 33 95 0
GPTZerob 81 4 15 77 5 98 0 57 10 88 2
Crossplagb 80 20 0 77 23 86 14 69 31 86 14
OpenAIb 78 6 17 69 8 98 2 40 14 95 0
IvyPanda 77 0 23 71 0 100 0 43 0 88 0
GPT Radarb 76 24 0 64 36 98 2 31 69 100 0
SEO.ai 72 4 24 92 0 100 0 83 0 33 12
Content at Scaleb 71 13 15 63 15 74 2 52 29 88 10
Writerb 71 29 0 64 36 88 12 40 60 86 14
Saplingb 65 7 28 63 11 93 0 33 21 69 0
ContentDetector.ai 63 10 27 45 14 83 0 7 29 100 0
Avg. percentage 81 9 10 78 11 95 2 61 20 87 4
Standard deviation 12 9 11 16 12 8 4 28 22 17 5
Median percentage 79 7 8 77 10 99 0 60 18 92 0

aIn each case, the percentage uncertain is the percentage neither correct nor incorrect. bAppears on at least two of the “best AI text detector” websites.

Figure 1: Percentage of all 126 documents for which each detector gave correct, uncertain, or incorrect responses.

Among the other 13 detectors, overall accuracy ranges from 63 to 88%. The distribution of percentage correct follows a fairly smooth progression, although three distinct groups can be identified: the top 3 detectors, the next 11, and the bottom 2 – Sapling and ContentDetector.ai.

All the detectors except Content at Scale and ContentDetector.ai are able to identify the GPT-3.5 documents as AI generated at least 86% of the time, and seven perform flawlessly with this particular set of documents (Figure 2). Likewise, all but three – ZeroGPT, SEO.ai, and Sapling – are effective at identifying human-generated text (Figure 3). However, only the top three detectors can correctly classify GPT-4 documents with greater than 83% accuracy; the rest tend to classify those documents as human or uncertain (Figure 4). Arguably, this is the most important distinction between the top 3 detectors and the other 13.

Figure 2: Percentage of the 42 GPT-3.5 documents for which each detector gave correct, uncertain, or incorrect responses.

Figure 3: Percentage of the 42 human-generated documents for which each detector gave correct, uncertain, or incorrect responses.

Figure 4: Percentage of the 42 GPT-4 documents for which each detector gave correct, uncertain, or incorrect responses.

3.2 Correlates of Accuracy

As noted in Section 2.2, 10 of the 16 detectors were initially identified through online “best detector” lists. Overall, the detectors that appear on these lists are only marginally more accurate than the others – 81% correct versus 77%. For the set of all detectors other than TurnItIn, there is no meaningful correlation between the accuracy of a detector and its appearance on the “best detector” lists; Kendall’s tau-b = 0.08.

In general, the accuracy of a detector is only modestly associated with its paid or free status. While all three of the most accurate detectors require registration and payment for full functionality, the three others that require payment – GPTZero, GPT Radar, and Sapling – have just average or below-average accuracy. Among the six detectors that require a subscription, the average accuracy is 87%; among the others, it is 77%. Overall, the correlation between the accuracy of a detector and its paid or free status is weak; Kendall’s tau-b = 0.29.
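
This statistic can be recomputed from the published tables. The sketch below takes each detector's overall percentage correct from Table 4 and its paid or free status from Table 3, assuming SciPy is available; it reproduces the reported value of 0.29.

```python
# Recomputing Kendall's tau-b for accuracy vs. paid status, assuming
# SciPy. Accuracy figures are from Table 4 (detectors in the same
# order); paid = 1 for the six detectors requiring a subscription.
from scipy.stats import kendalltau

accuracy = [100, 100, 98, 88, 87, 86, 81, 80, 78, 77, 76, 72, 71, 71, 65, 63]
paid     = [  1,   1,  1,  0,  0,  0,  1,  0,  0,  0,  1,  0,  0,  0,  1,  0]

tau, p = kendalltau(accuracy, paid)
print(f"Kendall's tau-b = {tau:.2f}")  # 0.29, as reported in Section 3.2
```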

3.3 Key Similarities and Differences Among the 16 AI Text Detectors

Table 5 highlights the characteristics that set each detector apart from the others. Copyleaks, TurnItIn, and Originality.ai are similar in many respects. Likewise, ZeroGPT and GPTZero are much the same, as are Sapling and ContentDetector.ai.

Table 5

Effectiveness of the 16 AI text detectors

Detector Overall accuracy Accuracy, GPT-3.5 Accuracy, GPT-4 Decisiveness False positives False negatives
Copyleaksa V. high V. high V. high High
TurnItIn V. high V. high V. high High
Originality.aia V. high V. high V. high High
Scribbr High V. high High
ZeroGPTa High V. high
Grammica High V. high Low High
GPTZeroa High V. high
Crossplaga High Many Many
OpenAIa V. high Low
IvyPanda V. high Low Low
GPT Radara V. high Low High Many
SEO.ai V. high Low Many
Content at Scalea Low Many
Writera Low High Many Many
Saplinga Low Low Low
ContentDetector.ai Low Low Low

aAppears on at least two of the “best AI text detector” websites.

The three accuracy columns in Table 5 are based not just on percentage correct, but on percentage incorrect and the ratio of correct to incorrect responses. For example, GPTZero has a high accuracy designation while Crossplag does not – but this cannot be attributed to the one-point difference in their accuracy rates. Instead, it reflects the fact that GPTZero has a lower rate of incorrect responses. When the type of document is unclear, GPTZero generally gives a response of uncertain. In contrast, Crossplag is more likely to label AI text as human and vice versa.

As described in Section 3.1, many detectors are effective at identifying GPT-3.5 text but ineffective at identifying GPT-4 text. This same result can be seen when percentage incorrect is taken into account. In particular, four detectors have excellent performance with regard to GPT-3.5 but very poor performance with regard to GPT-4. GPT Radar is perhaps the best example of this, with correct responses for 98% of the GPT-3.5 documents but for just 31% of the GPT-4 documents – worse than might be expected due to chance alone.

The decisiveness column represents the percentage of documents for which each detector gave responses of AI or human rather than uncertain. The high decisiveness label was assigned to detectors with uncertainty rates lower than 4% and the low label to those with uncertainty rates higher than 22%.

The false positives column identifies the detectors that are especially likely to respond AI when evaluating papers written by humans. The four detectors labeled many each have false positive rates of 10–14%. In contrast, the other detectors each have no more than a single false positive within the set of 42 human-generated documents.

Likewise, the false negatives column identifies the detectors that are especially likely to respond human for papers that were actually produced by an AI. Crossplag, GPT Radar, and Writer each have false negative rates of 23–36%, while the other detectors have a maximum rate of 17% and a mean of 6.5%.

The many false positives for SEO.ai and Content at Scale reflect their general tendency to declare that text is AI rather than human. Likewise, the many false negatives for GPT Radar reflect its tendency to label text as human rather than AI. The situation is different for Crossplag and Writer, however. Those two detectors have many false positives and many false negatives due to a combination of relative inaccuracy and high decisiveness. Overall, the more accurate detectors tend to be more decisive – the correlation between percentage correct and percentage uncertain is −0.68 – but Crossplag and Writer are exceptions to that general relationship.
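
The measures described in this subsection can be expressed compactly, as in the sketch below. The function assumes each detector's responses have already been coded as AI, human, or uncertain (Section 2.3); the data structure is hypothetical.

```python
# Sketch of the summary measures behind Table 5. `pairs` is a
# hypothetical list of (true_label, coded_response) tuples for one
# detector across all 126 documents, with true labels "AI" and "human".
def summarize(pairs):
    n = len(pairs)
    pct_correct = 100 * sum(r == t for t, r in pairs) / n
    pct_uncertain = 100 * sum(r == "uncertain" for _, r in pairs) / n
    decisiveness = 100 - pct_uncertain  # share of decisive responses
    human_docs = [r for t, r in pairs if t == "human"]
    ai_docs = [r for t, r in pairs if t == "AI"]
    false_pos = 100 * sum(r == "AI" for r in human_docs) / len(human_docs)
    false_neg = 100 * sum(r == "human" for r in ai_docs) / len(ai_docs)
    return pct_correct, decisiveness, false_pos, false_neg
```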

4 Conclusion

4.1 Main Findings

The results of this study support three main conclusions:

  1. Three AI text detectors – Copyleaks, TurnItIn, and Originality.ai – have very high accuracy with all three sets of documents examined for this study: GPT-3.5 papers, GPT-4 papers, and human-generated papers.

  2. Most of the other detectors can distinguish between GPT-3.5 papers and human-generated papers with reasonably high accuracy. However, most are ineffective at distinguishing between GPT-4 papers and papers written by students.

  3. In general, a detector’s free or paid status is not a good indicator of its accuracy, nor is its appearance on the “best AI text detector” lists considered here.

Several recent articles in the popular press have asserted that AI-generated text is almost impossible to identify (Heikkilä, 2023; Maruccia, 2023; Mujezinovic, 2023; Wiggers, 2023; Williams, 2023), and it is true that most detectors perform poorly with GPT-4 documents. However, these results also suggest that technological improvements in publicly available AI text generators are matched very quickly by improvements in the capabilities of the best AI text detectors. The release of GPT-4 in March 2023 may have given AI users a temporary ability to pass off AI text as human-authored – but less than 4 months later, the three most effective AI text detectors perform just as well with GPT-4 documents as with GPT-3.5 documents.

4.2 Previous Research and New Results

Previous research suggests that TurnItIn, ZeroGPT, and GPT-2/RoBERTa are among the more accurate AI text detectors (Tables 1 and 2). These results support those earlier findings with regard to TurnItIn and ZeroGPT.

Of the top three detectors identified in this investigation, TurnItIn achieved very high accuracy in all five previous evaluations. Copyleaks, included in four earlier analyses, performed very well in three of them. The prior results for Originality.ai are mixed, suggesting that it classifies human-generated documents accurately but has difficulty with AI-generated text. In this analysis, no such difficulty can be seen (Tables 4 and 5). As noted in Section 1.2, previous studies have used a wide range of methods that do not always generate comparable results. Consequently, comparative analyses such as this one are especially important.

4.3 Implications

Many authors have called for the modification of traditional undergraduate essays and written assignments in ways that circumvent the capabilities of generative AI (e.g., Baidoo-Anu & Owusu Ansah, 2023; Golinkoff & Wilson, 2023; Marche, 2022; Rigolino, 2023; Tate, 2023). At the most superficial level, this involves changes in assessment methods – a greater reliance on in-class exams and interactive presentations, for instance. At a deeper level, it involves a greater emphasis on the kinds of capabilities that are unique to humans, such as the generation and refinement of ideas rather than texts. AI tools can also be incorporated into teaching, helping students learn how to edit, how to evaluate subtle differences in style and content, how to determine whether an assertion is supported by evidence, and how to use AI effectively. Even in circumstances where the use of AI is accepted or required, however, there is still a need to determine the extent of AI involvement.

When students are not expected to use AI, false positives can lead to unwarranted accusations of misconduct while false negatives may allow violations of academic integrity to go undetected. For this reason, the detectors with high false positive or false negative rates (Table 5) should be avoided. If we also exclude the detectors that are generally ineffective in detecting GPT-4 text, just a few detectors – essentially, the top three – remain as viable candidates for use in the academic environment.

Local and individual factors are likely to influence the ways in which AI text detectors are used and perceived. Some faculty may be inclined to accept their results uncritically, without further investigation or consideration of the context. At the same time, other faculty may reject the use of detectors in favor of less systematic, intuitive judgments. It is probably best to adopt a moderate approach – to consider the results provided by AI text detectors, to account for other evidence as well, and to acknowledge that some detectors are far more effective (or ineffective) than others. Assessments of students’ work should also consider the specific parts of the text for which AI involvement was detected. Fortunately, 10 of the detectors evaluated here – all but Crossplag, Grammica, OpenAI, Scribbr, SEO.ai, and Writer – provide separate assessments or scores for particular phrases, sentences, or paragraphs within each document.

4.4 Limitations and Further Research

Because this investigation used student papers that could potentially have been used to train the TurnItIn detector, TurnItIn may be especially accurate for the particular human-generated texts evaluated here. As noted in Section 2.2, however, this is unlikely to have had a major impact on the results. More generally, this analysis is based on a set of 126 undergraduate composition papers (literature reviews), so the results may not be generalizable to other kinds of documents. The most significant limitation of the study, however, is that it does not account for the fact that users of ChatGPT are likely to paraphrase or otherwise modify AI-generated texts rather than simply submitting them, unaltered, as their own academic work (Welding, 2023). It is important to know how well these detectors perform with unaltered ChatGPT text, but a more realistic assessment would also evaluate their effectiveness in identifying documents that have been generated by AI, then modified by users.

This is just the second study to evaluate the effectiveness of publicly available AI text detectors in identifying documents generated by ChatGPT-4. (Perkins et al., 2023, was the first.) Additional analyses of GPT-4 documents are needed. Moreover, this investigation and other recent studies suggest several questions for further research:

  1. How well do AI text detectors evaluate documents that are partly AI generated and partly human generated? Are the assessments provided by the detectors (e.g., “30% AI”) accurate, and does their accuracy vary with the proportion of AI-generated text?

  2. What paraphrasing strategies are most effective at thwarting AI text detectors? For instance, is it better to replace words with less common synonyms, to change the order of clauses, or to introduce idiosyncratic phrases? Several studies have shown that paraphrasing can alter AI-generated texts to make them less susceptible to detection (Anderson et al., 2023; Krishna et al., 2023; Sadasivan et al., 2023; Weber-Wulff et al., 2023), but none have evaluated the effectiveness of the various paraphrasing techniques.

  3. How do students actually modify AI-generated or AI-assisted texts when completing their assignments? Are those modifications effective at rendering AI involvement undetectable?

Finally, there is a need to investigate potential biases in the performance of AI text detectors. Liang et al. (2023) have demonstrated that texts written by nonnative speakers of English are far more likely than those of native speakers to generate false positive responses. It would be helpful to know whether this bias is widespread or whether it is restricted to particular types of authors or documents.

Acknowledgments

I am grateful for the comments of Esther Isabelle Wilder and two anonymous referees.

  1. Funding information: No funding was involved.

  2. Conflict of interest: The author states no conflict of interest.

  3. Data availability statement: The texts generated by GPT-3.5 and GPT-4 in response to the 42 prompts are available from the author on request, as are the results (responses) generated by the 16 AI text detectors for each of the 126 documents.

Appendix 1 Topics of the ChatGPT Papers

Although most of the paper topics were suggested by personal experience with students and their written work (Walters et al., 2020), about two dozen websites were consulted for additional ideas. Topics 24, 33, and 42 are similar to those suggested by Paperell.net (2023). Topics 19, 22, and 37 are similar to those suggested by Sarikas (2020), Allison (2023), and Kearney (2022), respectively. Topics 1–8 are in the humanities, 9–33 in the social sciences, and 34–42 in the natural sciences:

  1. Why was Stonehenge built? What are the most likely explanations, and what evidence supports or challenges each of them?

  2. What were the causes of the Second Boer War (1899 to 1902)? What did the British Empire, the South African Republic, and the Orange Free State each hope to achieve?

  3. What major nineteenth-century literary works received initially negative reviews but are now regarded as key contributions to literature? What accounts for the changing opinions of these works?

  4. What studies best demonstrate how quantitative methods can be applied to the analysis of English-language literary works?

  5. When did unicorns first appear in literature? How has the depiction of unicorns and their characteristics changed over time?

  6. Will languages other than English gain importance over time as languages of scientific discourse?

  7. What accounts for the dominance of American and British songwriters and musicians in twentieth- and twenty-first-century popular music? Why did no other countries’ artists have a similar impact?

  8. What are the historical origins of the religious concept of purgatory? Who put forth the concept of purgatory? Was it accepted initially? When and how did it assume its place within Catholic theology?

  9. Among retired Americans and those approaching retirement, are there distinct types of migration or geographic mobility (distinct groups of migrants)? What are the distinctive characteristics of each type or group?

  10. What were the unintended effects of China’s one-child policy? How have the Chinese government and the Chinese people responded to them?

  11. How have ride-sharing services such as Uber and Lyft influenced overall employment in the taxi and ride-sharing industry? How have they influenced wages?

  12. In the present-day United States, what are the most effective strategies by which wealthy individuals can minimize their income tax payments?

  13. What are the long-term economic and political impacts of the global shortages of copper, lithium, nickel, and cobalt?

  14. What is the best way to determine the impact of Brexit on the UK economy?

  15. Why did the US government first institute minimum wage laws? What were they hoping to achieve?

  16. Among American college students, to what extent do self-reported assessments of ability represent self-efficacy rather than ability?

  17. Can synchronous demonstrations, delivered online, be just as effective as in-person lab instruction for undergraduate biology courses?

  18. Can the educational success of US charter schools at the high school (secondary) level be attributed to factors other than the socioeconomic characteristics of their students?

  19. Do students who get free meals in grades P–5 do better academically than students of similar backgrounds who do not get free meals?

  20. Is there evidence to support the idea that high school math teachers who struggled with math can be more effective than those for whom math came easily?

  21. To what extent are university students’ evaluations of their instructors related to the difficulty of the course? What is the best way to overcome any bias related to the link between teaching evaluations and course difficulty?

  22. What are the advantages and disadvantages of taking a “gap year” of employment or volunteer work between high school and college – for individuals and for society?

  23. Are there systematic differences in the organizational leadership styles of men and women? To what extent are they unique to either women or men?

  24. Who were the most successful businesswomen of the twentieth century?

  25. Internationally, how have Patrick S. Atiyah’s “Accidents, Compensation and the Law” and “The Damages Lottery” influenced legal education, practice, and theory?

  26. What are the military missions or situations for which aerial drones have proven most successful? In what areas do they have the greatest unmet potential?

  27. What occupations are most likely to disappear entirely over the next 20 years?

  28. In the United States, what safety-related innovations (devices, policies, or procedures) were once mandated by law or regulation but later abandoned? Why were they abandoned? On what grounds should safety-related innovations be evaluated?

  29. Do the fans at a football stadium influence the outcome of the game? Can we isolate the impact of the fans’ behavior from the impact of having home-field advantage (and more fans in the stadium)? [Both GPT-3.5 and GPT-4 interpreted this question in terms of association football (soccer) rather than American football.]

  30. Across nations, what is the influence of gun control legislation on rates of gun-related homicide, suicide, and accidental death? What factors make these comparisons potentially difficult?

  31. Are adolescents who play violent video games especially likely to commit acts of violence? Do violent video games have other negative (or positive) psychological effects?

  32. In terms of recruiting, training, and managing personnel, what are the most effective methods of preventing police violence against the public (“police brutality”)?

  33. What percentage of political assassination attempts are successful? What evidence can be used to address this question?

  34. At the individual level, what is the impact of professional dental care on morbidity and mortality risk?

  35. How harmful are e-cigarettes to the health of those who use them, relative to conventional cigarettes?

  36. To what extent do sleep disorders influence the productivity of the American labor force?

  37. Can cloning or similar methods be used to bring back extinct plant species? Extinct animal species?

  38. What strategies have proven most effective as methods of stabilizing and increasing the orangutan population?

  39. To what extent can global climate change be attributed to ruminant grazing and dairy farming?

  40. What is the best way to gauge the environmental impact of a large-scale switch to electric vehicles for private passenger transportation in the United States? Account for the impact of the vehicles themselves as well as the need to generate electricity from sources such as natural gas, coal, nuclear, wind, and hydropower.

  41. Which island nations and coastal nations will be most affected by climate change? What steps are they taking to prepare?

  42. How are molten salt reactors different from conventional nuclear fission reactors? What are their unique advantages and disadvantages? In what ways are they more or less safe than conventional fission reactors?

Appendix 2 Responses Provided by the AI Text Detectors

The numbers in the n columns indicate the number of documents in each response category across all three document types – GPT-3.5, GPT-4, and human generated.

Content at Scalea

Response n
Responses counted as AI 57
 Highly likely to be AI generated! (10–29% human) 8
 Likely to be AI generated! (33–44% human) 17
 Likely both AI and human! (60–79% human) 32
Responses counted as uncertain 19
 Unclear if it is AI content! (45–58% human) 19
Responses counted as human 50
 Highly likely to be human! (80–100% human) 50

aThe descriptive text appears to be based primarily on the detector’s confidence in the assessment, while the numeric results appear to reflect the percentage of the text that is AI.

ContentDetector.aia

Response n
Responses counted as AI 38
 Likely AI content (How artificial is your content: 67–82%) 38
Responses counted as uncertain 34
 Unclear (How artificial is your content: 50–67%) 34
Responses counted as human 54
 Likely human content (How artificial is your content: 16–50%) 54

aThese results indicate the detector’s confidence in the assessment – not the percentage of the text that is AI.

Copyleaksa

Response n
Responses counted as AI 84
 Suspected cheating: AI text detected. Very high. We are unable to verify that the text was written by a human 84
Responses counted as uncertain 0
 (None) 0
Responses counted as human 42
 (No AI-related alerts associated with the text) 42

aCopyleaks provides an overall descriptive assessment for the entire document along with statements such as “93.3% probability for human” or “94.8% probability for AI” for particular parts of the document. Those numeric values indicate the detector’s confidence in the assessment – not the percentage of the text that is AI. Moreover, the percentages reported by Copyleaks are not actual probabilities, since “30% probability for human” does not mean “70% probability for AI.” It simply means “This text is probably human generated, and our confidence in that assessment is 30 on a scale from 1 to 100.”

Crossplaga

Response n
Responses counted as AI 71
 This text is mainly written by an AI (No % score) 8
 This text is mainly written by an AI (67–100% AI) 59
 This text is co-written by both a human and an AI (50% AI) 4
Responses counted as uncertain 0
 (None) 0
Responses counted as human 55
 This text is mainly written by a human (0–6% AI) 55

aThese results indicate the detector’s confidence in the assessment – not the percentage of the text that is AI.

GPT Radara

Response n
Responses counted as AI 54
 Likely AI generated (52–84% accuracy) 54
Responses counted as uncertain 0
 (None) 0
Responses counted as human 72
 Likely human generated (57–83% accuracy) 72

aThese results appear to indicate the detector’s confidence in the assessment – not the percentage of the text that is AI.

GPTZero

Response n
Responses counted as AI 66
 Your text is likely to be written entirely by AI 58
 Your text is has [sic] a moderate likelihood of being written by AI 8
Responses counted as uncertain 19
 Your text may include parts written by AI 19
Responses counted as human 41
 Your text is most likely human written but there are some sentences with low perplexities 9
 Your text is likely to be written entirely by a human 32

Grammicaa

Response n
Responses counted as AI 68
 100% AI 49
 91–99% AI 10
 81–88% AI 3
 50–62% AI 5
 39% AI 1
Responses counted as uncertain 4
 25–29% AI 3
 17% AI 1
Responses counted as human 54
 1–8% AI 11
 0% AI 43

aThese results indicate the percentage of the text that is AI – not the detector’s confidence in the assessment.

IvyPandaa

Response n
Responses counted as AI 60
 High risk 52
 Relatively high risk 8
Responses counted as uncertain 29
 Medium risk 29
Responses counted as human 37
 Relatively low risk 37

aThese results indicate the detector’s confidence in the assessment – not the percentage of the text that is AI.

OpenAI

Response n
Responses counted as AI 58
 Likely AI generated 32
 Possibly AI generated 26
Responses counted as uncertain 21
 Unclear if it is AI generated 21
Responses counted as human 47
 Unlikely AI generated 5
 Very unlikely AI generated 42

Originality.aia

Response n
Responses counted as AI 84
 100% AI 80
 98–99% AI 3
 70% AI 1
Responses counted as uncertain 2
 33–34% AI 2
Responses counted as human 40
 15–25% AI 4
 0–7% AI 36

aThese results indicate the detector’s confidence in the assessment – not the percentage of the text that is AI.

Saplinga

Response n
Responses counted as AI 53
 97–100% AI 36
 81–94% AI 7
 73–79% AI 10
Responses counted as uncertain 35
 61–68% AI 8
 52–58% AI 7
 40–49% AI 11
 30–38% AI 8
 “Unexpected error” 1
Responses counted as human 38
 20–29% AI 8
 10–19% AI 11
 3–9% AI 7
 0% AI 12

aThese results indicate the detector’s confidence in the assessment – not the percentage of the text that is AI. The “Unexpected error” message persisted even after repeated attempts to conduct the analysis.

Scribbra

Response n
Responses counted as AI 72
 100% AI 51
 93–99% AI 10
 81–86% AI 2
 55–73% AI 5
 45% AI 1
 31–36% AI 3
Responses counted as uncertain 1
 25% AI 1
Responses counted as human 53
 1–7% AI 10
 0% AI 43

aThese results indicate the percentage of the text that is AI – not the detector’s confidence in the assessment.

SEO.aia

Response n
Responses counted as AI 82
 Your text appears AI generated (Probability for AI is 71–100%) 82
Responses counted as uncertain 30
 Your text appears uncertain to determine (Probability for AI is 45–70%) 30
Responses counted as human 14
 Your text appears human made (Probability for AI is 1–37%) 14

aThese results indicate the detector’s confidence in the assessment – not the percentage of the text that is AI.

TurnItIna

Response n
Responses counted as AI 84
 100% AI 83
 84% AI 1
Responses counted as uncertain 0
 (None) 0
Responses counted as human 42
 0% AI 42

aThese results indicate the percentage of the text that is AI – not the detector’s confidence in the assessment.

Writer

Response n
Responses counted as AI 60
 You should edit your text until there’s less detectable AI content (0–90% human-generated content) 60
Responses counted as uncertain 0
 (None) 0
Responses counted as human 66
 Looking great! (92–94% human-generated content) 10
 Fantastic! (96–100% human-generated content) 56

ZeroGPTa

Response n
Responses counted as AI 78
 Your file content is AI/GPT generated (63–100% AI) 70
 Your file content is most likely AI/GPT generated (61–85% AI) 4
 Your file content is likely generated by AI/GPT (55% AI) 1
 Most of your file content is AI/GPT generated (32–61% AI) 3
Responses counted as uncertain 15
 Your file content contains mixed signals, with some parts generated by AI/GPT (36–48% AI) 4
 Your file content is likely human written, may include parts generated by AI/GPT (13–50% AI) 5
 Your file content is most likely human written, may include parts generated by AI/GPT (24–33% AI) 6
Responses counted as human 33
 Your file content is most likely human written (13–27% AI) 11
 Your file content is human written (0–17% AI) 22

aZeroGPT provides both numeric values (which indicate the percentage of the text that is AI) and text descriptions (which appear to reflect the numeric values as well as the detector’s confidence in the assessment). The text descriptions do not always correspond to specific numeric values.

References

Abdullahi, A. (2023, May 5). Top 10 AI detector tools for 2023. eWeek. https://www.eweek.com/artificial-intelligence/ai-detector-software/.

Allison, N. (2023, Mar. 16). 250+ interesting research paper topics for 2022. MyPerfectWords. https://myperfectwords.com/blog/research-paper-guide/research-paper-topics.

Anderson, N., Belavy, D. L., Perle, S. M., Hendricks, S., Hespanhol, L., Verhagen, E., & Memon, A. R. (2023). AI did not write this manuscript, or did it? Can we trick the AI text detector into generated texts? The potential future of ChatGPT and AI in sports & exercise medicine manuscript generation. BMJ Open Sport & Exercise Medicine, 9(1), article e001568. doi: 10.1136/bmjsem-2023-001568.

Andrews, E. (2023). Comparing AI detection tools: One instructor’s experience. Academic Honesty and Integrity. https://tilt.colostate.edu/comparing-ai-detection-tools-one-instructors-experience/.

Aremu, T. (2023, June 7). Unlocking Pandora’s box: Unveiling the elusive realm of AI text detection. Rochester, NY: SSRN. doi: 10.2139/ssrn.4470719.

Atlas, S. (2023). Chatbot prompting: A guide for students, educators, and an AI-augmented workforce. https://www.researchgate.net/publication/367464129_Chatbot_Prompting_A_guide_for_students_educators_and_an_AI-augmented_workforce.

Aw, B. (2023, July 23). 12 best AI detectors in 2023: Results from 180 tests. https://brendanaw.com/best-ai-detector.

Baidoo-Anu, D., & Owusu Ansah, L. (2023, Jan. 25). Education in the era of generative artificial intelligence (AI): Understanding the potential benefits of ChatGPT in promoting teaching and learning. Rochester, NY: SSRN. doi: 10.2139/ssrn.4337484.

Caulfield, J. (2023, June 2). Best AI detector: Free & premium tools compared. Scribbr. https://www.scribbr.com/ai-tools/best-ai-detector/.

Cemper, C. C. (2023, Jan. 29). 13 AI content detection tools tested and AI watermarks. LinkResearchTools. https://www.linkresearchtools.com/blog/ai-content-detector-tools/.

Cingillioglu, I. (2023). Detecting AI-generated essays: The ChatGPT challenge. International Journal of Information and Learning Technology, 40(3), 259–268. doi: 10.1108/IJILT-03-2023-0043.

Compilatio.net. (2023, Feb. 16). Comparison of the best AI detectors in 2023. https://www.compilatio.net/en/blog/best-ai-detectors.

Content at Scale. (2023). AI detector for ChatGPT, GPT4, bard & more. https://contentatscale.ai/ai-content-detector/.

ContentDetector.ai. (2023). AI content detector – ChatGPT plagiarism checker. https://contentdetector.ai/.

Copyleaks. (2023). AI content detector. https://copyleaks.com/ai-content-detector.

Crossplag. (2023). AI content detector. https://crossplag.com/ai-content-detector/.

Crothers, E. N., Japkowicz, N., & Viktor, H. L. (2023, July 18). Machine-generated text: A comprehensive survey of threat models and detection methods. IEEE Access, 11, 70977–71002. doi: 10.1109/ACCESS.2023.3294090.

Dalalah, D., & Dalalah, O. M. A. (2023). The false positives and false negatives of generative AI detection tools in education and academic research: The case of ChatGPT. International Journal of Management Education, 21(2), article 100822. doi: 10.1016/j.ijme.2023.100822.

Demers, T. (2023, Apr. 25). 16 of the best AI and ChatGPT content detectors compared. Search Engine Land. https://searchengineland.com/ai-chatgpt-content-detectors-395957.

Desaire, H., Chua, A. E., Isom, M., Jarosova, R., & Hua, D. (2023). Distinguishing academic science writing from humans or ChatGPT with over 99% accuracy using off-the-shelf machine learning tools. Cell Reports Physical Science, 4(6), article 101426. doi: 10.1016/j.xcrp.2023.101426.

Deziel, M. (2023, Feb. 19). We pitted ChatGPT against tools for detecting AI-written text, and the results are troubling. The Conversation. https://theconversation.com/we-pitted-chatgpt-against-tools-for-detecting-ai-written-text-and-the-results-are-troubling-199774.

Dweck, C. S. (1986). Motivational processes affecting learning. American Psychologist, 41(10), 1040–1048. doi: 10.1037/0003-066X.41.10.1040.

Ganesh, S. (2023, June 12). Explore these top 5 AI detector tools to detect AI-generated content. Analytics Insight. https://www.analyticsinsight.net/explore-these-top-5-ai-detector-tools-to-detect-ai-generated-content/.

Gao, C. A., Howard, F. M., Markov, N. S., Dyer, E. C., Ramesh, S., Luo, Y., & Pearson, A. T. (2023). Comparing scientific abstracts generated by ChatGPT to real abstracts with detectors and blinded human reviewers. NPJ Digital Medicine, 6, article 75. doi: 10.1038/s41746-023-00819-6.

Gewirtz, D. (2023, Jan. 13). Can AI detectors save us from ChatGPT? I tried 3 online tools to find out. ZDNET Tech Today. https://www.zdnet.com/article/can-ai-detectors-save-us-from-chatgpt-i-tried-3-online-tools-to-find-out/.

Gillham, J. (2023). AI content detector accuracy review + open source dataset and research tool. Originality.ai. https://originality.ai/blog/ai-content-detection-accuracy.

Golinkoff, R. M., & Wilson, J. (2023, Feb. 2). ChatGPT is a wake-up call to revamp how we teach writing. Philadelphia Inquirer. https://www.inquirer.com/opinion/commentary/chatgpt-ban-ai-education-writing-critical-thinking-20230202.html.

GPT Radar. (2023). Detect AI generated text in a click. https://gptradar.com/.

GPTZero. (2023). More than an AI detector. Preserve what’s human. https://gptzero.me/.

Grammica. (2023). AI detector. https://grammica.com/ai-detector.

Guo, B., Zhang, X., Wang, Z., Jiang, M., Nie, J., Ding, Y., … Wu, Y. (2023, Jan. 18). How close is ChatGPT to human experts? Comparison corpus, evaluation, and detection. Ithaca, NY: Cornell University. doi: 10.48550/arXiv.2301.07597.

Heikkilä, M. (2023, Feb. 7). Why detecting AI-generated text is so difficult (and what to do about it). MIT Technology Review. https://www.technologyreview.com/2023/02/07/1067928/why-detecting-ai-generated-text-is-so-difficult-and-what-to-do-about-it/.

Ivanov, V. (2023, June 23). Which is the best AI content detector? https://trickmenot.ai/which-is-the-best-ai-content-detector/.

IvyPanda. (2023). GPT essay checker for students. https://ivypanda.com/gpt-essay-checker.

Kearney, V. (2022, Oct. 26). 100 technology topics for research papers. Owlcation. https://owlcation.com/academia/100-Technology-Topics-for-Research-Paper.

Khalil, M., & Er, E. (2023, Feb. 8). Will ChatGPT get you caught? Rethinking of plagiarism detection. Ithaca, NY: Cornell University. doi: 10.48550/arXiv.2302.04335.

Krishna, K., Song, Y., Karpinska, M., Wieting, J., & Iyyer, M. (2023, Mar. 23). Paraphrasing evades detectors of AI-generated text, but retrieval is an effective defense. Ithaca, NY: Cornell University. doi: 10.48550/arXiv.2303.13408.

Liang, W., Yuksekgonul, M., Mao, Y., Wu, E., & Zou, J. (2023, July 10). GPT detectors are biased against non-native English writers. Ithaca, NY: Cornell University. doi: 10.48550/arXiv.2304.02819.

Lund, B. D., Wang, T., Mannuru, N. R., Nie, B., Shimray, S., & Wang, Z. (2023). ChatGPT and a new academic reality: Artificial Intelligence-written research papers and the ethics of the large language models in scholarly publishing. Journal of the Association for Information Science and Technology, 74(5), 570–581. doi: 10.1002/asi.24750.

Marche, S. (2022, Dec. 6). The college essay is dead. Nobody is prepared for how AI will transform academia. The Atlantic. https://www.theatlantic.com/technology/archive/2022/12/chatgpt-ai-writing-college-student-essays/672371/.Search in Google Scholar

Maruccia, A. (2023, Mar. 22). Reliable detection of AI-generated text is impossible, a new study says. TechSpot. https://www.techspot.com/news/98031-reliable-detection-ai-generated-text-impossible-new-study.html.Search in Google Scholar

Mujezinovic, D. (2023, May 11). AI content detectors don’t work, and that’s a big problem. MUO: Make Use Of. https://www.makeuseof.com/ai-content-detectors-dont-work/.Search in Google Scholar

OpenAI. (2023a). GPT-4 is OpenAI’s most advanced system, producing safer and more useful responses. https://openai.com/gpt-4.Search in Google Scholar

OpenAI. (2023b). New AI classifier for indicating AI-written text. https://openai.com/blog/new-ai-classifier-for-indicating-ai-written-text.Search in Google Scholar

OpenAI. (2023c, Mar. 27). GPT-4 technical report. https://paperswithcode.com/paper/gpt-4-technical-report-1.Search in Google Scholar

Originality.ai. (2023). Most accurate AI content checker & plagiarism checker for content marketers. https://originality.ai/.Search in Google Scholar

Paperell.net. (2023). 200 best research paper topics for 2023 + examples. https://paperell.net/blog/best-research-paper-topics.Search in Google Scholar

Pegoraro, A., Kumari, K., Fereidooni, H., & Sadeghi, A.-R. (2023, Apr. 5). To ChatGPT, or not to ChatGPT: That is the question! Ithaca, NY: Cornell University. doi: 10.48550/arXiv.2304.01487.Search in Google Scholar

Perkins, M., Roe, J., Postma, D., McGaughran, J., & Hickerson, D. (2023, May 29). Game of tones: Faculty detection of GPT-4 generated content in university assessments. Ithaca, NY: Cornell University. doi: 10.48550/arXiv.2305.18081.Search in Google Scholar

Rigolino, R. E. (2023, Jan. 31). With ChatGPT, we’re all editors now. Inside Higher Ed. https://www.insidehighered.com/views/2023/01/31/chatgpt-we-must-teach-students-be-editors-opinion.Search in Google Scholar

Sadasivan, V. S., Kumar, A., Balasubramanian, S., Wang, W., & Feizi, S. (2023, June 28). Can AI-generated text be reliably detected? Ithaca, NY: Cornell University. doi: 10.48550/arXiv.2303.11156.Search in Google Scholar

Sapling. (2023). AI detector. https://sapling.ai/ai-content-detector.Search in Google Scholar

Sarikas, C. (2020, Jan. 25). 113 great research paper topics. PrepScholar. https://blog.prepscholar.com/good-research-paper-topics.Search in Google Scholar

Scribbr. (2023). Free AI detector. https://www.scribbr.com/ai-detector/.Search in Google Scholar

SEO.ai. (2023). AI content detector. https://seo.ai/detector.Search in Google Scholar

Singh, A. (2023, July 24). 12 best AI content detectors of 2023 (accurate data). DemandSage. https://www.demandsage.com/ai-content-detectors/.Search in Google Scholar

Somoye, F. L. (2023, June 12). ChatGPT detectors in 2023. PC Guide. https://www.pcguide.com/apps/chat-gpt-detectors/.Search in Google Scholar

Tate, J. (2023, Feb. 5). Socrates never wrote a term paper. Wall Street Journal. 281, A15. https://www.wsj.com/articles/socrates-never-wrote-a-term-paper-education-teaching-learning-college-ai-chatgpt-lecturing-students-11675613853.Search in Google Scholar

TurnItIn. (2023). Empower students to do their best, original work. https://www.turnitin.com/.Search in Google Scholar

van Oijen, V. (2023, Mar. 31). AI-generated text detectors: Do they work? SURF Communities: AI in Education. https://communities.surf.nl/en/ai-in-education/article/ai-generated-text-detectors-do-they-work.Search in Google Scholar

Walters, W. H., Sheehan, S. E., Handfield, A. E., López-Fitzsimmons, B. M., Markgren, S., & Paradise, L. (2020). A multi-method information literacy assessment program: Foundation and early results. Portal: Libraries and the Academy, 20(1), 101–135. doi: 10.1353/pla.2020.0006.Search in Google Scholar

Wang, J., Liu, S., Xie, X., & Li, Y. (2023, Apr. 11). Evaluating AIGC detectors on code content. Ithaca, NY: Cornell University. doi: 10.48550/arXiv.2304.05193.Search in Google Scholar

Weber-Wulff, D., Anohina-Naumeca, A., Bjelobaba, S., Foltýnek, T., Guerrero-Dib, J., Popoola, O., … Waddington, L. (2023, July 10). Testing of detection tools for AI-generated text. Ithaca, NY: Cornell University. doi: 10.48550/arXiv.2306.15666.Search in Google Scholar

Welding, L. (2023, Mar. 27). Half of college students say using AI on schoolwork is cheating or plagiarism. BestColleges. https://www.bestcolleges.com/research/college-students-ai-tools-survey/.Search in Google Scholar

Wiggers, K. (2023, Feb. 16). Most sites claiming to catch AI-written text fail spectacularly. TechCrunch. https://techcrunch.com/2023/02/16/most-sites-claiming-to-catch-ai-written-text-fail-spectacularly/.Search in Google Scholar

Williams, R. (2023, July 7). AI-text detection tools are really easy to fool. MIT Technology Review. https://www.technologyreview.com/2023/07/07/1075982/ai-text-detection-tools-are-really-easy-to-fool/.Search in Google Scholar

Winston.ai. (2023, Feb. 14). Best AI detectors in 2023 compared. https://gowinston.ai/best-ai-detector/.Search in Google Scholar

Writer. (2023). AI content detector. https://writer.com/ai-content-detector/.Search in Google Scholar

Yan, D., Fauss, M., Hao, J., & Cui, W. (2023). Detection of AI-generated essays in writing assessments. Psychological Test and Assessment Modeling, 65(1), 125–144. https://www.psychologie-aktuell.com/fileadmin/Redaktion/Journale/ptam_2023-1/PTAM__1-2023_5_kor.pdf.Search in Google Scholar

ZeroGPT. (2023). GPT-4, ChatGPT & AI detector by ZeroGPT: detect OpenAI text. https://www.zerogpt.com/.Search in Google Scholar

Received: 2023-08-01
Revised: 2023-09-12
Accepted: 2023-09-15
Published Online: 2023-10-06

© 2023 the author(s), published by De Gruyter

This work is licensed under the Creative Commons Attribution 4.0 International License.
