
Project#5: Fake Text Detection

Abhishek Verma1 , Aditya Gupta2 , Aman Dixit3 , Maulik Singhal4 , Prakhar Pradhan5
190042, 190061, 190103, 190489, 190618
AE, EE, BSBE, ME, CE
{abhivrm, adigup, amandx, smaulik, prakharp}@iitk.ac.in

Abstract

This study investigates distinguishing machine-generated text from human-generated text. This paper presents various approaches to building ML-based models, such as Graph Neural Networks (GNNs) and fine-tuning the pre-trained model RoBERTa. We propose the BERT score as an evaluation metric along with perplexity and burstiness. We present sentiment as a semantic feature to make the model more robust, and tune the dataset to make the model less prone to adversarial attacks. These findings can be an effective intervention in improving existing models.

1 Introduction 3 Problem Defintion

Fake text detection refers to the process of iden- Key points:


tifying text that has been produced by artificial 1. The increasing sophistication of natural lan-
intelligence or machine learning algorithms. With guage processing (NLP) techniques and AI
the advancements in natural language processing models has led to the creation of convincing
(NLP) and deep learning algorithms, AI-generated AI-generated text, which can be used for var-
text has become increasingly sophisticated and dif- ious purposes, including spreading misinfor-
ficult to detect. mation, creating fake news, and manipulating
These texts can take many forms, such as fake public opinion. Therefore, there is a need to
news articles, chatbot conversations, product re- develop methods for detecting AI-generated
views, and even social media posts. It can be gener- text to help mitigate the potential harm this
ated with the intention of spreading misinformation, technology can cause.
manipulating public opinion, or impersonating real
2. The problem of detecting AI-generated text
individuals or entities.
can be framed as a binary classification
Detecting fake text is essential to maintain the
problem, where the goal is to determine
authenticity and trustworthiness of online informa-
whether a given text is human-generated or
tion. With the growing prevalence of AI-generated
AI-generated.
text, the development of effective detection tech-
niques has become increasingly important in en- 3. Some potential approaches to this problem in-
suring the accuracy and reliability of online con- clude analyzing the syntax and grammar of the
tent. Various approaches, including language mod- text, detecting patterns of repetition or incon-
els, anomaly detection, and machine learning algo- sistency that are characteristic of AI-generated
rithms, have been employed to detect AI-generated text, and comparing the language and style of
text, and ongoing research in this area is critical to the text to a large corpus of human-generated
combat the spread of false information. text to identify irregularities.
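The binary-classification framing in point 2, combined with the repetition cue mentioned in point 3, can be sketched in a few lines. This is only an illustration: the single repetition feature and the 0.3 threshold are hypothetical placeholders, not the features or decision rule the project actually uses.

```python
# Illustrative sketch of framing detection as binary classification
# using one hypothetical cue from point 3: token repetition.
# The feature and threshold are made up for illustration only.
from collections import Counter

def repetition_score(text: str) -> float:
    """Fraction of tokens that repeat an earlier token."""
    tokens = text.lower().split()
    if not tokens:
        return 0.0
    counts = Counter(tokens)
    return sum(c - 1 for c in counts.values()) / len(tokens)

def classify(text: str, threshold: float = 0.3) -> str:
    """Binary decision ('ai' vs. 'human') from the single feature."""
    return "ai" if repetition_score(text) > threshold else "human"
```

In practice a detector would combine many such features (lexical, syntactic, semantic) inside a learned classifier rather than a single hand-set threshold, as the later sections of this report describe.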
4 Related Work

Key points:

1. In the study by Eric Mitchell et al., empirical research revealed a curvature-based criterion for determining whether a passage is derived from a particular LLM. This method, called DetectGPT, is a zero-shot method, as it does not require training a distinct classifier or collecting a separate dataset of real and generated text. It employs only the log probabilities computed by the model of interest and random passage perturbations, which were generated by an additional pre-trained language model. They discovered that this method enhanced the performance of existing zero-shot methods for detecting fake news by an AUROC of approximately 0.1.

2. In another study, Yongqiang Ma et al. constructed a feature description framework based on syntax, semantics, and pragmatics. They then leveraged the proposed framework's characteristics, i.e., writing style, coherence, consistency, and argument logistics, to analyze the two content categories.

3. Jawahar and others investigated the disparity between AI-generated and human-written scientific text using the popular XGBoost model with two feature extraction schemas, TF-IDF and a hand-crafted set of features. They trained their model to distinguish between four classes of text origin: definitely human-written, possibly human-written, possibly machine-generated, and definitely machine-generated.

4. Zellers et al. (2019) proposed the Grover model to generate fake news samples and detect fake news.

5 Corpus/Data Description

1. We are using the WIKI-INTRO-DATASET (https://huggingface.co/datasets/aadityaubhat/GPT-wiki-intro) to train our model. The dataset has been generated by extracting introductions from Wikipedia for 150k topics and generating text using GPT for the same topics. The schema of the dataset has 12 columns, some of which are: ID, the title of the topic, Wikipedia introduction, GPT-generated introduction, length of the Wikipedia intro, length of the GPT intro, prompt given to GPT, etc.

2. To obtain more data, we have used the ArXiv dataset (https://www.kaggle.com/datasets/Cornell-University/arxiv), which is a subset of the original ArXiv data due to its large size (1.1TB and growing). It includes a JSON metadata file with information for each paper, such as the ArXiv ID, submitter, authors, title, comments, journal-ref, DOI, abstract, categories, and version history. It contains 8686 rows (4343 each).

3. A dataset of essays generated using openAPI (https://www.openapis.org), which contains 2500 rows (testing). The prompt used for generating the data was "Write an essay about <topic> in <country> in about <word limit> words." The data was generated size-wise for [10, 50, 100, 300, 500] words to test performance with respect to the input size.

6 Proposed Approach

1. Supervised classification: In our approach, we used handcrafted features based on the following categories:

   (a) Lexical characteristics: In the input text, lexical features record details about certain words. Word frequency, word length, and the existence of certain words or phrases in the input text are a few examples of lexical characteristics.

   (b) Syntax features: The sequence and relationships between words are among the information that syntax features collect about the structure of the input text. Part-of-speech tags, dependency relationships, or the quantity of noun or verb phrases in the input text are a few examples of syntactic characteristics.

   (c) Semantics: The meaning of the input text, including the definitions of specific words and phrases and the connections between them, is referred to as semantics. There are two dimensions: coherence and consistency. The degree to which a text is logically related and
understandable is called coherence. A cohesive paragraph transitions easily from one topic to the next and is structured logically. On the other side, consistency relates to how devoid of inconsistencies or conflicts a text is.

   (d) Pragmatics: The input text's context and intended meaning are referred to as pragmatics. Discourse analysis, which captures the coherence and flow of the input text, and sentiment analysis, which captures the emotive tone of the input text, are two examples of pragmatic characteristics.

2. Fine-tuning: Fine-tuning entails adapting a previously trained language model to a new task by training it on a particular dataset associated with that task. An extra output layer tailored to the new task is added to the pre-trained model, and the entire model is then refined on the fresh dataset. With this method, the model can perform better on the new task by learning task-specific characteristics and patterns from the new dataset.

7 Experiments and Results

We experimented with several approaches:

1. Logistic Regression - We have implemented LR using all characteristic features.
   (a) Text vectorization
       i. BOW, CBOW
       ii. TF-IDF
       iii. One-hot encoding
       iv. Word2Vec
   (b) Word embedding types
   (c) Frequency-based
       i. BOW, TF-IDF, GloVe
   (d) Prediction-based
       i. Word2Vec

2. Parallel training - We have tried parallel training on various systems.

3. Ensemble methods - Since we used parallel training, an ensemble method was used to combine all the model pickle files and create a metafile.

4. Cloud training (Amazon instance and Google GCP) - We have tried training our models on Amazon SageMaker and Google GCP, since our training dataset was large.

We will be using the following evaluation metrics to evaluate and optimize our approach:

1. Perplexity: A statistical language model's ability to anticipate a brand-new, previously unseen sequence of tokens is known as perplexity. It is determined by taking the test set's inverse probability and normalizing it by the number of words. Perplexity, then, gauges how surprised a model is when it meets a novel token sequence. Better performance is indicated by lower perplexity.

2. Burstiness: A measurement of the distribution of word frequencies within a corpus of texts. It describes how some words tend to occur in groups or bursts rather than being uniformly dispersed across the text.

Metrics                                     Values
Accuracy on validation data (6000 rows)     92.33%
Accuracy on OOB data (8686 rows)            76.05%
Total perplexity score                      1985.971
Total burstiness score                      84.824

8 Error Analysis

Figure 1: Confusion Matrix
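The report describes perplexity and burstiness informally but does not spell out the exact formulas behind the reported scores. A minimal sketch under one common formulation (sequence perplexity from per-token probabilities, and a gap-based burstiness coefficient) might look as follows; the function names and the gap-based burstiness definition are assumptions, not the project's actual implementation.

```python
# Sketch of the two evaluation metrics under common textbook
# definitions; these are assumptions, not the report's exact formulas.
import math
from statistics import mean, pstdev

def perplexity(token_probs):
    """exp of the average negative log-probability: the inverse
    probability of the sequence, normalized by its length."""
    n = len(token_probs)
    return math.exp(-sum(math.log(p) for p in token_probs) / n)

def burstiness(gaps):
    """Burstiness coefficient B = (sigma - mu) / (sigma + mu) over the
    gaps between successive occurrences of a word: B approaches -1 for
    perfectly regular spacing and +1 for highly bursty spacing."""
    mu, sigma = mean(gaps), pstdev(gaps)
    return (sigma - mu) / (sigma + mu)
```

As a sanity check, a model that assigns probability 0.25 to each of four tokens has perplexity 4, matching the intuition that it is effectively choosing among four equally likely options; lower values indicate better prediction, consistent with the description above.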
9 Future Directions

1. Adversarial detection: As fake text generation techniques become more sophisticated, fake text detectors may need to incorporate adversarial detection methods. Adversarial training, where models are trained on both real and artificially generated fake text samples, can help improve the model's robustness against adversarial attacks and make it more effective in detecting advanced fake text techniques.

2. We need more data!

10 Individual Contribution

Each member of our team has been assigned an equal share of the project's workload, with responsibilities thoughtfully distributed across five key areas:

1. Research
2. Data collection and preprocessing
3. Model development and tuning
4. Implementation and integration
5. Testing and evaluation

11 Conclusion

Based on the experiments conducted, the team tried logistic regression with different characteristic features such as BOW, CBOW, TF-IDF, one-hot encoding, GloVe, and Word2Vec. They also implemented parallel training and ensemble methods to combine all the model pickle files and create a metafile. Additionally, they tried cloud training on Amazon SageMaker and Google GCP since the training dataset was large.

To evaluate and optimize their approach, the team used the perplexity metric to measure the model's ability to anticipate a brand-new, previously unseen sequence of tokens, and burstiness to measure the distribution of word frequencies within a corpus of texts. The results showed an accuracy of 92.33% on the validation data.

Overall, the team's approach shows promise in achieving high accuracy in classification. However, further optimization may be necessary to improve the model's performance on OOB data.

References

• Ganesh Jawahar, Muhammad Abdul-Mageed, and Laks Lakshmanan, V.S. 2020. Automatic Detection of Machine Generated Text: A Critical Survey. In Proceedings of the 28th International Conference on Computational Linguistics, pages 2296–2309, Barcelona, Spain (Online). International Committee on Computational Linguistics.

• Mitchell, E., Lee, Y., Khazatsky, A., Manning, C.D. and Finn, C. (2023). DetectGPT: Zero-Shot Machine-Generated Text Detection using Probability Curvature. URL: http://arxiv.org/abs/2301.11305

• Shijaku, Rexhep and Canhasi, Ercan. (2023). ChatGPT Generated Text Detection. 10.13140/RG.2.2.21317.52960. URL: https://www.researchgate.net/publication/366898047_ChatGPT_Generated_Text_Detection

• Ma, Y., Liu, J., Yi, F., Cheng, Q., Huang, Y., Lu, W., Liu, X. (2023). AI vs. Human – Differentiation Analysis of Scientific Content Generation. URL: https://arxiv.org/abs/2301.10416