
KANTIPUR ENGINEERING COLLEGE

(Affiliated to Tribhuvan University)


Dhapakhel, Lalitpur

[Subject Code: CT707]


A MAJOR PROJECT FINAL REPORT ON

DEEPFAKE DETECTION USING CVIT

Submitted by:
Rajan Gaihre [KAN076BCT60]
Santosh Adhikari [KAN076BCT73]
Sushant Shrestha [KAN076BCT92]
Utsab Karki [KAN076BCT95]

A MAJOR PROJECT SUBMITTED IN PARTIAL
FULFILLMENT OF THE REQUIREMENT FOR THE DEGREE
OF BACHELOR IN COMPUTER ENGINEERING

Submitted to:
Department of Computer and Electronics Engineering

June, 2023
ABSTRACT

The recent rise of machine learning and artificial intelligence (AI) has led to the rapid
development and adoption of AI models. AI has brought numerous advancements and
benefits in various domains. However, alongside its positive impacts, it has also
introduced negative aspects, such as the growing popularity of deepfake technology.
Deepfake technology can be used to synthesize hyper-realistic videos known as deepfakes.
Deep learning techniques can be used to generate faces, swap faces between two subjects
in a video, alter facial expressions, change gender, and alter facial features. While
these innovations have opened up exciting possibilities for creative expression and
entertainment, they also raise significant concerns regarding potential misuse and
ethical implications. We propose to use the Convolutional Vision Transformer (CViT) for
deepfake detection. The CNN extracts learnable features, while the ViT takes the
features obtained from the CNN and classifies them.

Keywords: Face Detection, Machine Learning, Vision Transformer, Convolutional Neural
Networks

TABLE OF CONTENTS

ABSTRACT
1 INTRODUCTION
1.1 Background
1.2 Problem Statement
1.3 Objectives
1.4 Application Scope
1.5 Features
1.6 Development Requirement
1.6.1 Hardware Requirement
1.6.2 Software Requirement
1.7 Deployment Requirement
1.7.1 Hardware Requirement
1.7.2 Software Requirement
1.8 Feasibility Study
1.8.1 Economic Feasibility
1.8.2 Technical Feasibility
1.8.3 Operational Feasibility
1.8.4 Schedule Feasibility
1.9 Work Schedule
2 LITERATURE REVIEW
2.1 Similar Projects
2.1.1 MesoNet: A Compact Facial Video Forgery Detection Network
2.1.2 Deepfakes Detection with Automatic Face Weighting
2.1.3 Capsule-Forensics: Using Capsule Networks to Detect Forged Images and Videos
2.2 Related Papers
2.2.1 Machine Learning
2.2.2 Deepfake
3 METHODOLOGY
3.1 Working Mechanism
3.2 Detail Description
3.2.1 Detection
3.2.2 Preprocessing Component
3.2.3 Detection Component
3.2.4 Convolutional Vision Transformer
3.2.5 Feature Learning Component
3.2.6 Vision Transformer Component
3.2.7 MTCNN
3.3 Use case diagram
3.4 Software model
4 EXPECTED OUTPUT
4.1 Expected Output
REFERENCES

LIST OF FIGURES

1.1 Gantt Chart
3.1 Flowchart of Working mechanism
3.2 Convolutional Vision Transformer
3.3 ViT process with the learnable class embedding
3.4 Use case diagram
3.5 Incremental Model

CHAPTER 1
INTRODUCTION

1.1 Background

Technologies for altering images, video, and audio are developing rapidly. Deepfake
technology has gained significant attention in recent years due to its ability to
generate highly realistic manipulated media. The techniques and technical expertise
needed to create and manipulate digital content are also easily accessible, as reading
material is abundant on the internet. Currently, it is possible to seamlessly generate
hyper-realistic digital images with few resources and easy-to-follow instructions
available online. Deepfake is a technique that aims to replace the face of a targeted
person in a video with the face of someone else. It is created by splicing a synthesized
face region into the original image. The term can also refer to the resulting
hyper-realistic video. Deepfakes can be used for the creation of hyper-realistic
Computer Generated Imagery (CGI), Virtual Reality (VR), Augmented Reality (AR),
education, animation, arts, and cinema. However, since deepfakes are deceptive in
nature, they can also be used for malicious purposes. By utilizing the capabilities of
the CViT and focusing on inconsistencies in pixel-level details, we aim to counter the
harmful uses of deepfake technology and provide a robust defense against them. This
research project strives to contribute to the development of advanced deepfake
detection techniques, enhancing the security and integrity of digital media in an
increasingly vulnerable landscape [1].

1.2 Problem Statement

The rapid development of deepfake technology poses a significant challenge in the realm
of digital media integrity and cybersecurity. The use of deepfakes raises concerns about
their potential misuse for spreading disinformation, impersonation, and other malicious
activities. The primary problem is the difficulty of detecting deepfake videos
accurately and efficiently. Deepfakes have become increasingly sophisticated, making it
challenging to distinguish them from genuine videos with traditional detection methods.
Existing solutions often lack generalizability and struggle to keep pace with the rapid
evolution of deepfake techniques. Moreover, the accessibility of deepfake creation tools
and the abundance of training data pose additional challenges for detection algorithms.
Another problem is the potential harmful impact of deepfakes on individuals,
organizations, and society as a whole. Deepfakes can undermine trust in digital media,
manipulate public opinion, and harm the reputation of individuals and businesses. The
consequences of undetected deepfakes can be severe, ranging from political unrest to
financial fraud. The ultimate goal is to develop advanced deepfake detection techniques
that can accurately distinguish between genuine and manipulated videos, ensuring the
integrity and trustworthiness of digital media. By addressing the problem of deepfake
detection, we aim to mitigate the risks associated with deepfakes, protect individuals
and organizations from potential harm, and safeguard the credibility of digital content.

1.3 Objectives

The main objective of the application is as follows:

I Identify and distinguish between genuine and manipulated videos.

1.4 Application Scope

The applications of the project include:

I Deepfake detection can be integrated into popular social media platforms to identify
and flag manipulated videos shared by users.
II Deepfake detection can be utilized by news organizations and journalists to verify
the authenticity of videos before reporting or publishing them.

1.5 Features

The features of the application are as follows:

I The system accurately identifies and differentiates between genuine and manipulated
videos, detecting the presence of deepfake techniques such as face swapping.

1.6 Development Requirement

The development requirement includes the following:

1.6.1 Hardware Requirement

The hardware configuration and requirements are as follows:


I PC with 4GB RAM
II Intel i3 or above series processor

1.6.2 Software Requirement

The software configuration and requirements are as follows:

I Operating system: Windows 10, Windows 11, or Linux.
II Web browser: Google Chrome, Opera.

1.7 Deployment Requirement

Deployment requirements fall into two main types: hardware and software. Both play a
vital role in the successful completion of the project and are listed below:

1.7.1 Hardware Requirement

The hardware requirements for deployment are as follows:

I PC with 8GB RAM and SSD storage.
II Intel i5 series processor.
III Camera.

1.7.2 Software Requirement

The software requirements for deployment are as follows:

I Web libraries and languages involved: HTML, CSS, JavaScript, Django.
II IDEs: Visual Studio Code, Jupyter Notebook.

1.8 Feasibility Study

The feasibility analysis is one of the most important considerations in project
development. It can be divided into the following components.

1.8.1 Economic Feasibility

The project is mostly software-based. The only hardware component it requires is a
camera, which is built into most laptops and mobile phones. So, it is economically
feasible.

1.8.2 Technical Feasibility

This system is a web-based system and demands laptops with cameras and internet
availability for its smooth functioning. Apart from that, this system is user-friendly and
easy to operate without requiring any advanced hardware components.

1.8.3 Operational Feasibility

All the processes in the program are expected to work, and the system can be operated
easily by anyone with a little knowledge of the application. In terms of operational
feasibility, the project meets the requirement criteria for system development.

1.8.4 Schedule Feasibility

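Our system is a web-based platform that allows users to register, log in, and use the
system accordingly. The development tasks include building the CNN-based model, slow
hashing, and designing the frontend with HTML, CSS, etc. These tasks can be completed
within the available time. So, the project is feasible in terms of schedule.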

1.9 Work Schedule

The work schedule is shown with the help of the Gantt chart in Figure 1.1.

Figure 1.1: Gantt Chart

CHAPTER 2
LITERATURE REVIEW

2.1 Similar Projects

2.1.1 MesoNet : A Compact Facial Video Forgery Detection Network

The paper entitled “MesoNet: A Compact Facial Video Forgery Detection Network”
introduces a method for automatically and efficiently identifying manipulated faces in
videos. Specifically, it focuses on two techniques, Deepfake and Face2Face, which create
highly realistic fake videos. Traditional methods used for analyzing images are not well
suited to videos because the compression used in video formats significantly degrades
the data quality. To overcome this challenge, the paper adopts a deep learning approach
and proposes two networks with a small number of layers that specifically analyze the
key characteristics of images. The performance of these efficient networks is evaluated
using both an existing dataset and a new dataset compiled from online videos. The
results of the tests show a detection rate of over 98% for Deepfake and 95% for
Face2Face [2].

2.1.2 Deepfakes Detection with Automatic Face Weighting

The paper entitled “Deepfakes Detection with Automatic Face Weighting” proposes a method
for deepfake detection that leverages convolutional neural networks (CNNs) and recurrent
neural networks (RNNs). The approach extracts visual and temporal features from facial
regions in videos to identify manipulations. The authors evaluate the method on the DFDC
dataset and demonstrate competitive performance compared to existing techniques. They
also introduce a technique for automatically weighting different face regions and
explore the use of boosting techniques to enhance the robustness of predictions [3].

2.1.3 Capsule-Forensics: Using Capsule Networks to Detect Forged Images and Videos

The paper entitled “Capsule-Forensics: Using Capsule Networks to Detect Forged Images
and Videos” presents a method that uses a capsule network (a type of deep learning
network) to detect forged images and videos in a wide range of forgery scenarios. The
authors focus on replay attacks, face swapping, facial reenactment, fully
computer-generated images, and spoofing [4].

2.2 Related Papers

2.2.1 Machine Learning

Machine learning (ML) is widely used in software to improve the user experience. Using
machine learning, robots can acquire skills or learn to adapt to the environment in
which they are working. Robots can acquire skills such as object placement, grasping,
and locomotion through either automated learning or learning via human intervention. The
race is on for ML to be used in health care analytics. A number of start-ups are looking
at the advantages of using ML with big data to provide health care professionals with
better information, enabling them to make better decisions. Machine learning algorithms
broadly fall into one of two learning types: supervised or unsupervised learning.

2.2.2 Deepfake

Deepfakes, a term that first emerged in 2017 to describe realistic photo, audio, video,
and other forgeries generated with artificial intelligence (AI) technologies, could
present a variety of national security challenges in the years to come. As these
technologies continue to mature, they could hold significant implications for
congressional oversight, U.S. defense authorizations and appropriations, and the
regulation of social media platforms.

CHAPTER 3
METHODOLOGY

3.1 Working Mechanism

In the web application, users can either upload a video or provide a video link. The
application then performs face extraction using the MTCNN algorithm and proceeds with
data augmentation. Face images are extracted from the video and converted to a
standardized size of 224 x 224 in RGB format. Data augmentation plays a vital role in
enhancing the training data. It involves applying various transformations to the
existing training samples, generating additional samples that are slightly modified
versions of the original data.

For image classification, we use the Convolutional Vision Transformer (CViT), a hybrid
model that combines the strengths of Convolutional Neural Networks (CNNs) and Vision
Transformers (ViTs). The CNN component is responsible for feature learning, extracting
meaningful patterns from the face images. The ViT component receives the feature map of
the face images, splits it into patches, applies a linear projection, and uses learnable
embeddings to determine whether deepfake manipulation is present. This integrated
framework enables the system to learn and recognize both local and global features
within the face images. Through the web application, users can apply the CViT model to
assess whether a video contains deepfake content, contributing to a more trustworthy
digital environment. The flowchart of the working mechanism is shown in Figure 3.1.

Figure 3.1: Flowchart of Working mechanism

3.2 Detail Description

3.2.1 Detection

To detect deepfake videos, the model consists of two components: the preprocessing
component and the detection component.

3.2.2 Preprocessing Component

The DeepFake Detection Challenge (DFDC) dataset was used for training. The preprocessing
component has two subcomponents: face extraction and data augmentation. The face
extraction component is responsible for extracting face images from a video in a
224 x 224 RGB format, whereas data augmentation applies a variety of transformations to
the existing training data to create additional samples that are slightly modified
versions of the original data.
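
As a minimal sketch of the augmentation step, the following Python code applies two transformations to a tiny grayscale image stored as nested lists. The choice of transformations (horizontal flip and brightness shift), the shift of 30, and the helper names are assumptions made for illustration; the report does not list the specific transformations used, and a real pipeline would operate on 224 x 224 RGB face crops with an image library.

```python
def horizontal_flip(image):
    """Mirror the image left to right (each row is reversed)."""
    return [list(reversed(row)) for row in image]


def adjust_brightness(image, delta):
    """Add delta to every pixel and clamp the result to the 0-255 range."""
    return [[max(0, min(255, pixel + delta)) for pixel in row] for row in image]


def augment(image):
    """Return the original image plus slightly modified copies of it."""
    return [
        image,
        horizontal_flip(image),
        adjust_brightness(image, 30),   # assumed brightness shift
        adjust_brightness(image, -30),
    ]


# Hypothetical 2 x 3 grayscale "face crop" used only to show the behaviour.
face = [[10, 20, 30],
        [40, 50, 250]]
samples = augment(face)
```

Each training face would then contribute four samples instead of one, which is the sense in which augmentation creates additional, slightly modified data.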

3.2.3 Detection Component

The detection component includes the training, validation, and testing components. The
model is first trained and then validated to improve its accuracy. The testing component
is where we classify the faces extracted from a specific video and determine their
class.

3.2.4 Convolutional Vision Transformer

The Convolutional Vision Transformer (CViT) is a hybrid model that combines the
strengths of Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs) for
image classification tasks. The proposed CViT model consists of two components: Feature
Learning (FL) and the ViT. The FL component extracts learnable features from the face
images, and the ViT then takes the FL output as input and turns it into a sequence of
image pixels for the final detection process [1]. The Convolutional Vision Transformer
is shown in Figure 3.2.

Figure 3.2: Convolutional Vision Transformer

3.2.5 Feature Learning Component

The FL component uses a VGG-style structure without the fully connected layer, because
its purpose is not classification but the extraction of face image features for the ViT
component. The FL component has 17 convolutional layers with a 3 x 3 kernel. The
convolutional layers extract the low-level features of the face images. All
convolutional layers have a stride and padding of 1. Batch normalization, to normalize
the output features, and the ReLU activation function, for non-linearity, are applied in
all of the layers. Batch normalization normalizes changes in the distribution of the
outputs of the previous layers [4], as changes between the layers affect the learning
process of the CNN architecture. Five max-pooling layers with a 2 x 2 pixel window and a
stride of 2 are also used. Each max-pooling operation halves the height and width of the
feature map. After each max-pooling operation, the width (number of channels) of the
convolutional layers is doubled, with the first layer having 32 channels and the last
layer 512. The final output of the FL component is a 512 x 7 x 7 feature map, which is
fed to the ViT architecture [5].
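
The shape arithmetic behind the 512 x 7 x 7 output can be checked with a short sketch. It traces only tensor shapes, not the actual network. The split of the 17 convolutional layers into five blocks of (3, 3, 3, 4, 4) is an assumption, since the text gives only the total, and the function names are hypothetical.

```python
def conv_output_size(size, kernel=3, stride=1, padding=1):
    """Spatial size after a convolution: floor((n + 2p - k) / s) + 1."""
    return (size + 2 * padding - kernel) // stride + 1


def pool_output_size(size, window=2, stride=2):
    """Spatial size after max-pooling with the given window and stride."""
    return (size - window) // stride + 1


def feature_learning_shapes(input_size=224, convs_per_block=(3, 3, 3, 4, 4),
                            first_channels=32):
    """Trace (channels, height, width) after each of the five blocks.

    convs_per_block is an assumed split of the 17 convolutional layers.
    """
    size, channels, shapes = input_size, first_channels, []
    for block_index, num_convs in enumerate(convs_per_block):
        if block_index > 0:
            channels *= 2                   # width doubles after each max-pooling
        for _ in range(num_convs):
            size = conv_output_size(size)   # 3 x 3 kernel, stride 1, padding 1
        size = pool_output_size(size)       # 2 x 2 window, stride 2
        shapes.append((channels, size, size))
    return shapes


shapes = feature_learning_shapes()
```

With a 3 x 3 kernel, stride 1, and padding 1, each convolution leaves the spatial size unchanged, so only the five pooling steps shrink the 224 x 224 input (224 / 2^5 = 7), while the channel count doubles four times from 32 to 512.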

3.2.6 Vision Transformer Component

The input to the ViT component is a feature map of the face images. This component
involves the following processes:

Split images into patches

An image is divided into smaller rectangular or square regions called patches. Each
patch represents a localized region of the image and typically has fixed dimensions.
Dividing an image into patches makes it possible to analyze and process specific regions
of interest within the image. This approach is particularly useful when dealing with
large images or when focusing on local features or structures. To apply the
self-attention mechanism in vision transformers, the image is first divided into
patches. Each patch is then treated as an individual token, similar to a word in a
sentence [5].

Linear Projection

After building the image patches, a linear projection layer is used to map the image
patch “arrays” to patch embedding “vectors”. By mapping the patches to embeddings, we
obtain the correct dimensionality for input into the transformer [5].
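
The two steps described so far, patch splitting and linear projection, can be sketched in plain Python as follows. The 4 x 4 input, the patch size of 2, the embedding size of 2, and the fixed weights are all assumptions chosen to keep the example readable; in a real model the weights are learned and the operations would run on multi-channel tensors.

```python
def split_into_patches(image, patch_size):
    """Cut a 2-D grid into non-overlapping square patches, flattened row by row."""
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image size must be divisible by patch_size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patches.append([image[row][col]
                            for row in range(top, top + patch_size)
                            for col in range(left, left + patch_size)])
    return patches


def linear_projection(patch, weights, bias):
    """Map a flattened patch to an embedding vector: e = W x + b."""
    return [sum(w * x for w, x in zip(weight_row, patch)) + b
            for weight_row, b in zip(weights, bias)]


# Hypothetical 4 x 4 single-channel input split into 2 x 2 patches.
image = [[1, 2, 3, 4],
         [5, 6, 7, 8],
         [9, 10, 11, 12],
         [13, 14, 15, 16]]
patches = split_into_patches(image, patch_size=2)

# Illustrative fixed weights (embedding size 2); in a real model they are learned.
weights = [[1, 0, 0, 0],               # copies the top-left pixel of the patch
           [0.25, 0.25, 0.25, 0.25]]   # averages the four pixels of the patch
bias = [0, 0]
embeddings = [linear_projection(patch, weights, bias) for patch in patches]
```

Here the 16 pixels become four tokens of length 4, and the projection turns each token into an embedding of length 2, which is the dimensionality change the text refers to.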

Learnable Embeddings

The encoding stage in a vision transformer plays a crucial role in capturing relevant
features and patterns from the image patches, enabling subsequent processing and
decision-making. A learnable “class” embedding is prepended to the patch embeddings.
Using this “blank slate” token as the sole input to the classification head pushes the
transformer to encode a “general representation” of the entire image into that
embedding. The model must do this to enable accurate classifier predictions [5].

Positional Embeddings

The positional embeddings are learned vectors with the same dimensionality as the patch
embeddings. After creating the patch embeddings and prepending the “class” embedding, we
sum them all with the positional embeddings. These positional embeddings are learned
during pretraining and (sometimes) during fine-tuning. During training, they converge
into vector spaces where they show high similarity to their neighboring position
embeddings, particularly those sharing the same column and row [5].
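
The assembly of the transformer input described above can be sketched as follows: the class embedding is prepended to the patch embeddings, and a positional embedding of the same size is added to every token. All numeric values and the function name are hypothetical; in a real model the class and positional embeddings are learned parameters.

```python
def build_transformer_input(patch_embeddings, class_embedding, positional_embeddings):
    """Prepend the class embedding, then add one positional embedding per token."""
    tokens = [class_embedding] + patch_embeddings
    if len(tokens) != len(positional_embeddings):
        raise ValueError("need one positional embedding per token")
    return [[value + offset for value, offset in zip(token, position)]
            for token, position in zip(tokens, positional_embeddings)]


# Hypothetical values with embedding size 2; in a real model these are learned.
patch_embeddings = [[1.0, 2.0], [3.0, 4.0]]
class_embedding = [0.0, 0.0]
positional_embeddings = [[0.5, 0.5], [1.0, 1.0], [2.0, 2.0]]

sequence = build_transformer_input(patch_embeddings, class_embedding,
                                   positional_embeddings)
```

The resulting sequence has one more token than there are patches. After the encoder, only the first token (the class position) is passed to the classification head.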

Figure 3.3: ViT process with the learnable class embedding

3.2.7 MTCNN

The Multi-task Cascaded Convolutional Neural Networks (MTCNN) algorithm, used to detect
faces and facial landmarks, works in three steps and uses one neural network for each
step. The first part is a proposal network that predicts potential face positions and
their bounding boxes, much like the region proposal network in Faster R-CNN. The result
of this step is a large number of face detections, many of which are false detections.
The second part takes the images and the outputs of the first prediction and refines the
result, eliminating most of the false detections and aggregating bounding boxes. The
last part refines the predictions further and, in the original MTCNN implementation,
adds facial landmark predictions. Experimental results have shown that, while retaining
real-time performance, this method outperforms conventional methods on most of the
challenging benchmarks. Face detection is analysed on the Face Detection Data Set and
Benchmark (FDDB) and WIDER FACE benchmarks, and face alignment on the Annotated Facial
Landmarks in the Wild (AFLW) benchmark. This real-time performance is of great
importance in a surveillance system [6].
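
The aggregation of overlapping bounding boxes between cascade stages is commonly done with non-maximum suppression (NMS), which keeps the highest-scoring box in each group of overlapping candidates. The sketch below shows that idea in plain Python. The box coordinates, scores, and the IoU threshold of 0.5 are illustrative assumptions and are not taken from the MTCNN configuration used in this project.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    intersection = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0


def non_maximum_suppression(detections, iou_threshold=0.5):
    """Keep the highest-scoring box of every group of overlapping boxes.

    detections is a list of (box, score) pairs.
    """
    kept = []
    for box, score in sorted(detections, key=lambda d: d[1], reverse=True):
        if all(iou(box, kept_box) <= iou_threshold for kept_box, _ in kept):
            kept.append((box, score))
    return kept


# Hypothetical candidates: the first two cover the same face, the third another.
candidates = [
    ((10, 10, 110, 110), 0.95),
    ((12, 12, 112, 112), 0.80),
    ((200, 50, 260, 110), 0.90),
]
faces = non_maximum_suppression(candidates)
```

In this example the second candidate overlaps the first almost completely and is suppressed, so two face boxes remain.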

3.3 Use case diagram

The use case diagram for the project is given in Figure 3.4.

Figure 3.4: Use case diagram

3.4 Software model

The Incremental Model is a software development process in which the requirements are
divided into multiple standalone modules of the software development cycle. In this
model, each module goes through the requirements, design, implementation, and testing
phases. Every subsequent release of a module adds functionality to the previous release.
The process continues until the complete system is achieved. The incremental process
model is also known as the successive version model. The first increment is the core
product covering the basic requirements, and supplementary features are added in the
following increments. Once the core product has been analyzed by the client, a plan is
developed for the next increment. Successive versions are implemented and delivered to
the customer until the desired system is released. The model is shown in Figure 3.5.

Figure 3.5: Incremental Model

CHAPTER 4
EXPECTED OUTPUT

4.1 Expected Output

The system allows users to upload a video or provide a video link in order to determine
whether the video contains deepfake content. The model then analyzes the video for
inconsistencies and other signs of tampering. After processing the input, the system is
expected to report the presence or absence of deepfake manipulation in the provided
video. By using this system, users can verify the authenticity of videos with greater
confidence and help prevent the spread of misinformation.
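
One simple way to turn per-face predictions into a single video-level result is to average them and compare the mean against a threshold. The sketch below illustrates this. The report does not state how per-face predictions are combined, so the averaging rule, the threshold of 0.5, and the function name are assumptions.

```python
def classify_video(frame_scores, threshold=0.5):
    """Label a video from per-face fake probabilities by averaging them."""
    if not frame_scores:
        return "no face detected"
    mean_score = sum(frame_scores) / len(frame_scores)
    return "deepfake" if mean_score >= threshold else "real"


# Hypothetical fake probabilities for four faces extracted from one video.
label = classify_video([0.9, 0.8, 0.7, 0.4])
```

Averaging makes the result less sensitive to a single misclassified frame than a rule that flags the video as soon as one face looks manipulated.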

REFERENCES

[1] D. Wodajo and S. Atnafu, “Deepfake video detection using convolutional vision
transformer,” arXiv preprint arXiv:2102.11126, 2021.

[2] D. Afchar, V. Nozick, J. Yamagishi, and I. Echizen, “Mesonet: a compact facial
video forgery detection network,” in 2018 IEEE International Workshop on Information
Forensics and Security (WIFS), 2018, pp. 1–7.

[3] D. M. Montserrat, H. Hao, S. K. Yarlagadda, S. Baireddy, R. Shao, J. Horváth,
E. Bartusiak, J. Yang, D. Güera, F. Zhu, and E. J. Delp, “Deepfakes detection with
automatic face weighting,” in 2020 IEEE/CVF Conference on Computer Vision and
Pattern Recognition Workshops (CVPRW), 2020, pp. 2851–2859.

[4] H. H. Nguyen, J. Yamagishi, and I. Echizen, “Capsule-forensics: Using capsule
networks to detect forged images and videos,” in ICASSP 2019 - 2019 IEEE International
Conference on Acoustics, Speech and Signal Processing (ICASSP), 2019, pp. 2307–2311.

[5] Pinecone, “Vision Transformers,” https://www.pinecone.io/learn/vision-transformers/,
accessed on June 16, 2023.

[6] E. Jose, M. Greeshma, M. T. Haridas, and M. Supriya, “Face recognition based
surveillance system using facenet and mtcnn on jetson tx2,” in 2019 5th International
Conference on Advanced Computing & Communication Systems (ICACCS). IEEE, 2019,
pp. 608–613.
