DEEPFAKE DETECTION USING CVIT
Submitted by:
Rajan Gaihre [KAN076BCT60]
Santosh Adhikari [KAN076BCT73]
Sushant Shrestha [KAN076BCT92]
Utsab Karki [KAN076BCT95]
Submitted to:
Department of Computer and Electronics Engineering
Kantipur Engineering College
Dhapakhel, Lalitpur
June, 2023
ABSTRACT
The recent rise of machine learning and artificial intelligence has driven the rapid adoption of AI models. The use of artificial intelligence has brought numerous advancements and benefits across various domains. However, alongside its positive impacts, AI has also introduced negative aspects, such as the rise in popularity of deepfake technology. Using deepfake technology, one can synthesize hyper-realistic videos known as deepfakes. Deep learning techniques can be used to generate faces, swap faces between two subjects in a video, alter facial expressions, change gender, and alter facial features. While these innovations have opened up exciting possibilities for creative expression and entertainment, they also raise significant concerns regarding potential misuse and ethical implications. We propose to use the Convolutional Vision Transformer (CViT) for deepfake detection. The CNN component extracts learnable features, while the ViT component takes the features obtained from the CNN and categorizes them.
TABLE OF CONTENTS
ABSTRACT i
1 INTRODUCTION 1
1.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Problem Statement . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.3 Objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.4 Application Scope . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.5 Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.6 Development Requirement . . . . . . . . . . . . . . . . . . . . . . . . 3
1.6.1 Hardware Requirement . . . . . . . . . . . . . . . . . . . . . 3
1.6.2 Software Requirement . . . . . . . . . . . . . . . . . . . . . . 3
1.7 Deployment Requirement . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.7.1 Hardware Requirement . . . . . . . . . . . . . . . . . . . . . . 3
1.7.2 Software Requirement . . . . . . . . . . . . . . . . . . . . . . 4
1.8 Feasibility Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.8.1 Economic Feasibility . . . . . . . . . . . . . . . . . . . . . . . 4
1.8.2 Technical Feasibility . . . . . . . . . . . . . . . . . . . . . . . 4
1.8.3 Operational Feasibility . . . . . . . . . . . . . . . . . . . . . . 4
1.8.4 Schedule Feasibility . . . . . . . . . . . . . . . . . . . . . . . 4
1.9 Work Schedule . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2 LITERATURE REVIEW 6
2.1 Similar Projects . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.1.1 MesoNet: A Compact Facial Video Forgery Detection Network 6
2.1.2 Deepfakes Detection with Automatic Face Weighting . . . . . . 6
2.1.3 Capsule-Forensics: Using Capsule Networks to Detect Forged
Images and Videos . . . . . . . . . . . . . . . . . . . . . . . . 7
2.2 Related Papers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.2.1 Machine Learning . . . . . . . . . . . . . . . . . . . . . . . . 7
2.2.2 Deepfake . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
3 METHODOLOGY 8
3.1 Working Mechanism . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
3.2 Detail Description . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
3.2.1 Detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
3.2.2 Preprocessing Component . . . . . . . . . . . . . . . . . . . . 9
3.2.3 Detection Component . . . . . . . . . . . . . . . . . . . . . . 10
3.2.4 Convolutional Vision Transformer . . . . . . . . . . . . . . . . 10
3.2.5 Feature Learning Component . . . . . . . . . . . . . . . . . . 11
3.2.6 Vision Transformer Component . . . . . . . . . . . . . . . . . 11
3.2.7 MTCNN . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
3.3 Use case diagram . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
3.4 Software model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
4 EXPECTED OUTPUT 16
4.1 Expected Output . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
References 16
LIST OF FIGURES
CHAPTER 1
INTRODUCTION
1.1 Background
Technologies for altering images, videos, and audio are developing rapidly. Deepfake technology has gained significant attention in recent years due to its ability to generate highly realistic manipulated media. The techniques and technical expertise needed to create and manipulate digital content are also easily accessible, as abundant reading material is available on the internet. Currently, it is possible to seamlessly generate hyper-realistic digital images with few resources and easy-to-follow instructions available online. Deepfake is a technique that aims to replace the face of a targeted person with the face of someone else in a video. It is created by splicing a synthesized face region into the original image. The term can also refer to the final hyper-realistic video produced. Deepfakes can be used for the creation of hyper-realistic Computer Generated Imagery (CGI), Virtual Reality (VR), Augmented Reality (AR), education, animation, arts, and cinema. However, since deepfakes are deceptive in nature, they can also be used for malicious purposes. By utilizing the capabilities of the CViT and focusing on inconsistencies in pixel-level details, we aim to address the disadvantages of deepfake technology and provide a robust defense against its malicious usage. This project strives to contribute to the development of advanced deepfake detection techniques, enhancing the security and integrity of digital media in an increasingly vulnerable landscape [1].
1.2 Problem Statement
Deepfake videos are becoming increasingly realistic, making it challenging to distinguish them from genuine videos with traditional detection methods. Existing solutions often lack generalizability and struggle to keep pace with the rapid evolution of deepfake techniques. Moreover, the accessibility of deepfake creation tools and the abundance of training data pose additional challenges for detection algorithms. Another problem is the potential harmful impact of deepfakes on individuals, organizations, and society as a whole. Deepfakes can undermine trust in digital media, manipulate public opinion, and harm the reputation of individuals and businesses. The consequences of undetected deepfakes can be severe, ranging from political unrest to financial fraud. The ultimate goal is to develop advanced deepfake detection techniques that can accurately identify and distinguish between genuine and manipulated videos, ensuring the integrity and trustworthiness of digital media. By addressing the problem of deepfake detection, we aim to mitigate the risks associated with deepfakes, protect individuals and organizations from potential harm, and safeguard the credibility of digital content.
1.3 Objectives
1.5 Features
The system identifies manipulated videos, detecting the presence of deepfake techniques such as face swapping.
1.6 Development Requirement
Development requirements fall into two main categories, hardware and software. These requirements play a vital role in the successful completion of the project and are listed below:
1.7.2 Software Requirement
1.8 Feasibility Study
The feasibility analysis of the project is one of the most important considerations in project development. It can be divided into the following components.
1.8.1 Economic Feasibility
The project is mostly software-based and only requires a camera as a hardware component, which is present in every laptop and mobile phone. So, it is economically feasible.
1.8.2 Technical Feasibility
This system is web-based and requires a laptop with a camera and internet availability for its smooth functioning. Apart from that, the system is user-friendly and easy to operate without requiring any advanced hardware components.
1.8.3 Operational Feasibility
All processes in the program will work, and operation can be carried out easily by anyone with a little application knowledge. As per operational feasibility, the project meets the requirement criteria for system development.
1.8.4 Schedule Feasibility
Our system is a web-based platform that allows users to log in, register, and use the system accordingly. Development involves a CNN model, slow hashing, and a frontend designed with HTML, CSS, etc. So, it is feasible.
1.9 Work Schedule
The work schedule is shown with the help of the Gantt chart in fig 1.1.
CHAPTER 2
LITERATURE REVIEW
2.1.3 Capsule-Forensics: Using Capsule Networks to Detect Forged Images and Videos
2.2 Related Papers
2.2.1 Machine Learning
Machine learning is widely used in software to enable an improved user experience. Using machine learning, robots can acquire skills or learn to adapt to the environment in which they are working. Robots can acquire skills such as object placement, grasping objects, and locomotion, either through automated learning or through learning via human intervention. The race is on for ML to be used in healthcare analytics. A number of start-ups are looking at the advantages of using ML with big data to provide healthcare professionals with better-informed data, enabling them to make better decisions. Machine learning algorithms fall into one of two learning types: supervised or unsupervised learning.
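As a toy illustration of the distinction between the two learning types, the following sketch (our own example, not part of this project) trains a supervised classifier on labelled data and an unsupervised clustering model on the same data without labels, using scikit-learn.

# Illustrative only: the dataset and the two models below are arbitrary examples
# chosen to contrast supervised and unsupervised learning.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

X, y = load_iris(return_X_y=True)

# Supervised learning: the model is trained on labelled examples (X, y).
clf = LogisticRegression(max_iter=1000).fit(X, y)
print("supervised accuracy:", clf.score(X, y))

# Unsupervised learning: the model sees only X and groups it without labels.
km = KMeans(n_clusters=3, n_init=10).fit(X)
print("cluster assignments:", km.labels_[:10])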
2.2.2 Deepfake
Deepfakes, a term that first emerged in 2017 to describe realistic photo, audio, video, and other forgeries generated with artificial intelligence (AI) technologies, could present a variety of national security challenges in the years to come. As these technologies continue to mature, they could hold significant implications for congressional oversight, U.S. defense authorizations and appropriations, and the regulation of social media platforms.
CHAPTER 3
METHODOLOGY
3.1 Working Mechanism
In the web application, users can either upload a video or provide a video link. The application then performs face extraction using the MTCNN algorithm and proceeds with data augmentation. Face images from the video are extracted and converted to a standardized size of 224 x 224 in RGB format. Data augmentation plays a vital role in enhancing the training data. It involves applying various transformations to the existing training samples, generating additional samples with slightly modified versions of the original data. For image classification, we utilize the Convolutional Vision Transformer (CViT), a hybrid model that combines the strengths of Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs). The CNN component is responsible for feature learning, extracting meaningful patterns from the face images. Meanwhile, the ViT component receives a feature map of the face images. It splits the images into patches, applies linear projection, and employs learnable embeddings to analyze and determine the presence of deepfake manipulation. By leveraging the synergies between CNNs and ViTs, the CViT model offers a powerful approach to image classification, particularly in detecting deepfake content. This integrated framework enables the system to effectively learn and recognize both local and global features within the face images, providing accurate assessments of potential deepfake activity. Through the web application, users can benefit from the CViT model, ensuring reliable detection and assessment of deepfake content and contributing to a more trustworthy digital environment. The block diagram of the working mechanism is shown in fig 3.1.
Figure 3.1: Flowchart of Working mechanism
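As an illustration of the upload-and-extract step described above, the sketch below grabs frames from a video and crops 224 x 224 RGB faces. The library choices (OpenCV for frame reading, the facenet-pytorch implementation of MTCNN) and the frame sampling rate are our assumptions; the report does not fix an implementation.

# A minimal sketch of the face-extraction step; library choices are assumptions.
import cv2
from facenet_pytorch import MTCNN

# image_size=224 makes MTCNN return face crops already resized to 224 x 224.
mtcnn = MTCNN(image_size=224, post_process=False)

def extract_faces(video_path, every_nth=10):
    """Grab every n-th frame and return 224 x 224 RGB face crops as tensors."""
    faces, cap, idx = [], cv2.VideoCapture(video_path), 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % every_nth == 0:
            rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)  # OpenCV reads BGR
            face = mtcnn(rgb)        # None if no face is found in the frame
            if face is not None:
                faces.append(face)   # tensor of shape (3, 224, 224)
        idx += 1
    cap.release()
    return faces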
3.2 Detail Description
3.2.1 Detection
The DFDC (DeepFake Detection Challenge) dataset was used for training.
3.2.2 Preprocessing Component
The preprocessing component has two subcomponents: the face extraction component and the data augmentation component. The face extraction component is responsible for extracting face images from a video in a 224 x 224 RGB format, whereas data augmentation involves applying a variety of transformations to the existing training data to create additional samples with slightly modified versions of the original data.
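A minimal sketch of the augmentation step is given below, assuming torchvision transforms; the particular transformations, their parameters, and the file name are illustrative choices, not values fixed by this report.

# Illustrative augmentation pipeline; the chosen transforms are assumptions.
from PIL import Image
from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),       # mirror the face
    transforms.ColorJitter(0.2, 0.2, 0.2),        # brightness / contrast / saturation
    transforms.RandomRotation(degrees=10),        # small rotations
    transforms.ToTensor(),                        # (3, 224, 224) float tensor in [0, 1]
])

face = Image.open("face_0001.png")   # hypothetical 224 x 224 face crop
augmented = augment(face)            # a slightly modified copy of the same face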
3.2.3 Detection Component
3.2.4 Convolutional Vision Transformer
The Convolutional Vision Transformer (CViT) is a hybrid model that combines the strengths of Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs) for image classification tasks. The proposed CViT model consists of two components: Feature Learning (FL) and the ViT. The FL component extracts learnable features from the face images, and the ViT takes the FL output as input and turns it into a sequence for the final detection process [1]. The Convolutional Vision Transformer is shown in fig 3.2.
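The two-component structure can be summarized in a short structural sketch. The class below only wires a feature-learning backbone to a ViT head; both sub-modules are placeholders standing in for the components detailed in sections 3.2.5 and 3.2.6, and the output shape is an assumption.

# Structural sketch of the CViT wiring; sub-modules are placeholders.
import torch.nn as nn

class CViT(nn.Module):
    def __init__(self, feature_learning: nn.Module, vit: nn.Module):
        super().__init__()
        self.feature_learning = feature_learning  # FL: face image -> 512 x 7 x 7 feature map
        self.vit = vit                            # ViT: feature map -> real/fake logits

    def forward(self, faces):                     # faces: (B, 3, 224, 224)
        feature_map = self.feature_learning(faces)
        return self.vit(feature_map)              # e.g. (B, 2) class logits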
3.2.5 Feature Learning Component
FL component uses VGG structure without a fully-connected layer because the purpose
is not for classification but to extract face image features for the ViT component. The
FL component has 17 convolutional layers, with a kernel of 3 x 3. The convolutional
layers extract the low level feature of the face images. All convolutional layers have a
stride and padding of 1. Batch normalization to noramalize the output features and the
ReLU activation function for non-linearity are applied in all of the layers. The Batch
normalization function normalizes change in the distribution of the previous layers [4],
as the change in between the layers will affect the learning process of the CNN archi-
tecture. A five max-pooling of a 2x2 pixel window with stride equal to 2 is also used.
The max-pooling operation reduces dimension of image size by half. After each max-
pooling operation, the width of the convolutional layer(channel) is doubled by a factor
of 2, with the first layer having 32 channels and the last layer 512. The final output of
the FL is a 512 x 7 x 7 input images, which are fed to the ViT architecture[5].
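A sketch of such an FL component is shown below, built from the constraints stated in this section: 3 x 3 convolutions with stride and padding of 1, batch normalization and ReLU after every convolution, five 2 x 2 max-pooling layers, and channels doubling from 32 to 512. The number of convolutions per block is our assumption; only the totals (17 convolutions, 5 pools) are given in the text.

# Sketch of a VGG-style feature-learning backbone; per-block layout is assumed.
import torch
import torch.nn as nn

CFG = [32, 32, "M",
       64, 64, 64, "M",
       128, 128, 128, "M",
       256, 256, 256, 256, "M",
       512, 512, 512, 512, 512, "M"]   # 17 conv layers, 5 max-pools

def make_feature_learning(cfg=CFG, in_ch=3):
    layers = []
    for v in cfg:
        if v == "M":
            layers.append(nn.MaxPool2d(kernel_size=2, stride=2))  # halves H and W
        else:
            layers += [nn.Conv2d(in_ch, v, kernel_size=3, stride=1, padding=1),
                       nn.BatchNorm2d(v),
                       nn.ReLU(inplace=True)]
            in_ch = v
    return nn.Sequential(*layers)

fl = make_feature_learning()
out = fl(torch.randn(1, 3, 224, 224))
print(out.shape)   # torch.Size([1, 512, 7, 7])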
3.2.6 Vision Transformer Component
The input to the ViT component is a feature map of the face images. This component involves the following processes:
Patches
Patching is the process of dividing an image into smaller rectangular or square regions called patches. Each patch represents a localized region of the image and typically has fixed dimensions. By dividing an image into patches, it becomes possible to analyze and process specific regions of interest within the image. This approach is particularly useful when dealing with large images or when focusing on local features or structures. To apply the self-attention mechanism in vision transformers, the image is first divided into smaller patches or tiles. Each patch is then treated as an individual token, similar to words in a sentence [5].
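A minimal sketch of this patching step is given below: an input of shape (B, C, H, W) is split into non-overlapping P x P patches and each patch is flattened into a token. The patch size and input size are illustrative, not values prescribed by the report.

# Sketch of splitting an image (or feature map) into flattened patch tokens.
import torch

def to_patches(x, patch_size):
    b, c, h, w = x.shape
    # (B, C, H/P, W/P, P, P) -> (B, H/P * W/P, C * P * P)
    patches = x.unfold(2, patch_size, patch_size).unfold(3, patch_size, patch_size)
    patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(b, -1, c * patch_size * patch_size)
    return patches

x = torch.randn(1, 3, 224, 224)
tokens = to_patches(x, patch_size=16)
print(tokens.shape)   # torch.Size([1, 196, 768]): 14 x 14 patches, each a 768-dim token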
Linear Projection
After building the image patches, a linear projection layer is used to map the image patch "arrays" to patch embedding "vectors". By mapping the patches to embeddings, we obtain the correct dimensionality for input into the transformer [5].
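The projection itself can be expressed as a single learnable linear layer; the dimensions below are illustrative and carry over from the patching sketch above, not values prescribed by the report.

# Sketch of the linear projection from flattened patches to patch embeddings.
import torch
import torch.nn as nn

patch_dim, embed_dim = 768, 512
projection = nn.Linear(patch_dim, embed_dim)

patches = torch.randn(1, 196, patch_dim)        # tokens from the patching step
patch_embeddings = projection(patches)          # (1, 196, 512)
print(patch_embeddings.shape)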
Learnable Embeddings
The encoding stage in a vision transformer plays a crucial role in capturing relevant features and patterns from the image patches, enabling subsequent processing and decision-making. Using a "blank slate" class token as the sole input to the classification head pushes the transformer to learn to encode a "general representation" of the entire sequence into that embedding. The model must do this to enable accurate classifier predictions [5].
Positional Embeddings
The positional embeddings are learned vectors with the same dimensionality as the patch embeddings. After creating the patch embeddings and prepending the "class" embedding, we sum them all with positional embeddings. These positional embeddings are learned during pretraining and (sometimes) during fine-tuning. During training, these embeddings converge into vector spaces where they show high similarity to their neighboring position embeddings, particularly those sharing the same column and row [5].
Figure 3.3: ViT process with the learnable class embedding
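The following sketch combines the two embedding steps above: prepending a learnable "class" token and adding learnable positional embeddings. The dimensions follow the earlier illustrative values (196 patches, 512-dimensional embeddings) and are not prescribed by the report.

# Sketch of class-token prepending and positional-embedding addition.
import torch
import torch.nn as nn

num_patches, embed_dim = 196, 512
cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))              # "blank slate" token
pos_embedding = nn.Parameter(torch.randn(1, num_patches + 1, embed_dim) * 0.02)

patch_embeddings = torch.randn(1, num_patches, embed_dim)            # from the linear projection
cls = cls_token.expand(patch_embeddings.size(0), -1, -1)             # one class token per sample
tokens = torch.cat([cls, patch_embeddings], dim=1)                   # (1, 197, 512)
tokens = tokens + pos_embedding                                      # add learned positions
print(tokens.shape)   # torch.Size([1, 197, 512]): input to the transformer encoder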
3.2.7 MTCNN
3.3 Use case diagram
3.4 Software model
The project follows the incremental model, also known as the successive version model. The first increment is the core product covering basic requirements, and in the next increments, supplementary features are added. Once the core product is analyzed by the client, a plan is developed for the next increment. Many successive iterations/versions are implemented and delivered to the customer until the desired system is released. The figure of the model is shown in fig 3.5.
CHAPTER 4
EXPECTED OUTPUT
4.1 Expected Output
The system allows users to upload a video or provide a video link in order to determine whether the video contains deepfake content. The model then analyzes the video to determine if there are any inconsistencies or other signs of tampering. By processing the input, the system generates a result indicating the presence or absence of deepfake manipulation in the provided video. Using this system, users can verify the authenticity of videos with greater confidence and help prevent the spread of misinformation.
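As an illustration of how this final verdict could be produced, the sketch below averages the per-face predictions of a trained model over all faces extracted from one video and thresholds the result. The aggregation rule, the threshold, and the assumption that the model emits one fake-probability logit per face are ours; the report only states that the system returns a result.

# Sketch of turning per-face scores into a per-video verdict; rules are assumed.
import torch

def classify_video(model, faces, threshold=0.5):
    """faces: list of (3, 224, 224) tensors extracted from one video."""
    model.eval()
    with torch.no_grad():
        batch = torch.stack(faces)                     # (N, 3, 224, 224)
        probs = torch.sigmoid(model(batch)).squeeze()  # per-face fake probability
        video_score = probs.mean().item()              # aggregate over all faces
    return ("DEEPFAKE" if video_score >= threshold else "REAL"), video_score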
REFERENCES
[1] D. Wodajo and S. Atnafu, “Deepfake video detection using convolutional vision
transformer,” arXiv preprint arXiv:2102.11126, 2021.