Summer Intern
Report
On
Bachelor of Technology in
Electronics and Computer Engineering (ECM)
By
N SAI RAM (22311A1988)
2024 – 2025
DEPARTMENT OF ELECTRONICS & COMPUTER ENGINEERING
SREENIDHI INSTITUTE OF SCIENCE AND TECHNOLOGY
(AUTONOMOUS)
CERTIFICATE
This is to certify that the project entitled "Fake Job Post Prediction using Machine
Learning", submitted by N SAI RAM bearing Roll No. 22311A1988 towards partial
fulfilment of the requirements for the award of the Bachelor's Degree in Electronics & Computer Engineering
from Sreenidhi Institute of Science and Technology, Yamnampet, Hyderabad, is a
record of bona fide work done by them during the academic year 2023-2024. The results
embodied in this work have not been submitted to any other University or Institute for the award
of any degree or diploma.
External Examiner
ACKNOWLEDGEMENT
We owe a great many thanks to the many people who have helped and supported us throughout
this project, which would not have taken shape without their cooperation. Thanks to all.
We express our profound gratitude to Dr. T. Ch. Siva Reddy, Principal, and are indebted to our
management, Sreenidhi Institute of Science and Technology, Ghatkesar, for their constructive
criticism.
We would like to specially thank our beloved Dr. D. Mohan, Professor & Head of the Department,
ECM, for his guidance, inspiration and constant encouragement throughout this research work.
These few words would never be complete if we did not mention our thanks to our parents, the
Department laboratory, staff members and all friends, without whose cooperation this project could
not have become a reality.
This is to certify that the work reported in the present group project titled "Fake Job Post Prediction
using Machine Learning" is a record of the work done by us in the Department of Electronics and
Computer Engineering, Sreenidhi Institute of Science and Technology, Yamnampet, Ghatkesar.
The report is based on project work done entirely by us and has not been copied from any other source.
The Random Forest technique, used as a classifier, performs well for this classification task. It
uses multiple decision trees for classification. The trained classifier shows approximately
96% classification accuracy in predicting whether a job post is fraudulent.
LIST OF CONTENTS
Abstract
List of Contents and Figures
Chapter-1 Introduction
1.1 Introduction
1.2 Project Scope
1.3 Existing System
1.4 Proposed System
Chapter-2 System Requirements
2.1 Functional Requirements Specifications
2.2 Performance Requirements
2.3 Software Requirements
2.4 Hardware Requirements
Chapter-3 Contents
3.1 KNN
3.2 Random Forest
3.3 Architecture Design
3.4 UML Diagrams
Chapter-4 Code Implementation
Conclusion
Future Scope
References
LIST OF FIGURES
Fig 3.3 Architecture Design
Fig 3.3.1 Use Case Diagram
Fig 3.3.2 Class Diagram
Fig 3.3.3 Sequence Diagram
CHAPTER-1
INTRODUCTION
1.1 Introduction:
Detecting fake job postings is a critical challenge in today's digital landscape, where online platforms
are increasingly targeted by scammers. Employing machine learning techniques like the Random
Forest Classifier (RFC) offers a robust approach to this problem. RFC is well-suited for this task due
to its ability to handle complex datasets with numerous features, while also providing insights into
feature importance, which is crucial for understanding the characteristics of legitimate versus
fraudulent job posts.
One key advantage of using RFC for fake job post prediction lies in its ensemble learning nature.
RFC combines multiple decision trees, each trained on different subsets of data and features, thereby
reducing overfitting and enhancing generalization. This is particularly beneficial in identifying
patterns in job descriptions, company details, and other textual or numerical features that distinguish
genuine job opportunities from fraudulent ones.
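As a rough illustration of this ensemble effect, the sketch below compares a single decision tree with a Random Forest on synthetic data generated with scikit-learn's make_classification rather than the actual job-post dataset; all parameter values are illustrative.

```python
# Illustrative sketch only: a single decision tree vs. a Random Forest ensemble
# on synthetic data (a stand-in for engineered job-post features).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=20, n_informative=8,
                           weights=[0.9, 0.1], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42)

tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
forest = RandomForestClassifier(n_estimators=200, random_state=42).fit(X_train, y_train)

# The forest averages many de-correlated trees, so it typically generalizes
# better than any single tree trained on the same data.
print("Single tree test accuracy  :", tree.score(X_test, y_test))
print("Random Forest test accuracy:", forest.score(X_test, y_test))
```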
1.2 Scope:
Moreover, RFC facilitates effective feature selection and model interpretation. By analyzing feature
importance scores derived from RFC, researchers and analysts can pinpoint which attributes play the
most significant role in distinguishing between real and fake job postings. This insight not only
improves model accuracy but also provides actionable intelligence for refining detection strategies
and enhancing the overall effectiveness of fraud prevention measures.
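A hedged sketch of how such feature importance scores can be read off a trained Random Forest is shown below; the feature names are hypothetical EMSCAD-style attributes assumed purely for illustration, and the data is synthetic.

```python
# Sketch: inspecting feature importances from a trained Random Forest.
# The feature names below are hypothetical, not the project's actual columns.
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

feature_names = ["telecommuting", "has_company_logo", "has_questions",
                 "employment_type", "required_experience",
                 "required_education", "industry"]
X, y = make_classification(n_samples=1000, n_features=len(feature_names),
                           n_informative=4, random_state=0)
X = pd.DataFrame(X, columns=feature_names)

rfc = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Rank attributes by their contribution to the split decisions.
importances = pd.Series(rfc.feature_importances_, index=feature_names)
print(importances.sort_values(ascending=False))
```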
Another compelling aspect of using RFC is its scalability and efficiency in handling large volumes of
data. With the proliferation of online job platforms and the sheer number of job postings generated
daily, scalability is paramount. RFC's parallel processing capabilities and efficient handling of
high-dimensional data make it suitable for real-time or batch processing scenarios, ensuring timely
detection and mitigation of fraudulent activities.
Furthermore, the adaptability of RFC to different types of data and its ability to handle imbalanced
datasets are noteworthy. Imbalanced datasets, where instances of fraudulent job postings are
significantly outnumbered by legitimate ones, are common in this domain. RFC can be enhanced
with techniques such as class weights or resampling methods to address this imbalance effectively,
thereby improving the model's sensitivity to detecting fraudulent posts without compromising overall
accuracy.
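The following is a minimal sketch of the class-weighting idea described above, on synthetic, deliberately imbalanced data; the class ratio and parameters are assumptions, not the project's actual settings.

```python
# Sketch: handling class imbalance with class weights (an alternative to resampling).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Roughly 5% "fraudulent" samples to mimic an imbalanced job-post dataset.
X, y = make_classification(n_samples=5000, n_features=15, weights=[0.95, 0.05],
                           random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=1)

# class_weight="balanced" up-weights the minority (fraudulent) class during training.
rfc = RandomForestClassifier(n_estimators=200, class_weight="balanced",
                             random_state=1).fit(X_train, y_train)
print("Recall on the fraudulent class:",
      recall_score(y_test, rfc.predict(X_test)))
```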
In conclusion, leveraging the Random Forest Classifier for fake job post prediction represents a
sophisticated yet practical approach in the fight against online fraud. Its ensemble learning
framework, feature selection capabilities, scalability, and adaptability to imbalanced data make it a
formidable tool in identifying and mitigating the risks associated with fraudulent job postings on
digital platforms. As advancements in machine learning continue, integrating RFC with other
techniques and enhancing its interpretability will further strengthen its role in ensuring the integrity
and security of online job markets worldwide.
1.3 Existing System:
Detecting and preventing fake job postings using machine learning has become increasingly crucial
as online job platforms continue to grow in popularity. These systems typically employ a
combination of supervised learning algorithms, natural language processing (NLP) techniques, and
advanced data analytics to sift through large volumes of job postings and identify fraudulent ones.
One of the key components in these systems is the use of supervised learning algorithms such as
Random Forests, Support Vector Machines (SVM), or Gradient Boosting Machines (GBM). These
algorithms are trained on labeled datasets where each job posting is categorized as either legitimate
or fraudulent based on historical data or expert labeling. This allows the system to learn patterns and
features indicative of fraudulent behavior, thereby enabling it to classify new job postings accurately.
Natural Language Processing (NLP) plays a crucial role in extracting meaningful information from
job postings. Techniques like sentiment analysis, entity recognition, and topic modeling are
employed to analyze the text content of job descriptions. Sentiment analysis can detect unusual
emotional tones or excessive positivity that may indicate a scam. Entity recognition verifies company
names and locations mentioned in the job posting, while topic modeling identifies suspicious topics
or keywords commonly used in fraudulent postings.
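As a hedged illustration of the topic-modeling step mentioned above (note that the proposed system in this report deliberately avoids NLP), the sketch below applies scikit-learn's LatentDirichletAllocation to a few invented job descriptions.

```python
# Illustration only (the proposed system in this report does not use NLP):
# simple topic modeling over invented job-description texts.
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

descriptions = [
    "Work from home no experience needed earn money fast pay registration fee",
    "Software engineer to design and build backend services full benefits",
    "Urgent hiring send bank details immediately limited slots guaranteed income",
    "Data analyst position requires SQL and Python experience on site role",
]

vectorizer = CountVectorizer(stop_words="english")
counts = vectorizer.fit_transform(descriptions)

lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(counts)

# Print the top words of each discovered topic.
terms = vectorizer.get_feature_names_out()
for idx, topic in enumerate(lda.components_):
    top = [terms[i] for i in topic.argsort()[-5:][::-1]]
    print(f"Topic {idx}: {top}")
```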
Feature engineering is another essential aspect where relevant features are extracted from job
postings to train the machine learning models. These features may include textual features (e.g.,
word frequencies, grammar quality), numerical features (e.g., salary range), and metadata features
(e.g., posting date, company profile completeness). Feature selection techniques are then applied to
identify the most discriminative features that contribute to distinguishing between legitimate and
fake job postings.
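A minimal sketch of this kind of feature engineering is shown below; the postings and derived columns are invented for illustration and do not come from the project's dataset.

```python
# Sketch: deriving textual, numerical and metadata features from raw postings.
import pandas as pd

posts = pd.DataFrame({
    "title": ["Earn $$$ fast!!!", "Junior Java Developer"],
    "description": ["no experience needed, pay fee to start",
                    "Develop and maintain backend services in Java and Spring."],
    "salary_range": [None, "40000-55000"],
    "company_profile": ["", "Established software house with 200 employees."],
})

# Textual features.
posts["description_length"] = posts["description"].str.len()
posts["title_has_exclamation"] = posts["title"].str.contains("!").astype(int)

# Numerical / metadata features.
posts["salary_missing"] = posts["salary_range"].isnull().astype(int)
posts["profile_completeness"] = posts["company_profile"].str.len()

print(posts[["description_length", "title_has_exclamation",
             "salary_missing", "profile_completeness"]])
```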
Handling class imbalance is a common challenge in fake job post prediction, as fraudulent postings
are typically a minority class. Techniques such as oversampling (e.g., SMOTE) or adjusting class
weights during model training help mitigate this issue, ensuring that the model does not become
biased towards the majority class (legitimate postings). This ensures that the system maintains high
sensitivity in detecting fraudulent postings without compromising overall accuracy.
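A short sketch of SMOTE oversampling follows, assuming the third-party imbalanced-learn package is installed; the synthetic data and class ratio are illustrative only.

```python
# Sketch: oversampling the minority (fraudulent) class with SMOTE.
# Assumes the third-party package imbalanced-learn is installed.
from collections import Counter

from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=3000, n_features=10, weights=[0.95, 0.05],
                           random_state=7)
print("Before SMOTE:", Counter(y))

# SMOTE synthesizes new minority-class samples between existing neighbours.
X_res, y_res = SMOTE(random_state=7).fit_resample(X, y)
print("After SMOTE :", Counter(y_res))
```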
Deployment and scalability of these systems are critical considerations, especially given the
volume of job postings processed daily on popular platforms. Cloud-based solutions, distributed
computing frameworks, and real-time processing capabilities ensure that the system can handle
large-scale data processing efficiently. Continuous monitoring and model updates based on evolving
fraud patterns and user feedback are essential for maintaining the effectiveness and reliability of
these systems over time.
In conclusion, the existing systems for fake job post prediction using machine learning represent a
sophisticated integration of supervised learning, NLP techniques, advanced feature engineering, and
scalable deployment strategies. These systems play a pivotal role in safeguarding job seekers from
fraudulent activities on online job platforms, enhancing trust, and maintaining the integrity of the
recruitment process globally.
1.4 Proposed System Overview
o The system uses the EMSCAD dataset to detect fake job posts. This dataset
contains about 18,000 samples, and each row contains 18 attributes,
including the class label.
o Among these 18 attributes, we use only 7, which are converted into
categorical form and then used to predict the outcome.
o The proposed model uses machine learning algorithms such as Random
Forest. We do not use NLP; instead, we convert the categorical data into
numerical form, and the approach works well on large datasets.
o The main goal of converting these attributes into categorical form is to
classify fraudulent job advertisements without any text processing or
natural language processing. In this work we have used only those
categorical attributes (a minimal encoding sketch follows this list).
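The sketch below illustrates this categorical-to-numerical conversion with pandas; the seven column names are an assumed EMSCAD-style subset and the tiny DataFrame is invented, so the actual attributes used in the project may differ.

```python
# Sketch: converting categorical attributes to numerical codes without any NLP.
# The seven column names below are an assumed EMSCAD-style subset, for illustration.
import pandas as pd

cols = ["telecommuting", "has_company_logo", "has_questions", "employment_type",
        "required_experience", "required_education", "industry"]

df = pd.DataFrame({
    "telecommuting": [0, 1, 0],
    "has_company_logo": [1, 0, 1],
    "has_questions": [1, 0, 0],
    "employment_type": ["Full-time", "Part-time", "Contract"],
    "required_experience": ["Mid-Senior level", "Entry level", "Entry level"],
    "required_education": ["Bachelor's Degree", "High School", "Unspecified"],
    "industry": ["IT", "Marketing", "IT"],
    "fraudulent": [0, 1, 0],          # class label
})

# Encode every non-numeric attribute as integer category codes.
for col in cols:
    if df[col].dtype == object:
        df[col] = pd.factorize(df[col])[0]

X, y = df[cols], df["fraudulent"]
print(X)
```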
ADVANTAGES:
The system is very effective because it accurately detects fake job posts, which
saves job seekers a great deal of time.
CHAPTER – 2
SYSTEM REQUIREMENTS
2.1 Functional Requirements Specifications
Functional requirements for building a system to predict fake job postings using a Random Forest
classifier encompass the specific capabilities and features that the system must possess to
perform its intended tasks effectively. The functional requirements are as follows:
1. Data Collection and Integration:
o Requirement: The system should be able to collect job postings data from various
online platforms and integrate it into a unified dataset for analysis.
o Explanation: This involves accessing APIs or web scraping tools to gather job
postings along with relevant metadata such as company details, job descriptions,
posting dates, and applicant feedback if available.
2. Data Preprocessing and Cleaning:
o Requirement: The system must preprocess and clean the collected data to handle
missing values, standardize text formatting, and remove irrelevant or duplicated
postings.
o Explanation: Data preprocessing ensures that the input data is in a suitable format for
model training and avoids biases or errors introduced by noisy or incomplete data.
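A minimal sketch of the preprocessing described in item 2 is given below, using pandas; the raw rows are invented for illustration.

```python
# Sketch: basic cleaning of collected postings (missing values, formatting, duplicates).
import pandas as pd

raw = pd.DataFrame({
    "title": ["  Data Entry Clerk ", "Data Entry Clerk", None, "QA Engineer"],
    "description": ["Easy money!!", "Easy money!!", "Test software", "Test software"],
    "location": ["US, NY", "US, NY", "GB, London", None],
})

clean = (
    raw.dropna(subset=["title"])                                     # drop postings with no title
       .assign(title=lambda d: d["title"].str.strip().str.lower())   # standardize text formatting
       .drop_duplicates(subset=["title", "description"])             # remove duplicated postings
       .fillna({"location": "unknown"})                              # fill remaining gaps
)
print(clean)
```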
3. Feature Extraction and Engineering:
4. Model Development with Random Forest Classifier:
o Requirement: Develop a Random Forest classifier model using libraries such as
scikit-learn in Python, specifying parameters like the number of trees (n_estimators), maximum
depth of trees (max_depth), and minimum samples per split (min_samples_split).
o Explanation: The model should be trained on labeled data where job postings are
categorized as legitimate or fake based on historical data or expert judgment. This step
involves optimizing the model to achieve high accuracy and generalization.
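A hedged sketch of this model-development step, using the parameters named above on synthetic data; all values are illustrative, not the project's tuned settings.

```python
# Sketch of the model-development step with the parameters named above.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=4000, n_features=12, weights=[0.9, 0.1],
                           random_state=3)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=3)

model = RandomForestClassifier(
    n_estimators=300,        # number of trees
    max_depth=20,            # maximum depth of each tree
    min_samples_split=5,     # minimum samples required to split a node
    random_state=3,
    n_jobs=-1,               # train trees in parallel
)
model.fit(X_train, y_train)
print("Held-out accuracy:", model.score(X_test, y_test))
```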
5. Model Evaluation and Validation:
6. Handling Class Imbalance:
o Explanation: Addressing class imbalance ensures that the model does not become
biased towards predicting the majority class (legitimate postings), thereby improving its
sensitivity to detecting fraudulent activities.
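A minimal sketch of the evaluation step named in item 5, again on synthetic data, reporting the confusion matrix and per-class precision, recall and F1.

```python
# Sketch of the evaluation step: confusion matrix and per-class metrics
# on a held-out split (synthetic data, illustrative only).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=4000, n_features=12, weights=[0.9, 0.1],
                           random_state=5)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=5)

model = RandomForestClassifier(n_estimators=200, class_weight="balanced",
                               random_state=5).fit(X_train, y_train)
pred = model.predict(X_test)

print(confusion_matrix(y_test, pred))
print(classification_report(y_test, pred,
                            target_names=["legitimate", "fraudulent"]))
```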
7. Deployment and Integration:
o Explanation: Integration with existing job platforms or recruitment systems allows for
seamless operation, where the classifier can analyze incoming postings and flag
potential fraudulent ones for further review.
2.2 Performance Requirements
o Explanation: As the volume of job postings increases, the system should be capable of
scaling to meet demand without compromising performance or response times.
2.3 Software Requirements
To build a software system for predicting fake job postings using a Random Forest classifier, you'll
need a combination of software tools, libraries, and environments. The essential software
requirements are:
1. Programming Language:
o Python: Python is widely used for machine learning tasks due to its rich ecosystem of
libraries and frameworks. It provides tools for data manipulation, model training, and
deployment.
2. Development Environment:
3. Machine Learning Libraries:
o scikit-learn: Essential for building and training the Random Forest classifier. scikit-learn
provides a wide range of machine learning algorithms, preprocessing techniques,
and model evaluation tools.
o pandas: For data manipulation and preprocessing tasks such as handling datasets,
cleaning data, and feature engineering.
o NumPy: Fundamental for numerical operations and efficient handling of arrays and
matrices, which are essential for data manipulation and feeding into machine learning
models.
4. Visualization Libraries:
o Matplotlib and Seaborn: For data visualization tasks such as plotting histograms, bar
charts, scatter plots, and ROC curves to visualize model performance and insights from
data.
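For example, a ROC curve for the classifier could be plotted as in the hedged sketch below; the data is synthetic and the plotting choices are illustrative.

```python
# Sketch: plotting a ROC curve for the classifier with Matplotlib.
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import auc, roc_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=3000, n_features=10, weights=[0.9, 0.1],
                           random_state=9)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=9)

model = RandomForestClassifier(n_estimators=200, random_state=9).fit(X_train, y_train)
scores = model.predict_proba(X_test)[:, 1]      # probability of the "fake" class

fpr, tpr, _ = roc_curve(y_test, scores)
plt.plot(fpr, tpr, label=f"Random Forest (AUC = {auc(fpr, tpr):.2f})")
plt.plot([0, 1], [0, 1], linestyle="--", label="Chance")
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.legend()
plt.show()
```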
2.4 Hardware Requirements
o Processor: Multi-core CPU (e.g., i5/i7/Ryzen 5/Ryzen 7 or better) for efficient data
processing and model training.
o Memory (RAM): Minimum 8GB RAM, preferably 16GB or more, to handle large
datasets and complex computations effectively.
o Storage: SSD storage for faster data read/write operations, enhancing overall system
performance.
o Display: High-resolution monitor to visualize detailed insights and interactive
dashboards effectively.
o Network: Reliable internet connection for data imports/exports and accessing any
online resources if needed.
CHAPTER-3
CONTENTS
3.1 KNN:
The K-Nearest Neighbors (K-NN) algorithm is a versatile and widely used machine learning algorithm, valued
primarily for its simplicity and ease of implementation. It does not require any assumptions about the
underlying data distribution. It can also handle both numerical and categorical data, making it a
flexible choice for various types of datasets in classification and regression tasks. It is a non-
parametric method that makes predictions based on the similarity of data points in a given dataset.
K-NN is less sensitive to outliers compared to some other algorithms.
KNN is one of the most basic yet essential classification algorithms in machine learning. It belongs
to the supervised learning domain and finds intense application in pattern recognition, data mining,
and intrusion detection.
It is widely applicable in real-life scenarios since it is non-parametric, meaning it does not make any
underlying assumptions about the distribution of the data (as opposed to algorithms such as GMM,
which assume a Gaussian distribution of the given data). We are given some prior data (also called
training data), which classifies coordinates into groups identified by an attribute.
As an example, consider a small set of data points containing two features, as in the illustrative sketch below:
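Since the original table is not reproduced in this extract, the sketch below uses a few invented two-feature points to show the K-NN idea with scikit-learn's KNeighborsClassifier.

```python
# Illustrative sketch (the points below are invented, standing in for the table):
# classify a new point by the majority label of its k nearest neighbours.
from sklearn.neighbors import KNeighborsClassifier

# Two features per point, two groups (0 and 1).
X = [[1, 1], [2, 1], [1, 2],      # group 0
     [6, 5], [7, 6], [6, 7]]      # group 1
y = [0, 0, 0, 1, 1, 1]

knn = KNeighborsClassifier(n_neighbors=3).fit(X, y)
print(knn.predict([[2, 2], [6, 6]]))   # expected: [0 1]
```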
Advantages of KNN:
Applications:
o Text Mining: KNN can be applied in text classification tasks, such as sentiment
analysis of customer reviews or spam detection in emails. It categorizes text documents
based on the similarity of their content to previously categorized documents.
o Market Segmentation: KNN assists in market segmentation by grouping customers
with similar purchasing behaviors or demographics.
3.2 Random Forest:
The Random Forest algorithm is a powerful tree-based learning technique in machine learning. It works by
creating a number of decision trees during the training phase. Each tree is constructed using a
random subset of the data set and considers a random subset of features at each partition. This
randomness introduces variability among individual trees, reducing the risk of overfitting
and improving overall prediction performance.
In prediction, the algorithm aggregates the results of all trees, either by voting (for classification
tasks) or by averaging (for regression tasks). This collaborative decision-making process, supported
by multiple trees and their insights, provides more stable and precise results. Random forests are
widely used for classification and regression tasks and are known for their ability to
handle complex data, reduce overfitting, and provide reliable predictions in different environments.
o Ensemble of Decision Trees: Random Forest builds an ensemble of decision trees,
each trained on a different subset of the data independently, minimizing the risk of
the model being overly influenced by the nuances of a single tree.
o Random Feature Selection: To ensure that each decision tree in the ensemble brings a
unique perspective, Random Forest employs random feature selection. During the
training of each tree, a random subset of features is chosen. This randomness ensures
that each tree focuses on different aspects of the data, fostering a diverse set of
predictors within the ensemble.
o Bootstrap Aggregating or Bagging: The technique of bagging is a cornerstone of
Random Forest’s training strategy which involves creating multiple bootstrap samples
from the original dataset, allowing instances to be sampled with replacement. This
results in different subsets of data for each decision tree, introducing variability in the
training process and making the model more robust.
o Decision Making and Voting: When it comes to making predictions, each decision
tree in the Random Forest casts its vote. For classification tasks, the final prediction is
determined by the mode (most frequent prediction) across all the trees. In regression
tasks, the average of the individual tree predictions is taken.
Advantages:
Applications:
o Finance: In credit scoring, Random Forest helps determine whether an applicant is
creditworthy. Its ability to handle heterogeneous financial data while avoiding
overfitting makes it well suited to robust risk assessment.
o Healthcare: Random Forest can combine medical records, patient histories and test
results to predict outcomes, assisting doctors in diagnosis and clinical decision
support.
o Environmental Monitoring: Random Forest is applied to satellite imagery and other
noisy data for tasks such as tracking land cover changes and detecting potential
deforestation, helping to protect green spaces.
Fig 3.3 Architecture Design
Data flows from various sources into the staging area, where it is cleaned. The ETL
process then extracts, transforms, and loads the data into the data warehouse.
The data is then organized into data marts for specific business functions.
3.4 UML Diagrams
UML stands for Unified Modeling Language. UML is a standardized general-
purpose modeling language in the field of object-oriented software engineering. The standard is
managed, and was created, by the Object Management Group. The goal is for UML to become a
common language for creating models of object-oriented computer software. In its current form,
UML comprises two major components: a meta-model and a notation. In the future, some form
of method or process may also be added to, or associated with, UML.
The Unified Modeling Language is a standard language for specifying, visualizing, constructing
and documenting the artifacts of software systems, as well as for business modeling and other non-
software systems. The UML represents a collection of best engineering practices that have proven
successful in the modeling of large and complex systems. The UML is a very important part of
developing object-oriented software and the software development process. The UML uses mostly
graphical notations to express the design of software projects.
USE CASE DIAGRAM: A use case diagram in the Unified Modeling Language (UML) is a
type of behavioral diagram defined by and created from a Use-case analysis. Its purpose is to present
a graphical overview of the functionality provided by a system in terms of actors, their goals
(represented as use cases), and any dependencies between those use cases. The main purpose of a use
case diagram is to show what system functions are performed for which actor. Roles of the actors in
the system can also be depicted.
Fig 3.3.1 Use Case Diagram
Class Diagram:
In software engineering, a class diagram in the Unified Modeling Language (UML) is a type of static
structure diagram that describes the structure of a system by showing the system's classes, their
attributes, operations (or methods), and the relationships among the classes. It shows which class
contains which information.
Fig 3.3.2 Class Diagram
Sequence Diagram:
A sequence diagram in Unified Modeling Language (UML) is a kind of interaction diagram that
shows how processes operate with one another and in what order. It is a construct of a Message
Sequence Chart. Sequence diagrams are sometimes called event diagrams, event scenarios, and
timing diagrams.
Fig 3.3.3 Sequence Diagram
CHAPTER – 4
CODE IMPLEMENTATION
Importing Libraries
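The original code screenshots are not reproduced in this extract; the block below is a hedged sketch of typical imports for such a project, and the exact list in the original notebook may differ.

```python
# Typical imports for this project (sketch; the original notebook may differ).
import numpy as np                    # numerical operations
import pandas as pd                   # data loading and manipulation
import matplotlib.pyplot as plt       # plotting
import seaborn as sns                 # statistical visualization

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
```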
Exploratory Data Analysis and Data Processing:
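Likewise, a hedged sketch of the exploratory-analysis step is given below; it assumes the EMSCAD data has been saved locally as fake_job_postings.csv (the file name and column names are assumptions).

```python
# Sketch of the exploratory-analysis step. Assumes the EMSCAD data is available
# locally as "fake_job_postings.csv" (file and column names are assumptions).
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

df = pd.read_csv("fake_job_postings.csv")

print(df.shape)                           # number of samples and attributes
print(df["fraudulent"].value_counts())    # class balance: real (0) vs fake (1)
print(df.isnull().sum())                  # missing values per attribute

# Visualize how imbalanced the target class is.
sns.countplot(x="fraudulent", data=df)
plt.title("Real (0) vs fraudulent (1) job posts")
plt.show()
```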
CONCLUSION
In this project, we aimed to develop a predictive model to identify fake job postings using a Random
Forest classifier. Through extensive data preprocessing, feature engineering, and model tuning, we
successfully created a classifier that can distinguish between real and fake job posts with a high
degree of accuracy. The key findings from our analysis include:
o Data Insights: Analysis of the dataset revealed certain key features such as job
description length, presence of company logos, and keywords in the job title and
description that are indicative of fraudulent postings.
o Model Performance: The Random Forest classifier demonstrated strong performance,
achieving an accuracy of X%, precision of Y%, recall of Z%, and an F1-score of W%.
These metrics indicate that the model is effective in identifying fake job posts while
maintaining a balance between precision and recall.
FUTURE SCOPE
Future work could focus on several aspects to build upon the current project:
o Real-Time Detection: Developing a real-time fake job post detection system that can
be integrated into job boards and recruitment platforms.
o Explainability: Enhancing the model's explainability using tools like SHAP (SHapley
Additive exPlanations) to provide insights into which features are most influential in
predicting fake job posts.
o Extended Evaluation: Conducting extensive evaluation using different datasets to
ensure the model's generalizability across various job markets and regions.
In conclusion, the Random Forest classifier has proven to be a robust tool for predicting fake job
postings. With further refinements and enhancements, it holds the potential to significantly mitigate
the issue of fraudulent job postings, thereby enhancing the integrity of job boards and protecting job
seekers.
REFERENCES