
A Summer Industry Internship - I

Report
On

“Fake Job Post Prediction using Machine Learning”

Submitted in Partial Fulfilment of the Requirements

For the award of the degree of

Bachelor of Technology in
Electronics and Computer Engineering (ECM)

By

SAI RAM NARABOINA

(22311A1988)

Under the Supervision/Guidance of


Mrs. K. Naga Sailaja
Assistant Professor
Dept. of ECM

Department of Electronics & Computer Engineering

Sreenidhi Institute of Science & Technology (Autonomous)

2024 – 2025
DEPARTMENT OF ELECTRONICS & COMPUTER ENGINEERING
SREENIDHI INSTITUTE OF SCIENCE AND TECHNOLOGY
(AUTONOMOUS)

CERTIFICATE

This is to certify that the project entitled "Fake Job Post Prediction using Machine
Learning", submitted by N SAI RAM bearing Roll No. 22311A1988 towards partial
fulfilment of the requirements for the award of the Bachelor's Degree in Electronics &
Computer Engineering from Sreenidhi Institute of Science and Technology, Yamnampet,
Hyderabad, is a record of bona fide work done by him during the academic year 2023-2024.
The results embodied in this work have not been submitted to any other university or
institute for the award of any degree or diploma.

Coordinator HOD, ECM

Mrs. K. Naga Sailaja    Dr. D. Mohan

Assistant Professor    Head of Department

External Examiner
ACKNOWLEDGEMENT

We owe a great many thanks to a great many people who have helped and supported us throughout
this project, which would not have taken shape without their cooperation. Thanks to all.

We express our profound gratitude to Dr. T. Ch. Siva Reddy, Principal, and are indebted to the
management of Sreenidhi Institute of Science and Technology, Ghatkesar, for their constructive
criticism.

We would like to specially thank our beloved Dr. D. Mohan, Professor & Head of Department,
ECM, for his guidance, inspiration and constant encouragement throughout this research work.

We would like to express our deep gratitude to Mrs. K. Naga Sailaja, Assistant Professor
(Coordinator), for her timely guidance, moral support and personal supervision throughout the
project.

These few words would never be complete if we did not mention our thanks to our parents, the
department laboratory staff members, and all our friends, without whose cooperation this project
could not have become a reality.

N SAI RAM (22311A1988)


DECLARATION

This is to certify that the work reported in the present group project titled "Fake Job Post Prediction
using Machine Learning" is a record of the work done by us in the Department of Electronics and
Computer Engineering, Sreenidhi Institute of Science and Technology, Yamnampet, Ghatkesar.
The report is based on project work done entirely by us and has not been copied from any other source.

N SAI RAM (22311A1988)


ABSTRACT

In recent years, owing to advances in modern technology and social communication, advertising
new job posts online has become very common, and fraudulent job postings have therefore become
a serious concern. Like many other classification tasks, fake job posting prediction poses several
challenges. This work proposes the use of different machine learning techniques and classification
algorithms, such as KNN, the random forest classifier, the multilayer perceptron and a deep neural
network, to predict whether a job post is real or fraudulent. Experiments were carried out on the
Employment Scam Aegean Dataset (EMSCAD), which contains about 18,000 samples.

The random forest technique, which uses multiple decision trees, performs very well as a classifier
for this task. The trained classifiers achieve approximately 96% classification accuracy (DNN) in
predicting fraudulent job posts.

LIST OF CONTENTS

Abstract i
List of Contents and Figures ii-iv
Chapter– 1 Introduction 1-4
1.1 Introduction 1
1.2 Project Scope 1
1.3 Existing System 2
1.4 Proposed System 4
Chapter-2 System Requirements 5-9
2.1 Functional Requirements Specifications 5

2.2 Performance Requirements 7

2.3 Software Requirements 7

2.4 Hardware Requirements 9

Chapter-3 Contents 10-15


3.1 KNN 10
3.2 Random Forest 11
3.3 Architecture Diagrams 15
Chapter -4 Code Implementation 16-23
Conclusion 24
Future Scope 25
References 26

LIST OF FIGURES

FIGURE NO. FIGURE NAME PAGE NO.

Fig: 3.1 KNN- Data points containing two features 10

Fig: 3.2 Random Forest Algorithm in ML 12

Fig: 3.3 Architecture Diagram 15

Fig: 3.3.1 Use Case Diagram 17

Fig: 3.3.2 Class Diagram 17

Fig: 3.3.3 Sequence Diagram 18

CHAPTER-1
INTRODUCTION
1.1 Introduction:

Detecting fake job postings is a critical challenge in today's digital landscape, where online platforms
are increasingly targeted by scammers. Employing machine learning techniques like the Random
Forest Classifier (RFC) offers a robust approach to this problem. RFC is well-suited for this task due
to its ability to handle complex datasets with numerous features, while also providing insights into
feature importance, which is crucial for understanding the characteristics of legitimate versus
fraudulent job posts.

One key advantage of using RFC for fake job post prediction lies in its ensemble learning nature.
RFC combines multiple decision trees, each trained on different subsets of data and features, thereby
reducing overfitting and enhancing generalization. This is particularly beneficial in identifying
patterns in job descriptions, company details, and other textual or numerical features that distinguish
genuine job opportunities from fraudulent ones.

1.2 Scope:

Moreover, RFC facilitates effective feature selection and model interpretation. By analyzing feature
importance scores derived from RFC, researchers and analysts can pinpoint which attributes play the
most significant role in distinguishing between real and fake job postings. This insight not only
improves model accuracy but also provides actionable intelligence for refining detection strategies
and enhancing the overall effectiveness of fraud prevention measures.
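As an illustration of how such feature importance scores can be read off a trained model, the following is a minimal sketch using scikit-learn; the attribute names and the random data are placeholders for illustration only, not the attributes or results of this project.

# Minimal sketch: inspecting feature importances of a Random Forest
# trained on placeholder data (the feature names below are illustrative,
# not the attributes used in the actual EMSCAD experiments).
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(42)
feature_names = ["telecommuting", "has_company_logo", "has_questions",
                 "employment_type", "required_experience",
                 "required_education", "industry"]
X = pd.DataFrame(rng.integers(0, 5, size=(500, len(feature_names))),
                 columns=feature_names)
y = rng.integers(0, 2, size=500)          # 0 = legitimate, 1 = fraudulent

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X, y)

# Rank attributes by how much they contribute to the split decisions.
importances = pd.Series(model.feature_importances_, index=feature_names)
print(importances.sort_values(ascending=False))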

Another compelling aspect of using RFC is its scalability and efficiency in handling large volumes of
data. With the proliferation of online job platforms and the sheer number of job postings generated
daily, scalability is paramount. RFC's parallel processing capabilities and efficient handling of high-
dimensional data make it suitable for real-time or batch processing scenarios, ensuring timely
detection and mitigation of fraudulent activities.

Furthermore, the adaptability of RFC to different types of data and its ability to handle imbalanced
datasets are noteworthy. Imbalanced datasets, where instances of fraudulent job postings are
significantly outnumbered by legitimate ones, are common in this domain. RFC can be enhanced

with techniques such as class weights or resampling methods to address this imbalance effectively,
thereby improving the model's sensitivity to detecting fraudulent posts without compromising overall
accuracy.

In conclusion, leveraging the Random Forest Classifier for fake job post prediction represents a
sophisticated yet practical approach in the fight against online fraud. Its ensemble learning
framework, feature selection capabilities, scalability, and adaptability to imbalanced data make it a
formidable tool in identifying and mitigating the risks associated with fraudulent job postings on
digital platforms. As advancements in machine learning continue, integrating RFC with other
techniques and enhancing its interpretability will further strengthen its role in ensuring the integrity
and security of online job markets worldwide.

1.3 Overview of existing system:

Detecting and preventing fake job postings using machine learning has become increasingly crucial
as online job platforms continue to grow in popularity. These systems typically employ a
combination of supervised learning algorithms, natural language processing (NLP) techniques, and
advanced data analytics to sift through large volumes of job postings and identify fraudulent ones.

One of the key components in these systems is the use of supervised learning algorithms such as
Random Forests, Support Vector Machines (SVM), or Gradient Boosting Machines (GBM). These
algorithms are trained on labeled datasets where each job posting is categorized as either legitimate
or fraudulent based on historical data or expert labeling. This allows the system to learn patterns and
features indicative of fraudulent behavior, thereby enabling it to classify new job postings accurately.

Natural Language Processing (NLP) plays a crucial role in extracting meaningful information from
job postings. Techniques like sentiment analysis, entity recognition, and topic modeling are
employed to analyze the text content of job descriptions. Sentiment analysis can detect unusual
emotional tones or excessive positivity that may indicate a scam. Entity recognition verifies company
names and locations mentioned in the job posting, while topic modeling identifies suspicious topics
or keywords commonly used in fraudulent postings

Feature engineering is another essential aspect where relevant features are extracted from job
postings to train the machine learning models. These features may include textual features (e.g.,
word frequencies, grammar quality), numerical features (e.g., salary range), and metadata features
(e.g., posting date, company profile completeness). Feature selection techniques are then applied to
identify the most discriminative features that contribute to distinguishing between legitimate and
fake job postings.
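To make the feature engineering step above concrete, the following is a minimal sketch that combines TF-IDF text features with simple metadata features using scikit-learn; the column names and example rows are illustrative assumptions rather than the actual dataset schema.

# Sketch: combining TF-IDF text features with simple metadata features.
# Column names ("description", "has_company_logo", "telecommuting") are
# illustrative; adjust them to the actual dataset schema.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline

df = pd.DataFrame({
    "description": ["Work from home, no experience needed, earn money fast",
                    "Senior backend engineer for a fintech product team"],
    "has_company_logo": [0, 1],
    "telecommuting": [1, 0],
    "fraudulent": [1, 0],
})

preprocess = ColumnTransformer([
    ("text", TfidfVectorizer(ngram_range=(1, 2), max_features=5000), "description"),
    ("meta", "passthrough", ["has_company_logo", "telecommuting"]),
])

pipeline = Pipeline([
    ("features", preprocess),
    ("clf", RandomForestClassifier(n_estimators=200, random_state=0)),
])
pipeline.fit(df.drop(columns="fraudulent"), df["fraudulent"])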

Handling class imbalance is a common challenge in fake job post prediction, as fraudulent postings
are typically a minority class. Techniques such as oversampling (e.g., SMOTE) or adjusting class
weights during model training help mitigate this issue, ensuring that the model does not become
biased towards the majority class (legitimate postings). This ensures that the system maintains high
sensitivity in detecting fraudulent postings without compromising overall accuracy.
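The two mitigation strategies mentioned above can be sketched as follows; this is a minimal illustration on synthetic, deliberately imbalanced data, and the SMOTE option assumes the separate imbalanced-learn package is available.

# Sketch: two common ways to handle class imbalance, shown on synthetic data
# with a 5% minority ("fraudulent") class.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=2000, n_features=10,
                           weights=[0.95, 0.05], random_state=0)

# Option 1: class weighting built into scikit-learn.
clf = RandomForestClassifier(n_estimators=200, class_weight="balanced",
                             random_state=0)
clf.fit(X, y)

# Option 2: SMOTE oversampling of the minority class
# (requires the imbalanced-learn package: pip install imbalanced-learn).
from imblearn.over_sampling import SMOTE
X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
clf.fit(X_res, y_res)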

Deployment and scalability of these systems are critical considerations, especially given the
volume of job postings processed daily on popular platforms. Cloud-based solutions, distributed
computing frameworks, and real-time processing capabilities ensure that the system can handle
large-scale data processing efficiently. Continuous monitoring and model updates based on evolving
fraud patterns and user feedback are essential for maintaining the effectiveness and reliability of
these systems over time.

In conclusion, the existing systems for fake job post prediction using machine learning represent a
sophisticated integration of supervised learning, NLP techniques, advanced feature engineering, and
scalable deployment strategies. These systems play a pivotal role in safeguarding job seekers from
fraudulent activities on online job platforms, enhancing trust, and maintaining the integrity of the
recruitment process globally.

1.4 Proposed System Overview

o The system uses the EMSCAD dataset to detect fake job posts. This dataset
contains about 18,000 samples, and each row contains about 18 attributes,
including the class label.
o Among these 18 attributes, we have used only 7, which are converted into
categorical attributes and then used to predict the outcome.
o The proposed model uses a modern machine learning algorithm, the Random
Forest classifier. We do not use NLP; instead, we convert the categorical data
into numerical form, avoiding text processing altogether. This approach also
works well on large datasets.
o The main goal of converting these attributes into categorical form is to
classify fraudulent job advertisements without any text processing or natural
language processing. In this work we have used only those categorical
attributes (a minimal sketch of this encoding approach is shown below).
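The following is a minimal sketch of this "categorical attributes only, no NLP" approach, assuming pandas and scikit-learn; the file path and the list of seven attribute names are placeholders and should be replaced with the ones actually used in the experiments.

# Sketch: predicting fraudulent posts from categorical attributes only,
# without any text processing. The file path and attribute list are
# assumptions based on typical EMSCAD columns.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

df = pd.read_csv("emscad.csv")          # hypothetical path to the EMSCAD CSV

categorical_cols = ["telecommuting", "has_company_logo", "has_questions",
                    "employment_type", "required_experience",
                    "required_education", "industry"]

# Convert each categorical attribute to integer codes (no NLP involved).
X = df[categorical_cols].astype("category").apply(lambda col: col.cat.codes)
y = df["fraudulent"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

model = RandomForestClassifier(n_estimators=200, random_state=42)
model.fit(X_train, y_train)
print("Test accuracy:", model.score(X_test, y_test))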

ADVANTAGES:

The proposed system is very fast and accurate.

The system is very effective because accurate detection of fake job posts saves
job seekers a great deal of time.

CHAPTER – 2

SYSTEM REQUIREMENTS

2.1 Functional Requirements Specifications

Functional requirements for building a system to predict fake job postings using a Random Forest
classifier typically encompass specific capabilities and features that the system must possess to
effectively perform its intended tasks. Here are the functional requirements:

1. Data Collection and Integration:

o Requirement: The system should be able to collect job postings data from various
online platforms and integrate it into a unified dataset for analysis.

o Explanation: This involves accessing APIs or web scraping tools to gather job
postings along with relevant metadata such as company details, job descriptions,
posting dates, and applicant feedback if available.
2. Data Preprocessing and Cleaning:

o Requirement: The system must preprocess and clean the collected data to handle
missing values, standardize text formatting, and remove irrelevant or duplicated
postings.

o Explanation: Data preprocessing ensures that the input data is in a suitable format for
model training and avoids biases or errors introduced by noisy or incomplete data.
3. Feature Extraction and Engineering:

o Requirement: Implement NLP techniques to extract meaningful features from job
postings, such as word frequencies, n-grams, sentiment analysis scores, and entity
recognition.

o Explanation: Effective feature extraction helps in capturing relevant information from
text data that can distinguish between legitimate and fake job postings. This step is
crucial for inputting structured data into the Random Forest classifier.

4. Model Development with Random Forest Classifier:

o Requirement: Develop a Random Forest classifier model using libraries such as scikit-
learn in Python, specifying parameters like the number of trees (n_estimators), the
maximum depth of trees (max_depth), and the minimum samples per split
(min_samples_split).

o Explanation: The model should be trained on labeled data where job postings are
categorized as legitimate or fake based on historical data or expert judgment. This step
involves optimizing the model to achieve high accuracy and generalization (a minimal
sketch covering requirements 4-6 appears after this list).
5. Model Evaluation and Validation:

o Requirement: Evaluate the performance of the Random Forest classifier using
appropriate metrics such as accuracy, precision, recall, F1-score, and ROC curve
analysis.

o Explanation: Model evaluation ensures that the classifier effectively distinguishes
between legitimate and fake job postings. Validation against a separate test dataset
helps in assessing its generalization ability and robustness.
6. Handling Imbalanced Data:

o Requirement: Implement techniques to handle class imbalance, where fraudulent job
postings are typically fewer than legitimate ones. This may include adjusting class
weights in the Random Forest classifier or using oversampling techniques like SMOTE.

o Explanation: Addressing class imbalance ensures that the model does not become
biased towards predicting the majority class (legitimate postings), thereby improving its
sensitivity to detecting fraudulent activities.
7. Deployment and Integration:

o Requirement: Deploy the trained Random Forest classifier into a production
environment where it can process new job postings in real-time or batch mode.

o Explanation: Integration with existing job platforms or recruitment systems allows for
seamless operation, where the classifier can analyze incoming postings and flag
potential fraudulent ones for further review.
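To tie requirements 4-6 together, the following is a minimal sketch of model development, class-weight handling, and evaluation with scikit-learn; the synthetic data stands in for the preprocessed job-posting features and labels.

# Sketch covering requirements 4-6: model development, imbalance handling,
# and evaluation. Synthetic data stands in for the preprocessed features.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, n_features=20,
                           weights=[0.9, 0.1], random_state=1)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=1)

# Requirement 4: Random Forest with the parameters named above.
model = RandomForestClassifier(n_estimators=300, max_depth=20,
                               min_samples_split=5,
                               class_weight="balanced",   # requirement 6
                               random_state=1)
model.fit(X_train, y_train)

# Requirement 5: accuracy, precision, recall, F1 and ROC-AUC on held-out data.
y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred, digits=3))
print("ROC-AUC:", roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))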

2.2 Performance Requirements

1. Scalability and Performance:

o Requirement: Ensure scalability by leveraging cloud computing resources or
distributed processing frameworks to handle large volumes of data efficiently.

o Explanation: As the volume of job postings increases, the system should be capable of
scaling to meet demand without compromising performance or response times.

2. Monitoring and Maintenance:

o Requirement: Establish mechanisms for monitoring model performance in real-time,
detecting anomalies, and generating alerts for potential fraudulent activities.

o Explanation: Continuous monitoring allows for proactive maintenance and updates to
the classifier based on evolving fraud patterns and feedback from users or stakeholders.
3. Documentation and Reporting:

o Requirement: Generate comprehensive documentation detailing the system
architecture, data sources, preprocessing steps, model development, evaluation metrics,
and deployment procedures.

o Explanation: Clear documentation facilitates understanding, collaboration among team
members, and future enhancements or modifications to the system.

2.3 Software Requirements

To build a software system for predicting fake job postings using a Random Forest classifier, you'll
need a combination of software tools, libraries, and environments. Here are the essential software
requirements:

1. Programming Language:

o Python: Python is widely used for machine learning tasks due to its rich ecosystem of
libraries and frameworks. It provides tools for data manipulation, model training, and
deployment.
2. Development Environment:

o Integrated Development Environment (IDE): Choose an IDE that supports Python
development and provides features such as code completion, debugging, and project
management. Popular choices include PyCharm, VS Code, and Jupyter Notebook for
interactive development.
3. Machine Learning Libraries:

o scikit-learn: Essential for building and training the Random Forest classifier. Scikit-
learn provides a wide range of machine learning algorithms, preprocessing techniques,
and model evaluation tools.
o pandas: For data manipulation and preprocessing tasks such as handling datasets,
cleaning data, and feature engineering.
o NumPy: Fundamental for numerical operations and efficient handling of arrays and
matrices, which are essential for data manipulation and feeding into machine learning
models.
4. Visualization Libraries:

o Matplotlib and Seaborn: For data visualization tasks such as plotting histograms, bar
charts, scatter plots, and ROC curves to visualize model performance and insights from
data.

5. Documentation and Collaboration Tools:

o Jupyter Notebook: For interactive development, experimentation, and documentation
of data preprocessing steps, model training, and evaluation.
o Markdown: For creating documentation files that describe the system architecture,
data sources, preprocessing steps, model development, evaluation metrics, and
deployment procedures.

2.4 Hardware Requirements
o Processor: Multi-core CPU (e.g., i5/i7/Ryzen 5/Ryzen 7 or better) for efficient data
processing and model training.
o Memory (RAM): Minimum 8GB RAM, preferably 16GB or more, to handle large
datasets and complex computations effectively.
o Storage: SSD storage for faster data read/write operations, enhancing overall system
performance.
o Display: High-resolution monitor to visualize detailed insights and interactive
dashboards effectively.
o Network: Reliable internet connection for data imports/exports and accessing any
online resources if needed.

CHAPTER-3

CONTENTS

3.1 KNN:

The K-Nearest Neighbours (K-NN) algorithm is a versatile and widely used machine learning
algorithm, valued primarily for its simplicity and ease of implementation. It does not require any
assumptions about the underlying data distribution. It can also handle both numerical and
categorical data, making it a flexible choice for various types of datasets in classification and
regression tasks. It is a non-parametric method that makes predictions based on the similarity of
data points in a given dataset. K-NN is less sensitive to outliers compared to other algorithms.

KNN is one of the most basic yet essential classification algorithms in machine learning. It belongs
to the supervised learning domain and finds intense application in pattern recognition, data mining,
and intrusion detection.

It is widely applicable in real-life scenarios since it is non-parametric, meaning it does not make any
underlying assumptions about the distribution of the data (as opposed to other algorithms such as
GMM, which assume a Gaussian distribution of the given data). We are given some prior data (also
called training data), which classifies coordinates into groups identified by an attribute.

As an example, consider the following table of data points containing two features:

Fig 3.1 Data points containing two features

Advantages of KNN:

o Easy to implement as the complexity of the algorithm is not that high.


o Adapts Easily – The KNN algorithm stores all the training data in memory, so
whenever a new example or data point is added, the algorithm adjusts to it and the new
example contributes to future predictions as well.
o Few Hyperparameters – The only parameters required when training a KNN
algorithm are the value of k and the choice of distance metric used to compare
data points.
o No Training Phase: Unlike many other machine learning algorithms, KNN does not
require a training phase. It stores all training data and uses it during the prediction
phase. This characteristic makes it easy to update the model with new data without
having to retrain the entire model.
o Versatility: KNN can be applied to both classification and regression problems. In
classification, it predicts the class membership of a data point based on the majority
class among its k-nearest neighbors. In regression, it predicts the value of a continuous
target variable by averaging the values of its k-nearest neighbors.

Applications:

o Recommendation Systems: KNN is used extensively in recommendation systems,


such as recommending movies, products, or services to users based on the preferences
of similar users. It identifies similar users (neighbors) and recommends items that those
users have liked or purchased.
o Healthcare: In medical diagnosis, KNN can assist in identifying diseases based on
patient symptoms and medical history. By comparing a patient's symptoms with similar
cases in the dataset, KNN can suggest potential diagnoses or treatment options.
o Image Recognition: KNN is used in image recognition applications, such as facial
recognition and object detection. It classifies images by comparing them with stored
images in the dataset and identifying the closest matches.

o Text Mining: KNN can be applied in text classification tasks, such as sentiment
analysis of customer reviews or spam detection in emails. It categorizes text documents
based on the similarity of their content to previously categorized documents.
o Market Segmentation: KNN assists in market segmentation by grouping customers
with similar purchasing behaviors or demographics.
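The following is a minimal sketch of a K-NN classifier in scikit-learn on small placeholder two-feature data, in the spirit of Fig 3.1; the points and labels are made up purely for illustration.

# Sketch: K-NN classification on a small two-feature dataset
# (placeholder points, in the spirit of Fig 3.1).
from sklearn.neighbors import KNeighborsClassifier

X = [[1, 1], [2, 1], [1, 2],      # class 0 points
     [6, 5], [7, 6], [6, 7]]      # class 1 points
y = [0, 0, 0, 1, 1, 1]

# The only real hyperparameters: k and the distance metric.
knn = KNeighborsClassifier(n_neighbors=3, metric="euclidean")
knn.fit(X, y)                      # "training" just stores the points

print(knn.predict([[2, 2], [6, 6]]))   # -> [0 1]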

3.2 Random forest:

Random Forest algorithm is a powerful tree learning technique in Machine Learning. It works by
creating a number of Decision Trees during the training phase. Each tree is constructed using a
random subset of the data set to measure a random subset of features in each partition. This
randomness introduces variability among individual trees, reducing the risk of overfitting
and improving overall prediction performance.

In prediction, the algorithm aggregates the results of all trees, either by voting (for classification
tasks) or by averaging (for regression tasks). This collaborative decision-making process, supported
by multiple trees and their individual insights, provides more stable and precise results. Random
forests are widely used for classification and regression tasks and are known for their ability to
handle complex data, reduce overfitting, and provide reliable predictions in different environments.

Fig 3.2 Random Forest

o Ensemble of Decision Trees: Random Forest leverages the power of ensemble


learning by constructing an army of Decision Trees. These trees are like individual
experts, each specializing in a particular aspect of the data. Importantly, they operate

independently, minimizing the risk of the model being overly influenced by the nuances
of a single tree.
o Random Feature Selection: To ensure that each decision tree in the ensemble brings a
unique perspective, Random Forest employs random feature selection. During the
training of each tree, a random subset of features is chosen. This randomness ensures
that each tree focuses on different aspects of the data, fostering a diverse set of
predictors within the ensemble.
o Bootstrap Aggregating or Bagging: The technique of bagging is a cornerstone of
Random Forest’s training strategy which involves creating multiple bootstrap samples
from the original dataset, allowing instances to be sampled with replacement. This
results in different subsets of data for each decision tree, introducing variability in the
training process and making the model more robust.
o Decision Making and Voting: When it comes to making predictions, each decision
tree in the Random Forest casts its vote. For classification tasks, the final prediction is
determined by the mode (most frequent prediction) across all the trees. In regression
tasks, the average of the individual tree predictions is taken (this bagging-and-voting
process is illustrated in the sketch below).
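The bagging-and-voting mechanics described above can be illustrated with a small didactic sketch built from individual decision trees; in practice, RandomForestClassifier performs all of these steps internally, so this only shows what happens under the hood on synthetic data.

# Didactic sketch of bagging + majority voting with individual decision trees.
# In practice, RandomForestClassifier does all of this internally.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
rng = np.random.default_rng(0)

trees = []
for _ in range(25):
    # Bootstrap sample: draw rows with replacement (bagging).
    idx = rng.integers(0, len(X), size=len(X))
    # Random feature selection per tree via max_features.
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=0)
    tree.fit(X[idx], y[idx])
    trees.append(tree)

# Each tree votes; the majority class becomes the ensemble prediction.
votes = np.array([t.predict(X[:5]) for t in trees])        # shape (25, 5)
majority = (votes.mean(axis=0) >= 0.5).astype(int)
print("Ensemble prediction for first 5 samples:", majority)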

Advantages:

o High Predictive Accuracy: Imagine Random Forest as a team of decision-making


wizards. Each wizard (decision tree) looks at a part of the problem, and together, they
weave their insights into a powerful prediction tapestry. This teamwork often results in
a more accurate model than what a single wizard could achieve.
o Resistance to Overfitting: Random Forest is like a cool-headed mentor guiding its
apprentices (decision trees). Instead of letting each apprentice memorize every detail of
their training, it encourages a more well-rounded understanding.
o Large Datasets Handling: Dealing with a mountain of data? Random Forest tackles it
like a seasoned explorer with a team of helpers (decision trees). Each helper takes on a
part of the dataset, ensuring that the expedition is not only thorough but also
surprisingly quick.

Applications:

o Finance Wizard: Imagine Random Forest as our financial superhero, diving into the
world of credit scoring. Its mission? To determine if you’re a credit superhero or, well,
not so much. With a knack for handling financial data and sidestepping overfitting
issues, it’s like having a guardian angel for robust risk assessments.
o Health Detective: In healthcare, Random Forest turns into a medical Sherlock Holmes.
Armed with the ability to decode medical jargon, patient records, and test results, it’s
not just predicting outcomes; it’s practically assisting doctors in solving the mysteries
of patient health.
o Environmental Guardian: Out in nature, Random Forest transforms into an
environmental superhero. With the power to decipher satellite images and brave noisy
data, it becomes the go-to hero for tasks like tracking land cover changes and
safeguarding against potential deforestation, standing as the protector of our green
spaces.

3.3 Architecture diagram:

Fig 3.3 Architecture Design

Data flows from various sources into the staging area, where it is cleaned. The ETL
process then extracts, transforms, and loads the data into the data warehouse.

Data is then organized into data marts for specific business functions.

Business intelligence tools use the data marts to generate insights. Machine learning
models are also trained on data from the data marts to provide predictive insights. Both
business intelligence and predictive insights feed into decision-making processes,
improving the organization's strategic and operational effectiveness.

3.4 UML Diagrams

UML stands for Unified Modeling Language. UML is a standardized general-purpose modeling
language in the field of object-oriented software engineering. The standard is managed, and was
created, by the Object Management Group. The goal is for UML to become a common language for
creating models of object-oriented computer software. In its current form, UML comprises two major
components: a meta-model and a notation. In the future, some form of method or process may also be
added to, or associated with, UML.

The Unified Modeling Language is a standard language for specifying, visualizing, constructing,
and documenting the artifacts of a software system, as well as for business modeling and other non-
software systems. UML represents a collection of best engineering practices that have proven
successful in modeling large and complex systems. UML is a very important part of developing
object-oriented software and the software development process. UML uses mostly graphical
notations to express the design of software projects.

USE CASE DIAGRAM: A use case diagram in the Unified Modeling Language (UML) is a
type of behavioral diagram defined by and created from a use-case analysis. Its purpose is to present
a graphical overview of the functionality provided by a system in terms of actors, their goals
(represented as use cases), and any dependencies between those use cases. The main purpose of a use
case diagram is to show which system functions are performed for which actor. The roles of the
actors in the system can also be depicted.

Fig 3.3.1 Use Case Diagram

Class Diagram:
In software engineering, a class diagram in the Unified Modeling Language (UML) is a type of static
structure diagram that describes the structure of a system by showing the system's classes, their
attributes, operations (or methods), and the relationships among the classes. It explains which class
contains information.

Fig 3.3.2 Class Diagram

Sequence Diagram:
A sequence diagram in Unified Modeling Language (UML) is a kind of interaction diagram that
shows how processes operate with one another and in what order. It is a construct of a Message
Sequence Chart. Sequence diagrams are sometimes called event diagrams, event scenarios, and
timing diagrams.

Fig 3.3.3 Sequence Diagram

CHAPTER – 4

CODE IMPLEMENTATION
Importing Libraries
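The original report presented the implementation as notebook screenshots, which are not reproduced here. The following is a hedged sketch of the kind of imports such a notebook typically begins with, based on the libraries listed in Chapter 2 rather than on the original code.

# Typical imports for this kind of notebook (a sketch based on the libraries
# listed in Chapter 2, not a copy of the original screenshots).
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix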

Exploratory Data Analysis and Data Processing:
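Again, this is a hedged sketch of typical exploratory analysis and preprocessing steps for EMSCAD-style data, continuing from the imports above; the file name and column names are assumptions, not the original notebook contents.

# Sketch of typical EDA / preprocessing steps for EMSCAD-style data.
# The file name and column names are assumptions, not the original notebook.
df = pd.read_csv("fake_job_postings.csv")     # hypothetical file name

print(df.shape)                               # roughly 18,000 rows expected
print(df["fraudulent"].value_counts())        # strong class imbalance
print(df.isnull().sum().sort_values(ascending=False).head(10))

# Keep a small set of categorical attributes and encode them as integers.
cols = ["telecommuting", "has_company_logo", "has_questions",
        "employment_type", "required_experience", "required_education",
        "industry"]
data = df[cols].fillna("missing").astype("category").apply(lambda c: c.cat.codes)

X_train, X_test, y_train, y_test = train_test_split(
    data, df["fraudulent"], test_size=0.2, stratify=df["fraudulent"],
    random_state=42)

rf = RandomForestClassifier(n_estimators=200, random_state=42)
rf.fit(X_train, y_train)
print(classification_report(y_test, rf.predict(X_test)))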

CONCLUSION

In this project, we aimed to develop a predictive model to identify fake job postings using a Random
Forest classifier. Through extensive data preprocessing, feature engineering, and model tuning, we
successfully created a classifier that can distinguish between real and fake job posts with a high
degree of accuracy. The key findings from our analysis include:

o Data Insights: Analysis of the dataset revealed certain key features such as job
description length, presence of company logos, and keywords in the job title and
description that are indicative of fraudulent postings.
o Model Performance: The Random Forest classifier demonstrated strong performance,
achieving an accuracy of X%, precision of Y%, recall of Z%, and an F1-score of W%.
These metrics indicate that the model is effective in identifying fake job posts while
maintaining a balance between precision and recall.

FUTURE SCOPE
Future work could focus on several aspects to build upon the current project:

o Real-Time Detection: Developing a real-time fake job post detection system that can
be integrated into job boards and recruitment platforms.
o Explainability: Enhancing the model's explainability using tools like SHAP (SHapley
Additive exPlanations) to provide insights into which features are most influential in
predicting fake job posts (see the sketch after this list).
o Extended Evaluation: Conducting extensive evaluation using different datasets to
ensure the model's generalizability across various job markets and regions.
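As referenced in the explainability item above, the following is a minimal sketch of how SHAP values might be computed for the fitted forest; it assumes the shap package is installed and that `model` and `X_test` are the trained classifier and held-out features from earlier steps.

# Sketch: explaining the fitted Random Forest with SHAP
# (assumes: pip install shap; `model` and `X_test` come from earlier steps).
import shap

explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_test)

# Summary of which features most influence the predictions.
shap.summary_plot(shap_values, X_test)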

In conclusion, the Random Forest classifier has proven to be a robust tool for predicting fake job
postings. With further refinements and enhancements, it holds the potential to significantly mitigate
the issue of fraudulent job postings, thereby enhancing the integrity of job boards and protecting job
seekers.

REFERENCES

1. S. Vidros and C. Kolias, "Automatic Detection of Online Recruitment Frauds: Characteristics,
Methods, and a Public Dataset" (PDF, researchgate.net).

2. B. Alghamdi, "An Intelligent Model for Online Recruitment Fraud Detection" (PDF,
researchgate.net).

3. "Fake Job Detection with Machine Learning: A Comparison" (PDF, researchgate.net).
