Report
On
Submitted
By
Shantanu Umrani
PRN: 1032212141
ABSTRACT
Many software companies see a noticeable trend of employees resigning from their positions
for various reasons. When skilled and valuable employees depart, organizations face significant
challenges in maintaining their operations. It is therefore crucial for companies to
proactively identify and assess the causes of employee turnover and to formulate suitable
strategies to address it. This project uses IBM's HR Analytics Employee Attrition
and Performance dataset. Rows with missing values were dropped before the data analysis.
ANOVA and Chi-Square tests were carried out during statistical analysis. Machine learning
algorithms, namely Logistic Regression (92%), Random Forest (89%), Support Vector Machine
(93%), XGBoost (100%), CatBoost (98%), AdaBoost (90%) and LightGBM (100%), were applied to
understand, manage, and mitigate employee attrition. Model performance was compared on an
ROC curve, which plots the True-Positive Rate against the False-Positive Rate.
CHAPTER 1: INTRODUCTION
1.1 INTRODUCTION
Companies, both in India and other countries, face a formidable challenge in recruiting and
retaining top talent, all while dealing with talent loss through attrition, whether due to industry
downturns or voluntary turnover. Losing employees not only results in performance setbacks
but also has long-term negative impacts on companies. This includes potential disruptions to
productivity, work team dynamics, and social goodwill.
The success and competitiveness of any organization are highly dependent on its workforce,
with employees serving as the essential backbone of the company. This study aims to identify
employee attitudes, pinpoint the factors that contribute to their dissatisfaction within the
organization, and understand the reasons behind their decisions to seek alternative
employment opportunities.
By identifying and assessing the levels of employee attitudes, management can gain valuable
insights into areas that require improvement and take necessary action to reduce attrition rates.
This proactive approach is essential for sustaining a productive and harmonious work
environment while enhancing the organization's overall performance and long-term success.
1.2 OBJECTIVE
1. Assess the degree of employee satisfaction regarding their job and working environment.
2. Identify the elements that contribute to employee dissatisfaction with the company's
policies and guidelines.
3. Pinpoint the areas where the company is falling short or facing shortcomings.
5. Develop strategies and methods to minimize attrition rates within the organization.
1.3 MOTIVATION
This project originates from the possibility of enhancing employee contentment, cutting down
expenses, elevating organizational efficiency, and fostering a favorable workplace atmosphere.
It represents a chance to leverage data and analytics to enact substantial improvements that are
advantageous to both employees and the entire organization.
CHAPTER 2: SYSTEM ARCHITECTURE
The methodology for IBM HR Analytics Employee Attrition and Performance Prediction is as follows:
Data Cleaning: Any missing values in the dataset are dropped using the dropna() method.
Data Visualization: Matplotlib and Seaborn libraries are used to visualize the data.
Statistical Analysis: The ANOVA test is performed to analyze the importance of the numerical
features in employee attrition, while the Chi-Square test is performed to analyze the
importance of the categorical features.
Data Preprocessing: The target variable 'Attrition' is mapped to binary values (1 for 'Yes'
and 0 for 'No'). Selected features are extracted from the dataset and one-hot encoded using
the get_dummies() function.
Splitting the Dataset: The dataset is split into training and testing sets using the
train_test_split() method from scikit-learn.
Implementing Machine Learning Algorithms: Logistic Regression, XGBoost, CatBoost,
AdaBoost, LightGBM, Decision Tree, and Random Forest classifiers are initialized and
trained using the training data.
Model Evaluation: The accuracy score and confusion matrix are computed to evaluate the
performance of each algorithm on the testing data.
Results: The results, including the accuracy and confusion matrix, are printed for each
algorithm.
Model Performance Comparison: The hvPlot library is used to visualize the ROC curve
diagram comparing the performance of all models used.
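The cleaning, preprocessing, and splitting steps above can be sketched roughly as follows. This is a minimal illustration rather than the project's actual code: the tiny DataFrame is synthetic, the column subset is chosen only for brevity, and the random_state value is an assumption.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Tiny synthetic stand-in for the IBM HR data (all values are made up).
employee_data = pd.DataFrame({
    "Age": [41, 49, 37, 33, 27, 32, 59, 30],
    "MonthlyIncome": [5993, 5130, 2090, 2909, 3468, 3068, 2670, None],
    "OverTime": ["Yes", "No", "Yes", "Yes", "No", "No", "Yes", "No"],
    "Attrition": ["Yes", "No", "Yes", "No", "No", "No", "No", "No"],
})

# Data cleaning: drop rows that contain missing values.
employee_data = employee_data.dropna()

# Data preprocessing: map the target to 1/0 and one-hot encode the features.
y = employee_data["Attrition"].map({"Yes": 1, "No": 0})
X = pd.get_dummies(employee_data.drop(columns="Attrition"))

# Splitting: hold out 30% of the rows for testing (the seed is an assumption).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)
print(X.columns.tolist())
print(len(X_train), len(X_test))
```

Note that get_dummies() leaves numeric columns unchanged and expands only the text column, here into OverTime_No and OverTime_Yes.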
2.2 TECHNOLOGY USED
2.3 DATASET
This dataset presents an employee survey from IBM and indicates whether or not each employee
left the company (attrition). It contains the data of 1,470 employees, covering employee
satisfaction, income, seniority and some demographics. Given the limited size of the dataset,
the model should only be expected to provide a modest improvement in identifying attrition
over a random allocation of attrition probability. To use a matrix structure, the data was
arranged to reflect the following features.
Dataset Description:
The features in the dataset are described below:
Feature Name | Description | Min | Max | Std. Dev.
EmployeeNumber | A unique number assigned to each current and former employee | 1 | 2068 | 602.02
EnvironmentSatisfaction | An individual's feelings about the work environment and organization culture | 1 | 4 | 1.09
HourlyRate | The amount of money paid to an employee for every hour worked | 30 | 100 | 20.32
JobInvolvement | The degree to which a job is central to a person's identity | 1 | 4 | 0.71
JobLevel | Job levels are categories of authority in an organization | 1 | 5 | 1.10
JobSatisfaction | Job satisfaction happens when an employee feels he or she has job stability | 1 | 4 | 1.10
MonthlyIncome | Gross monthly income, the amount of income an employee earns in one month | 1009 | 19999 | 4707.95
MonthlyRate | If a monthly rate is set, employees should be paid in exchange for the normal hours of work of a full-time worker | 2094 | 26999 | 7117.78
NumCompaniesWorked | Number of other companies the employee previously worked for | 0 | 9 | 2.49
PercentSalaryHike | The percentage by which an employee's salary is increased | 11 | 25 | 3.65
PerformanceRating | Rating means gauging and comparing performance | 3 | 4 | 0.36
RelationshipSatisfaction | The level of satisfaction with the employer-employee relationship | 1 | 4 | 1.08
Table 1: Shows the numerical features along with their statistical parameters.
Feature Name | Feature Description | Number of Categorical Values
Attrition | Attrition in business describes a gradual but deliberate reduction in staff numbers that occurs as employees retire or resign. [NOTE: Target Variable] (0=no, 1=yes) | 2
BusinessTravel | Business travel is travel undertaken for work or business purposes, as opposed to other types of travel (1=No Travel, 2=Travel Frequently, 3=Travel Rarely) | 3
Department | Consists of three departments that contribute to the company's overall mission (1=HR, 2=R&D, 3=Sales) | 3
EducationField | Education field of the employee (1=HR, 2=Life Sciences, 3=Marketing, 4=Medical Sciences, 5=Others, 6=Technical) | 6
Gender | Gender of the employee (1=Female, 2=Male) | 2
MaritalStatus | Marital status of the employee (1=divorced, 2=married, 3=single) | 3
Over18 | (1=Yes, 2=No) | 2
Overtime | (1=No, 2=Yes) | 2
Table 2: Shows the categorical features used in the attrition analysis model along with their number of categories.
CHAPTER 3: IMPLEMENTATION
3.1 DATA EXPLORATION AND PROCESSING
Compute Size:
In the first step, we try to understand the dataset's size and structure at a glance by
computing its size.
The code reveals that the "employee_data" DataFrame contains 1,470 rows and 35 columns,
providing a quick overview of its size and structure.
Drop Columns:
In this step, we notice that 'EmployeeCount', 'Over18', and 'StandardHours' each have only one
unique value, while 'EmployeeNumber' has 1,470 unique values, one per row. These features are
not useful for the analysis, so we drop those columns.
The code reveals that the "employee_data" DataFrame now contains 1,470 rows and 31
columns after these four columns are dropped.
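A minimal sketch of these two steps is shown below. The small DataFrame is synthetic and stands in for the real 1,470-row data; only the column names mentioned in the text are taken from the report, and detecting constant columns with nunique() is one possible approach, not necessarily the one used in the project.

```python
import pandas as pd

# Small synthetic frame with the same kinds of columns (values are made up).
employee_data = pd.DataFrame({
    "EmployeeNumber": [1, 2, 4, 5],
    "EmployeeCount": [1, 1, 1, 1],
    "Over18": ["Y", "Y", "Y", "Y"],
    "StandardHours": [80, 80, 80, 80],
    "Age": [41, 49, 37, 33],
    "Attrition": ["Yes", "No", "Yes", "No"],
})
print(employee_data.shape)  # (rows, columns) before dropping

# Columns with a single unique value carry no information for a model.
n_unique = employee_data.nunique()
constant_cols = n_unique[n_unique == 1].index.tolist()

# An identifier column has one value per row and is dropped as well.
id_cols = ["EmployeeNumber"]

employee_data = employee_data.drop(columns=constant_cols + id_cols)
print(employee_data.shape)  # (rows, columns) after dropping
```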
3.2 DATA VISUALIZATION
By analyzing employee data, we can identify factors that contribute to employee attrition, such as
job satisfaction, compensation, and work-life balance. This information can be used to develop
strategies to retain top talent and reduce turnover rates. HR analytics can help identify high-
performing employees by analyzing data related to performance metrics, such as productivity,
quality, and customer satisfaction. This information can be used to develop strategies to retain top
talent and improve overall organizational performance.
Inference:
1. The employee attrition rate of this organization is 16.12%.
2. According to experts in the field of Human Resources, an attrition rate of 4% to 6% is
normal for an organization.
3. We can therefore say that the attrition rate of the organization is at a dangerous level.
4. The organization should take measures to reduce the attrition rate.
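An attrition rate like the one quoted above can be computed as the share of "Yes" labels in the target column. The sketch below uses ten synthetic labels, so the resulting percentage is illustrative only and is not the report's figure.

```python
import pandas as pd

# Synthetic labels only; the real dataset has 1,470 rows.
attrition = pd.Series(
    ["Yes", "No", "No", "No", "Yes", "No", "No", "No", "No", "No"]
)

# Share of each class in percent; the "Yes" share is the attrition rate.
class_share = attrition.value_counts(normalize=True).mul(100).round(2)
attrition_rate = class_share["Yes"]
print(f"Attrition rate: {attrition_rate:.2f}%")
```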
2] ANALYZING EMPLOYEE ATTRITION BY GENDER.
Inference:
1. The proportion of male employees in the organization is more than 20% higher than that of
female employees.
2. More male employees than female employees are leaving the organization.
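Per-category observations of this kind rest on two quantities: the share of employees in each category and the attrition rate within each category. A minimal sketch with synthetic data follows; the counts are invented and do not reproduce the report's figures.

```python
import pandas as pd

# Synthetic example: 6 male and 4 female employees (values are made up).
df = pd.DataFrame({
    "Gender": ["Male"] * 6 + ["Female"] * 4,
    "Attrition": ["Yes", "Yes", "No", "No", "No", "No",
                  "Yes", "No", "No", "No"],
})

# Share of each gender in the workforce, in percent.
gender_share = df["Gender"].value_counts(normalize=True).mul(100)

# Attrition rate within each gender, in percent.
left = df["Attrition"].eq("Yes")
rate_by_gender = left.groupby(df["Gender"]).mean().mul(100)

print(gender_share.round(2))
print(rate_by_gender.round(2))
```

The same pattern applies to any categorical feature, such as Department or JobRole, by swapping the grouping column.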
Inference:
Inference:
5] ANALYZING EMPLOYEE ATTRITION BY DEPARTMENT.
Inference:
Inference:
1. The numbers of employees with average DailyRate and with high DailyRate are approximately equal.
2. However, the attrition rate of employees with average DailyRate is much higher than that of
employees with high DailyRate.
3. The attrition rate is also high for employees with low DailyRate.
4. Employees who do not receive a high DailyRate are the ones who mostly leave the organization.
Inference:
1. The organization has employees living both close to and far from the office.
2. The feature Distance from Home does not follow any clear trend in attrition rate.
3. Employees living close to the organization leave more often than employees living far
from the organization.
8] ANALYZING EMPLOYEE ATTRITION BY EDUCATION.
Inference:
1. Most employees in the organization hold a Bachelor's or Master's degree as their education
qualification.
2. Very few employees in the organization hold a Doctorate degree as their education qualification.
3. We can observe that the attrition rate tends to decrease as the education qualification increases.
1. Most employees come from the Life Sciences or Medical education fields.
2. Very few employees come from the Human Resources education field.
3. Education fields such as Human Resources, Marketing, and Technical have very high
attrition rates.
4. This may be due to workload, because these education fields have far fewer employees
than the education fields with lower attrition rates.
Inference:
1. Most employees rated their environment satisfaction as High or Very High.
2. Although environment satisfaction is high, attrition in this environment is still very
high.
3. The attrition rate increases as the level of environment satisfaction increases.
11] ANALYZING EMPLOYEE ATTRITION BY JOB ROLES.
Inference:
1. Most employees work as Sales Executive, Research Scientist or Laboratory
Technician in this organization.
2. The highest attrition rates are among Research Directors, Sales Executives, and Research
Scientists.
12] ANALYZING EMPLOYEE ATTRITION BY JOB LEVEL.
Inference:
1. Most of the employees in the organization are at Entry Level or Junior Level.
2. Highest Attrition is at the Entry Level.
3. As the level increases the attrition rate decreases.
Inference:
1. Most employees rated their job satisfaction as high or very high.
2. Employees who rated their job satisfaction as low are the ones who mostly leave the organization.
3. All job satisfaction categories have a high attrition rate.
Inference:
1. Most employees in the organization are married.
2. The attrition rate is very high for employees who are divorced.
3. The attrition rate is low for employees who are single.
15] ANALYZING EMPLOYEE ATTRITION BY MONTHLY INCOME.
Inference:
1. Most employees in the organization are paid less than 10,000 per month.
2. The average monthly income of employees who left is lower than that of
employees who are still working.
3. As the monthly income increases, the attrition decreases.
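For a numerical feature such as MonthlyIncome, a trend like this can be checked by binning the values and computing the attrition rate per bin. The sketch below is illustrative only: the data is synthetic, and the bin edges and labels are assumptions rather than the ones used for the report's plots.

```python
import pandas as pd

# Synthetic incomes and outcomes (values are made up).
df = pd.DataFrame({
    "MonthlyIncome": [1500, 2500, 4000, 6000, 9000, 12000, 15000, 18000],
    "Attrition": ["Yes", "Yes", "No", "Yes", "No", "No", "No", "No"],
})

# Assumed income bands; pd.cut uses right-closed intervals by default.
bins = [0, 5000, 10000, 20000]
labels = ["up to 5k", "5k-10k", "10k-20k"]
df["IncomeBand"] = pd.cut(df["MonthlyIncome"], bins=bins, labels=labels)

# Attrition rate within each income band, in percent.
left = df["Attrition"].eq("Yes")
rate_by_band = left.groupby(df["IncomeBand"], observed=False).mean().mul(100)
print(rate_by_band.round(2))

# Mean income of leavers versus stayers.
mean_income = df.groupby("Attrition")["MonthlyIncome"].mean()
print(mean_income)
```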
Inference:
Inference:
1. Most employees do not work overtime.
2. The OverTime feature has a very high class imbalance, so we cannot draw
any meaningful insights from it.
18] ANALYZING EMPLOYEE ATTRITION BY SALARY HIKE.
Inference:
1. Very few employees receive a high percent salary hike.
2. As the percent salary hike increases, the attrition rate decreases.
Inference:
Inference:
1. Most employees have high or very high relationship satisfaction.
2. Although relationship satisfaction is high, the attrition rate is also high.
3. All categories of this feature have a high attrition rate.
21] ANALYZING EMPLOYEE ATTRITION BY WORK LIFE BALANCE.
Inference:
1. More than 60% of employees have a better work-life balance.
2. Employees with a bad work-life balance have a very high attrition rate.
3. The other categories also have a high attrition rate.
Inference:
1. Most employees have a total of 5 to 10 years of working experience, but their
attrition rate is also very high.
2. Employees with less than 10 years of working experience have a high attrition rate.
3. Employees with more than 10 years of working experience have a lower attrition rate.
Inference:
24] ANALYZING EMPLOYEE ATTRITION BY YEARS IN CURRENT ROLE.
Inference:
1. Most employees have worked for 2 to 10 years in the same role in the organization.
2. Very few employees have worked for less than 1 year or more than 10 years in the same
role.
3. Employees who have worked up to 2 years in the same role have a very high attrition rate.
4. Employees who have worked for 10+ years in the same role have a low attrition rate.
Inference:
Inference:
1. Almost 51% of employees have worked for 2-5 years with the same manager.
2. Almost 38% of employees have worked for 5-10 years with the same manager.
3. Employees who have worked for 10+ years with the same manager have a very low
attrition rate.
4. The other categories have a high attrition rate.
3.3 STATISTICAL ANALYSIS
Statistical analysis plays a crucial role in HR analytics by helping organizations make informed
decisions about their human resources and workforce management. It enables evidence-based
decision-making, enhances workforce planning strategies, and fosters a deeper understanding of the
organization's human capital dynamics.
1] Perform ANOVA Test: The ANOVA test is used to analyze the impact of different numerical features
on a categorical response feature.
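As a rough illustration of this test, the sketch below runs a one-way ANOVA per numerical feature using scipy.stats.f_oneway, grouping the values by attrition class. The data is synthetic, the helper name anova_by_attrition is hypothetical, and the project may have used a different routine to obtain its F-scores and p-values.

```python
import pandas as pd
from scipy.stats import f_oneway

# Synthetic data: income differs strongly between classes, hourly rate does not.
df = pd.DataFrame({
    "Attrition": ["Yes"] * 4 + ["No"] * 4,
    "MonthlyIncome": [2000, 2500, 2200, 2700, 6000, 6500, 5800, 7000],
    "HourlyRate": [60, 45, 80, 55, 62, 47, 78, 53],
})

def anova_by_attrition(data, feature):
    """Return (F-score, p-value) for one numerical feature across attrition classes."""
    groups = [g[feature].to_numpy() for _, g in data.groupby("Attrition")]
    f_stat, p_value = f_oneway(*groups)
    return f_stat, p_value

anova_results = {
    col: anova_by_attrition(df, col) for col in ["MonthlyIncome", "HourlyRate"]
}
for col, (f_stat, p_value) in anova_results.items():
    print(f"{col}: F = {f_stat:.3f}, p = {p_value:.4f}")
```

A high F-score with a low p-value (commonly below 0.05) suggests that the feature's mean differs between employees who left and those who stayed.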
Inference:
The following features show a strong association with attrition, as indicated by their high
F-scores and very low p-values.
1. Age
2. DailyRate
3. HourlyRate
4. MonthlyIncome
5. MonthlyRate
6. NumCompaniesWorked
7. PercentSalaryHike
8. TotalWorkingYears
9. TrainingTimesLastYear
10. YearsAtCompany
11. YearsWithCurrManager
The following features do not show a significant relationship with attrition, given their
moderate F-scores and extremely high p-values.
1. DistanceFromHome
2. StockOptionLevel
3. YearsInCurrentRole
4. YearsSinceLastPromotion
It is important for the organization to pay attention to the identified significant features and
consider them when implementing strategies to reduce attrition rates.
2] Perform CHI-SQUARE Test: The Chi-Square test is used to analyze the association between the
different categorical features and attrition.
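A test of this kind can be sketched with a contingency table and scipy.stats.chi2_contingency, as below. The data is synthetic and the helper name chi_square_vs_attrition is hypothetical; this is one plausible way to run the test, not necessarily the project's code.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Synthetic data: overtime is tied to attrition here, gender is not.
df = pd.DataFrame({
    "OverTime": ["Yes"] * 20 + ["No"] * 20,
    "Attrition": ["Yes"] * 15 + ["No"] * 5 + ["Yes"] * 4 + ["No"] * 16,
    "Gender": ["Male", "Female"] * 20,
})

def chi_square_vs_attrition(data, feature):
    """Return (chi-square statistic, p-value) for one categorical feature."""
    table = pd.crosstab(data[feature], data["Attrition"])
    chi2, p_value, dof, expected = chi2_contingency(table)
    return chi2, p_value

chi2_results = {
    col: chi_square_vs_attrition(df, col) for col in ["OverTime", "Gender"]
}
for col, (chi2, p_value) in chi2_results.items():
    print(f"{col}: chi2 = {chi2:.3f}, p = {p_value:.4f}")
```

A p-value below the chosen significance level (commonly 0.05) indicates a statistically significant association between the feature and attrition.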
Inference:
The following features showed statistically significant associations with employee attrition:
1. Department
2. EducationField
3. EnvironmentSatisfaction
4. JobInvolvement
5. JobLevel
6. JobRole
7. JobSatisfaction
8. MaritalStatus
9. OverTime
10. WorkLifeBalance
The following features did not show statistically significant associations with attrition.
1. Gender
2. Education
3. PerformanceRating
4. RelationshipSatisfaction
It is important for the organization to pay attention to the identified significant features and
consider them when implementing strategies to reduce attrition rates.
3.4 DATA MODELING
Data modeling plays a significant role in HR analytics when integrating machine learning
techniques. Machine learning algorithms leverage data models to make predictions,
classifications, and recommendations based on patterns and relationships found in the
HR data.
The dataset was split into 70% for training and 30% for testing, with Attrition as the
target feature.
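The split-train-evaluate loop can be sketched as follows. To keep the example self-contained it uses a synthetic imbalanced dataset from scikit-learn and only two of the models (Logistic Regression and Random Forest); the stratified split, seeds, and hyperparameters are assumptions, and the boosting libraries are omitted to avoid extra dependencies.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced binary data standing in for the encoded HR features.
X, y = make_classification(
    n_samples=500, n_features=10, weights=[0.84, 0.16], random_state=0
)

# 70% training / 30% testing; stratifying on the target is an assumption here.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    results[name] = {
        "train_accuracy": accuracy_score(y_train, model.predict(X_train)),
        "test_accuracy": accuracy_score(y_test, y_pred),
        "confusion_matrix": confusion_matrix(y_test, y_pred),
    }
    print(name, results[name]["train_accuracy"], results[name]["test_accuracy"])
    print(results[name]["confusion_matrix"])
```

Reporting training and testing accuracy side by side makes it easier to spot a model that fits the training data almost perfectly but generalizes less well.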
2. Random Forest Model.
3. Support Vector Machine Model.
Fig.3.4.3: Training and Testing results by using Support Vector Machine Model.
4. XGBoost Model.
5. LightGBM Model.
7. AdaBoost Model.
ROC Curve: An ROC curve (receiver operating characteristic curve) is a graph showing the performance
of a classification model at all classification thresholds.
The graph displays one line per machine learning model, each in a different color. The
models are Random Forest, XGBoost, Logistic Regression, Support Vector Machine, LightGBM,
CatBoost, and AdaBoost. Each model is labeled with its Area Under Curve (AUC) value, which
summarizes its performance. The graph presents the True Positive Rate on the y-axis and the
False Positive Rate on the x-axis, both ranging from 0.0 to 1.0. Models such as Random Forest,
XGBoost, LightGBM, CatBoost, and AdaBoost perform better than Support Vector Machine and
Logistic Regression.
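The points of such a curve and the AUC value can be computed with scikit-learn as sketched below. The data is synthetic, only two models are shown, and the plotting step is left out; the resulting fpr and tpr arrays can be passed to any plotting library.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic binary data standing in for the encoded HR features.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

roc_data = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    # Probability of the positive class is used as the ranking score.
    scores = model.predict_proba(X_test)[:, 1]
    fpr, tpr, thresholds = roc_curve(y_test, scores)
    roc_data[name] = {
        "fpr": fpr,
        "tpr": tpr,
        "auc": roc_auc_score(y_test, scores),
    }
    print(f"{name}: AUC = {roc_data[name]['auc']:.3f}")
```

An AUC of 0.5 corresponds to random guessing and 1.0 to perfect separation, so curves closer to the top-left corner indicate better models.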
CHAPTER 4: PROJECT PLANNING
4.1 PROJECT PLANNING
CHAPTER 5
5.1 CONCLUSION
Building on this HR analytics project on employee attrition, future work on sentiment
analysis involves applying sentiment analysis to employee feedback data to gain insights,
monitoring sentiment in real time, categorizing sentiments by topic, and analyzing
historical sentiment trends. In terms of dashboard development, there is a need to
create interactive, predictive, and benchmarking-enabled dashboards with custom alerts,
engagement metrics, and mobile accessibility. Additionally, user training and support, data
privacy, feedback integration, and performance monitoring are crucial to ensuring that the
dashboard supports data-driven HR decisions and actions while adhering to privacy
regulations.