HR Analytics Using Logistic Regression
✏ Contents of notebook:
1. Importing Libraries
2. Exploratory Data Analysis
3. Basic Data Cleaning
4. Data Visualization
5. Data Preprocessing
6. Model Building
Importing Libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')
Importing Dataset
hr = pd.read_csv('HR_comma_sep.csv')
hr.head()
hr.size
149990
hr.describe()
        Department  salary
unique          10       3
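The numeric `describe()` skips the two object columns; passing `include='object'` summarises them instead, which is where the `unique` counts (10 departments, 3 salary levels) come from. A minimal sketch on a hypothetical toy frame:

```python
import pandas as pd

# Toy frame mirroring the dataset's two non-numeric columns (values are illustrative)
df = pd.DataFrame({
    'Department': ['sales', 'technical', 'sales', 'IT'],
    'salary': ['low', 'low', 'medium', 'high'],
})

# include='object' summarises categorical columns: count, unique, top, freq
summary = df.describe(include='object')
print(summary)
```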
hr.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14999 entries, 0 to 14998
Data columns (total 10 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 satisfaction_level 14999 non-null float64
1 last_evaluation 14999 non-null float64
2 number_project 14999 non-null int64
3 average_montly_hours 14999 non-null int64
4 time_spend_company 14999 non-null int64
5 Work_accident 14999 non-null int64
6 left 14999 non-null int64
7 promotion_last_5years 14999 non-null int64
8 Department 14999 non-null object
9 salary 14999 non-null object
dtypes: float64(2), int64(6), object(2)
memory usage: 1.1+ MB
There are no null values in this dataset, so we don't have to perform any cleaning for missing data.
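The null check behind this conclusion can be made explicit with `isnull().sum()`; a sketch on a hypothetical miniature frame (one missing value added so a non-zero count is visible):

```python
import pandas as pd

# Hypothetical miniature frame standing in for the HR dataset
df = pd.DataFrame({
    'satisfaction_level': [0.38, 0.80, None],
    'salary': ['low', 'medium', 'high'],
})

# isnull().sum() counts missing values per column; a column of all
# zeros confirms that no null-handling is needed
missing = df.isnull().sum()
print(missing)
```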
hr['Department'].unique()
hr['salary'].unique()
Department
hr['Department'].value_counts()/len(hr)*100
sales 27.601840
technical 18.134542
support 14.860991
IT 8.180545
product_mng 6.013734
marketing 5.720381
RandD 5.247016
accounting 5.113674
hr 4.926995
management 4.200280
Name: Department, dtype: float64
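The same percentages can be obtained with `value_counts(normalize=True)`, which saves dividing by `len(hr)` by hand. A sketch on toy data:

```python
import pandas as pd

# Toy department column; normalize=True returns shares directly,
# equivalent to value_counts() / len(series)
dept = pd.Series(['sales', 'sales', 'technical', 'IT'])
pct = dept.value_counts(normalize=True) * 100
print(pct)
```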
Data Visualization
data = hr['Department'].value_counts()
pal = sns.color_palette('pastel')  # palette choice is illustrative
fig, ax = plt.subplots()
_, _, autotexts = ax.pie(data.values, labels=data.index, autopct="%.2f%%", colors=pal)
ax.set_title("Department")
plt.show()
hr['salary'].value_counts()/len(hr)*100
low 48.776585
medium 42.976198
high 8.247216
Name: salary, dtype: float64
data = hr['salary'].value_counts()
fig, ax = plt.subplots()
_, _, autotexts = ax.pie(data.values, labels=data.index, autopct="%.2f%%", colors=pal)
ax.set_title("Salary")
plt.show()
hr.left.value_counts()/len(hr)*100
0 76.191746
1 23.808254
Name: left, dtype: float64
hr['left'] = hr.left.astype('object')
data = hr['left'].value_counts()
fig, ax = plt.subplots()
_, _, autotexts = ax.pie(data.values, labels=data.index, autopct="%.2f%%", colors=pal[::-1])
ax.set_title("Leave")
plt.show()
ct = pd.crosstab(hr['Department'], hr['salary'])
ct.plot(kind='bar')
plt.title('Salary vs Department')
plt.xlabel('Department')
plt.ylabel('Count')
plt.show()
data = hr['left'].value_counts()
fig, ax = plt.subplots()
_, _, autotexts = ax.pie(data.values, labels=data.index, autopct="%.2f%%", colors=pal)
ax.set_title("Left")
plt.show()
We can see that almost 24% of employees leave the company.
[countplot: Department vs count]
The chart suggests that department has no major impact on employee retention.
[countplot: salary vs count]
We can clearly see that employees with higher salaries are less likely to leave the company.
[plot: satisfaction_level by left]
Satisfaction level clearly has a direct impact on an employee's chance of leaving.
[plot: last_evaluation by left]
The chart suggests that last_evaluation has no impact on employee retention.
[plot: number_project by left]
[plot: average_montly_hours by left]
The chart suggests that average_montly_hours has some impact on employee retention; the effect is not major, but we will include it in our analysis.
[plot: time_spend_company by left]
The bar chart above shows that employees with low time_spend_company are less likely to leave the company.
[countplot: Work_accident vs count]
The chart suggests that Work_accident has an impact on employee retention.
[countplot: promotion_last_5years vs count]
From the data analysis so far, we can conclude that the following variables should be used as independent variables in our model:
1. **Satisfaction Level**
2. **Average Monthly Hours**
3. **Promotion Last 5 Years**
4. **Salary**
5. **Work Accident**
Data Preprocessing
subdf = hr[['satisfaction_level','average_montly_hours','promotion_last_5years','Work_accident','salary']]
subdf.head()
salary_dummies = pd.get_dummies(hr['salary'], prefix='salary')
df_with_dummies = pd.concat([subdf, salary_dummies], axis='columns')
df_with_dummies.head()
df_with_dummies.drop(['salary','salary_low'],axis='columns',inplace=True)
df_with_dummies.head()
   satisfaction_level  average_montly_hours  promotion_last_5years  Work_accident  salary_high  salary_medium
0                0.38                   157                      0              0            0              0
1                0.80                   262                      0              0            0              1
2                0.11                   272                      0              0            0              1
3                0.72                   223                      0              0            0              0
4                0.37                   159                      0              0            0              0
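An alternative to the manual drop is `pd.get_dummies(..., drop_first=True)`; note that it drops the alphabetically first level (`'high'` for this column), so dropping `salary_low` explicitly, as above, is a deliberate choice of reference category. A sketch:

```python
import pandas as pd

# Toy salary column with the same three levels as the dataset
df = pd.DataFrame({'salary': ['low', 'medium', 'high', 'low']})

# drop_first removes the alphabetically first level ('high'),
# leaving salary_low and salary_medium as indicator columns
dummies = pd.get_dummies(df['salary'], prefix='salary', drop_first=True)
print(list(dummies.columns))
```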
X = df_with_dummies
X.head()
   satisfaction_level  average_montly_hours  promotion_last_5years  Work_accident  salary_high  salary_medium
0                0.38                   157                      0              0            0              0
1                0.80                   262                      0              0            0              1
2                0.11                   272                      0              0            0              1
3                0.72                   223                      0              0            0              0
4                0.37                   159                      0              0            0              0
y = hr['left'].astype(str)
y
0 1
1 1
2 1
3 1
4 1
..
14994 1
14995 1
14996 1
14997 1
14998 1
Name: left, Length: 14999, dtype: object
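The target was cast to `object` earlier for plotting; scikit-learn accepts string labels, but an integer target avoids surprises when computing means or correlations later. A sketch on a hypothetical stand-in series:

```python
import pandas as pd

# Hypothetical stand-in for hr['left'] after the astype('object') cast
left = pd.Series(['1', '1', '0', '1'], dtype=object)

# Converting back to int makes the series usable for arithmetic again,
# e.g. the mean gives the fraction of leavers directly
y_int = left.astype(int)
print(y_int.mean())
```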
Model Building
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
model = LogisticRegression()
model.fit(X_train, y_train)
LogisticRegression()
ypred = model.predict(X_test)
model.score(X_test,y_test)
0.7753333333333333
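Accuracy alone can flatter a model on imbalanced classes: about 76% of employees stayed, so always predicting "stayed" would already score around 0.76. A hedged sketch, using hypothetical labels in place of `y_test` and `ypred`, of inspecting the confusion matrix and per-class scores:

```python
from sklearn.metrics import confusion_matrix, classification_report

# Hypothetical labels standing in for y_test / ypred
y_true = ['0', '0', '0', '1', '1', '0']
y_hat  = ['0', '0', '1', '1', '0', '0']

# Rows are true classes, columns predicted classes, in the order given by labels
cm = confusion_matrix(y_true, y_hat, labels=['0', '1'])
print(cm)

# Per-class precision, recall and F1 expose weakness on the minority class
print(classification_report(y_true, y_hat))
```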
Exciting Milestone: Successfully trained my first logistic regression model, one more step in my
journey into data science and predictive analytics. Looking forward to exploring more complex
algorithms and applications!