
Machine Learning Lab

DA-2
Date: 17.08.2024
Name: Arnav Bahuguna
Reg: 21BCE3795

Q1. The Breast_Cancer data set (available in the sklearn library) has 30 baseline
variables: 'mean radius', 'mean texture', 'mean perimeter', 'mean area', 'mean smoothness',
'mean compactness', 'mean concavity', 'mean concave points', 'mean symmetry', 'mean
fractal dimension', 'radius error', 'texture error', 'perimeter error', 'area error', 'smoothness
error', 'compactness error', 'concavity error', 'concave points error', 'symmetry error', 'fractal
dimension error', 'worst radius', 'worst texture', 'worst perimeter', 'worst area', 'worst
smoothness', 'worst compactness', 'worst concavity', 'worst concave points', 'worst
symmetry' and 'worst fractal dimension', obtained for each of n = 569 patients, as well as
the response of interest, a quantitative measure of disease progression one year after
baseline.

1. Apply a standard scaler to the independent features listed below and do a regression
analysis of the impact of 'mean texture', 'mean area' and 'mean compactness' on
'mean fractal dimension'.
2. Evaluate the performance of the regression model using R2, MSE, MAE and SSE
(definitions sketched below).
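
For reference, all four metrics reduce to simple functions of the residuals. A minimal
NumPy sketch of the definitions (an illustrative helper, not part of the assignment code):

import numpy as np

def regression_metrics(y_true, y_pred):
    # Residuals between observed and predicted values.
    resid = y_true - y_pred
    sse = np.sum(resid ** 2)                  # sum of squared errors
    mse = sse / len(y_true)                   # mean squared error = SSE / n
    mae = np.mean(np.abs(resid))              # mean absolute error
    r2 = 1 - sse / np.sum((y_true - np.mean(y_true)) ** 2)  # 1 - SSE/SST
    return sse, mse, mae, r2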

Q2. Download the employee retention dataset from here:


https://www.kaggle.com/giripujar/hr-analytics

1. Do some exploratory data analysis to figure out which variables have a direct and clear
impact on employee retention (i.e., whether employees leave the company or continue to work).
2. Now build a logistic regression model using the variables that were narrowed down in step 1.
3. Measure the accuracy (precision, recall, F1 and ROC) of the model (definitions sketched
below).
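
For reference, precision, recall and F1 all derive from confusion-matrix counts. A minimal
sketch of the definitions (an illustrative helper, not part of the assignment code):

import numpy as np

def classification_metrics(y_true, y_pred):
    # Counts for the positive class (label 1).
    tp = np.sum((y_true == 1) & (y_pred == 1))   # true positives
    fp = np.sum((y_true == 0) & (y_pred == 1))   # false positives
    fn = np.sum((y_true == 1) & (y_pred == 0))   # false negatives
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1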

Python Notebook:
Q1: Linear Regression Analysis
import pandas as pd
from sklearn.datasets import load_breast_cancer
import matplotlib.pyplot as plt

bs = load_breast_cancer(as_frame=True)
df = bs.data

df.head()

   mean radius  mean texture  mean perimeter  mean area  mean smoothness  \
0        17.99         10.38          122.80     1001.0          0.11840
1        20.57         17.77          132.90     1326.0          0.08474
2        19.69         21.25          130.00     1203.0          0.10960
3        11.42         20.38           77.58      386.1          0.14250
4        20.29         14.34          135.10     1297.0          0.10030

   mean compactness  mean concavity  mean concave points  mean symmetry  \
0           0.27760          0.3001              0.14710         0.2419
1           0.07864          0.0869              0.07017         0.1812
2           0.15990          0.1974              0.12790         0.2069
3           0.28390          0.2414              0.10520         0.2597
4           0.13280          0.1980              0.10430         0.1809

   mean fractal dimension  ...  worst radius  worst texture  worst perimeter  \
0                 0.07871  ...         25.38          17.33           184.60
1                 0.05667  ...         24.99          23.41           158.80
2                 0.05999  ...         23.57          25.53           152.50
3                 0.09744  ...         14.91          26.50            98.87
4                 0.05883  ...         22.54          16.67           152.20

   worst area  worst smoothness  worst compactness  worst concavity  \
0      2019.0            0.1622             0.6656           0.7119
1      1956.0            0.1238             0.1866           0.2416
2      1709.0            0.1444             0.4245           0.4504
3       567.7            0.2098             0.8663           0.6869
4      1575.0            0.1374             0.2050           0.4000

   worst concave points  worst symmetry  worst fractal dimension
0                0.2654          0.4601                  0.11890
1                0.1860          0.2750                  0.08902
2                0.2430          0.3613                  0.08758
3                0.2575          0.6638                  0.17300
4                0.1625          0.2364                  0.07678

[5 rows x 30 columns]

X = df[['mean texture', 'mean area', 'mean compactness']]
y = df['mean fractal dimension']

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=101)

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
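
The scaler is fit on the training split only, so no information from the test set leaks into
the transform. A quick sanity check (not in the original notebook; illustrative only) is that
each scaled training column has mean ~0 and standard deviation ~1:

import numpy as np

# After StandardScaler, each training-set column should be roughly
# zero-mean with unit standard deviation.
print(np.round(X_train.mean(axis=0), 6))   # ~[0. 0. 0.]
print(np.round(X_train.std(axis=0), 6))    # ~[1. 1. 1.]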

from sklearn.linear_model import LinearRegression

model = LinearRegression()

model.fit(X_train, y_train)
preds = model.predict(X_test)
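
Since the task asks about the impact of each feature on 'mean fractal dimension', the fitted
coefficients are worth inspecting; because the inputs were standardized, their magnitudes are
directly comparable (this check is an addition, not part of the original run):

# Standardized coefficients: comparable in magnitude across features.
for name, coef in zip(['mean texture', 'mean area', 'mean compactness'], model.coef_):
    print(f"{name}: {coef:.5f}")
print("intercept:", model.intercept_)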

from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# sklearn metrics take (y_true, y_pred) in that order.
print("Mean absolute error: ", mean_absolute_error(y_test, preds))
print("Mean squared error: ", mean_squared_error(y_test, preds))
print("R2 score: ", r2_score(y_test, preds))
# SSE = MSE * n, since MSE is just the mean of the squared errors.
print("Sum of squared error: ", mean_squared_error(y_test, preds)*len(y_test))

Mean absolute error:  0.002677190702938391
Mean squared error:  1.2463526001898355e-05
R2 score:  0.6673819543520271
Sum of squared error:  0.0021312629463246186

Q2: Logistic Regression Analysis
df1 = pd.read_csv('HR_comma_sep.csv')

df1.head()

   satisfaction_level  last_evaluation  number_project  average_montly_hours  \
0                0.38             0.53               2                   157
1                0.80             0.86               5                   262
2                0.11             0.88               7                   272
3                0.72             0.87               5                   223
4                0.37             0.52               2                   159

   time_spend_company  Work_accident  left  promotion_last_5years Department  \
0                   3              0     1                      0      sales
1                   6              0     1                      0      sales
2                   4              0     1                      0      sales
3                   5              0     1                      0      sales
4                   3              0     1                      0      sales

   salary
0     low
1  medium
2  medium
3     low
4     low

df1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14999 entries, 0 to 14998
Data columns (total 10 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 satisfaction_level 14999 non-null float64
1 last_evaluation 14999 non-null float64
2 number_project 14999 non-null int64
3 average_montly_hours 14999 non-null int64
4 time_spend_company 14999 non-null int64
5 Work_accident 14999 non-null int64
6 left 14999 non-null int64
7 promotion_last_5years 14999 non-null int64
8 Department 14999 non-null object
9 salary 14999 non-null object
dtypes: float64(2), int64(6), object(2)
memory usage: 1.1+ MB

df1['Department'].unique()

array(['sales', 'accounting', 'hr', 'technical', 'support', 'management',
       'IT', 'product_mng', 'marketing', 'RandD'], dtype=object)

from sklearn.preprocessing import OneHotEncoder

encoder = OneHotEncoder(drop='first', sparse_output=False)

encoded_features = encoder.fit_transform(df1[['salary']])
encoded_df = pd.DataFrame(encoded_features,
                          columns=encoder.get_feature_names_out(['salary']))
df1 = df1.join(encoded_df)
df1.drop(['salary'], axis=1, inplace=True)
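
As an aside (an alternative, not what the notebook uses), pandas can do the same encoding in
one call; applied to the original frame before the encoder step, drop_first=True likewise drops
the alphabetically first level ('high') as the baseline:

# Hypothetical one-line equivalent: yields salary_low and salary_medium
# columns with the 'high' level dropped.
df1_alt = pd.get_dummies(df1, columns=['salary'], drop_first=True)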

df1.head()

   satisfaction_level  last_evaluation  number_project  average_montly_hours  \
0                0.38             0.53               2                   157
1                0.80             0.86               5                   262
2                0.11             0.88               7                   272
3                0.72             0.87               5                   223
4                0.37             0.52               2                   159

   time_spend_company  Work_accident  left  promotion_last_5years Department  \
0                   3              0     1                      0      sales
1                   6              0     1                      0      sales
2                   4              0     1                      0      sales
3                   5              0     1                      0      sales
4                   3              0     1                      0      sales

   salary_low  salary_medium
0         1.0            0.0
1         0.0            1.0
2         0.0            1.0
3         1.0            0.0
4         1.0            0.0

import seaborn as sns

plt.figure(figsize=(10,8), dpi=200)
sns.heatmap(df1.drop('Department', axis=1).corr(), cmap='coolwarm', annot=True)

<Axes: >

X = df1[['last_evaluation', 'number_project', 'average_montly_hours',
         'time_spend_company', 'salary_low']]
y = df1['left']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=101)

scaler = StandardScaler()

X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

from sklearn.linear_model import LogisticRegressionCV

model = LogisticRegressionCV(class_weight='balanced')

model.fit(X_train, y_train)
preds = model.predict(X_test)
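
LogisticRegressionCV cross-validates the regularization strength C internally, and
class_weight='balanced' reweights the classes inversely to their frequencies, which matters
here because far fewer employees left than stayed. The selected strength can be inspected
after fitting (an extra check, not in the original run):

# C_ holds the regularization strength chosen by the internal cross-validation.
print("Selected C:", model.C_)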

from sklearn.metrics import precision_score, recall_score, f1_score, roc_curve

# sklearn metrics take (y_true, y_pred) in that order.
print("precision: ", precision_score(y_test, preds))
print("recall: ", recall_score(y_test, preds))
print("f1 score: ", f1_score(y_test, preds))

precision:  0.3768561187916027
recall:  0.6884939195509823
f1 score:  0.48709463931171415

fpr, tpr, thresh = roc_curve(y_test, preds)

plt.plot(fpr, tpr)
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')

Text(0, 0.5, 'True Positive Rate')
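
Because preds contains hard 0/1 labels, the ROC curve above has only one interior point. A
smoother curve, plus an AUC figure, comes from the predicted probabilities; a sketch of that
variant (not in the original run):

from sklearn.metrics import roc_auc_score

# Probability of the positive class ('left' == 1) for each test sample.
probs = model.predict_proba(X_test)[:, 1]

fpr, tpr, thresh = roc_curve(y_test, probs)
plt.plot(fpr, tpr)
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
print("ROC AUC:", roc_auc_score(y_test, probs))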
