DA-2
Date: 17.08.2024
Name: Arnav Bahuguna
Reg: 21BCE3795
Q1. The dataset Breast_Cancer (available in the sklearn library) has 30 baseline
variables: 'mean radius', 'mean texture', 'mean perimeter', 'mean area', 'mean smoothness',
'mean compactness', 'mean concavity', 'mean concave points', 'mean symmetry', 'mean
fractal dimension', 'radius error', 'texture error', 'perimeter error', 'area error', 'smoothness
error', 'compactness error', 'concavity error', 'concave points error', 'symmetry error', 'fractal
dimension error', 'worst radius', 'worst texture', 'worst perimeter', 'worst area', 'worst
smoothness', 'worst compactness', 'worst concavity', 'worst concave points', 'worst
symmetry' and 'worst fractal dimension', measured for each of n = 569 patients, along with
the response of interest, a quantitative measure of disease progression one year after
baseline.
1. Apply a standard scaler to the independent features mentioned below and do a regression
analysis of the impact on 'mean fractal dimension' of the features 'mean texture',
'mean area' and 'mean compactness'.
2. Evaluate the performance of the regression model using R2, MSE, MAE and SSE.
Q2. The second task uses an HR dataset of 14999 employees (df1 below).
1. Do some exploratory data analysis to figure out which variables have a direct and clear
impact on employee retention (i.e., whether they leave the company or continue to work).
2. Now build a logistic regression model using the variables that were narrowed down in step 1.
3. Measure the accuracy (precision, recall, F1 and ROC) of the model.
Python Notebook:
Q1: Linear Regression Analysis
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt
bs = load_breast_cancer(as_frame=True)
df = bs.data
df.head()
[5 rows x 30 columns]
# Features and target as specified in the question
X = df[['mean texture', 'mean area', 'mean compactness']]
y = df['mean fractal dimension']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
model = LinearRegression()
model.fit(X_train, y_train)
preds = model.predict(X_test)
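Step 2 of Q1 asks for R2, MSE, MAE and SSE but the notebook shows no evaluation cell. A minimal sketch that rebuilds the same pipeline end to end and computes all four metrics (the test_size and random_state values are assumptions, not from the original):

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error

# Rebuild the Q1 regression: three scaled features predicting 'mean fractal dimension'
df = load_breast_cancer(as_frame=True).data
X = df[['mean texture', 'mean area', 'mean compactness']]
y = df['mean fractal dimension']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

model = LinearRegression().fit(X_train, y_train)
preds = model.predict(X_test)

# Step 2 metrics; SSE is the sum of squared errors, i.e. n * MSE
r2 = r2_score(y_test, preds)
mse = mean_squared_error(y_test, preds)
mae = mean_absolute_error(y_test, preds)
sse = np.sum((y_test.to_numpy() - preds) ** 2)
print(f"R2={r2:.4f}  MSE={mse:.6f}  MAE={mae:.6f}  SSE={sse:.4f}")
```

Since sklearn has no SSE helper, it is derived directly from the residuals; it should always equal MSE times the number of test samples.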
Q2: Logistic Regression Analysis
df1.head()
salary
0 low
1 medium
2 medium
3 low
4 low
df1.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14999 entries, 0 to 14998
Data columns (total 10 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 satisfaction_level 14999 non-null float64
1 last_evaluation 14999 non-null float64
2 number_project 14999 non-null int64
3 average_montly_hours 14999 non-null int64
4 time_spend_company 14999 non-null int64
5 Work_accident 14999 non-null int64
6 left 14999 non-null int64
7 promotion_last_5years 14999 non-null int64
8 Department 14999 non-null object
9 salary 14999 non-null object
dtypes: float64(2), int64(6), object(2)
memory usage: 1.1+ MB
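The head/info/heatmap cells here are the EDA of step 1; a common complementary check is to compare feature means between leavers (left=1) and stayers (left=0). A minimal sketch on a hypothetical mini-sample (the column names match df1, the values are made up; the same groupby works on the full 14999-row frame):

```python
import pandas as pd

# Hypothetical mini-sample with df1's column names (values are illustrative only)
df1 = pd.DataFrame({
    'satisfaction_level': [0.38, 0.80, 0.11, 0.72, 0.37, 0.41],
    'average_montly_hours': [157, 262, 272, 159, 153, 247],
    'promotion_last_5years': [0, 0, 0, 0, 0, 0],
    'left': [1, 0, 1, 0, 1, 1],
})

# Mean of each numeric feature per retention group; large gaps between the
# left=0 and left=1 rows suggest variables with a direct impact on retention.
impact = df1.groupby('left').mean()
print(impact)
```

On the real dataset this kind of summary typically singles out satisfaction_level as the strongest separator, which the correlation heatmap below also shows.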
df1['Department'].unique()
from sklearn.preprocessing import OneHotEncoder
encoder = OneHotEncoder(sparse_output=False)
encoded_features = encoder.fit_transform(df1[['salary']])
encoded_df = pd.DataFrame(encoded_features,
                          columns=encoder.get_feature_names_out(['salary']))
df1 = df1.join(encoded_df)
df1.drop(['salary'], axis=1, inplace=True)
df1.head()
import seaborn as sns
plt.figure(figsize=(10,8), dpi=200)
sns.heatmap(df1.drop('Department', axis=1).corr(), cmap='coolwarm',
            annot=True)
<Axes: >
from sklearn.linear_model import LogisticRegressionCV
# Features/target: drop the label 'left' and the still-unencoded 'Department' column
X = df1.drop(['left', 'Department'], axis=1)
y = df1['left']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
model = LogisticRegressionCV(class_weight='balanced')
model.fit(X_train, y_train)
preds = model.predict(X_test)
precision: 0.6884939195509823
recall: 0.3768561187916027
f1 score: 0.48709463931171415
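Step 3 also asks for ROC, which the printout above omits. A hedged sketch of computing all four metrics on toy labels (on the real split, pass y_test and preds, and use model.predict_proba(X_test)[:, 1] as the score for ROC-AUC):

```python
from sklearn.metrics import precision_score, recall_score, f1_score, roc_auc_score

# Toy labels, hard predictions and probability scores standing in for the
# real y_test / preds / predict_proba output
y_true = [0, 0, 1, 1, 0, 1, 0, 1]
y_pred = [0, 1, 1, 0, 0, 1, 0, 1]
y_score = [0.1, 0.7, 0.8, 0.4, 0.2, 0.9, 0.3, 0.6]

print("precision:", precision_score(y_true, y_pred))  # → 0.75
print("recall:", recall_score(y_true, y_pred))        # → 0.75
print("f1 score:", f1_score(y_true, y_pred))          # → 0.75
print("roc auc:", roc_auc_score(y_true, y_score))     # → 0.875
```

Note that ROC-AUC is computed from scores rather than hard 0/1 predictions, which is why it needs predict_proba instead of preds.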