
ST. MARTIN'S ENGINEERING COLLEGE


UGC Autonomous
NBA & NAAC A+ Accredited
Dhulapally, Secunderabad – 500100

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING


MACHINE LEARNING LAB
SUBJECT CODE: CS604PC
B.Tech - III Year - II Semester

ACADEMIC YEAR: 2024-2025


PROGRAMS:
1) Write a Python program to compute Central Tendency Measures (Mean, Median, Mode)
and Measures of Dispersion (Variance, Standard Deviation)
CODE :
import statistics

# Function to compute Central Tendency and Dispersion Measures


def compute_statistics(data):
    # Central Tendency Measures
    mean = statistics.mean(data)
    median = statistics.median(data)
    mode = statistics.mode(data)

    # Measures of Dispersion
    variance = statistics.variance(data)
    std_deviation = statistics.stdev(data)

    # Return all the computed measures
    return {
        "Mean": mean,
        "Median": median,
        "Mode": mode,
        "Variance": variance,
        "Standard Deviation": std_deviation
    }

# Example Data
data = [10, 12, 12, 14, 16, 18, 18, 20, 20, 22]

# Call the function and print the results
statistics_results = compute_statistics(data)

for measure, value in statistics_results.items():
    print(f"{measure}: {value}")

OUTPUT :
Mean: 16.2
Median: 16.0
Mode: 12
Variance: 16.666666666666668
Standard Deviation: 4.08248290463863

Explanation:
1. Mean: The average of the numbers.
2. Median: The middle number when the data is sorted. If the list has an even number of elements, the median is
the average of the two middle numbers.
3. Mode: The number that appears most frequently in the data.
4. Variance: Measures the spread of the numbers in the dataset, based on the squared differences from the mean.
Note that statistics.variance() computes the sample variance (dividing by n - 1).
5. Standard Deviation: The square root of the variance, representing the spread of the dataset in the same units
as the data.
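As a quick cross-check (an optional sketch, not part of the required program), NumPy gives the same values; note that np.var() and np.std() default to the population formulas, so ddof=1 is needed to match statistics.variance() and statistics.stdev():

import numpy as np

data = [10, 12, 12, 14, 16, 18, 18, 20, 20, 22]

# ddof=1 switches NumPy from the population formula to the sample formula (divide by n - 1)
print("Variance:", np.var(data, ddof=1))            # 16.666...
print("Standard Deviation:", np.std(data, ddof=1))  # 4.0824...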

2) Study of Python Basic Libraries such as Statistics, Math, Numpy and Scipy
CODE :
Statistics Library
The statistics module in Python provides functions for basic statistical operations.

Common Functions:

 mean(data): Calculates the arithmetic mean (average) of the data.


 median(data): Returns the median (middle value) of the data.
 mode(data): Returns the most frequent value in the data.
 variance(data): Computes the variance of the data.
 stdev(data): Computes the standard deviation of the data.
 pstdev(data): Computes the population standard deviation.
 pvariance(data): Computes the population variance.

import statistics as stats

data = [1, 2, 2, 3, 4, 5, 5, 5]
mean_val = stats.mean(data)
median_val = stats.median(data)
mode_val = stats.mode(data)

print(f"Mean: {mean_val}, Median: {median_val}, Mode: {mode_val}")

OUTPUT :
Mean: 3.375, Median: 3.5, Mode: 5
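The population variants pstdev() and pvariance() listed above differ from the sample versions only in the divisor (n instead of n - 1); a small illustrative sketch:

import statistics as stats

data = [1, 2, 2, 3, 4, 5, 5, 5]

# Sample statistics divide by (n - 1); population statistics divide by n
print("Sample variance:", stats.variance(data), "Population variance:", stats.pvariance(data))
print("Sample std dev:", stats.stdev(data), "Population std dev:", stats.pstdev(data))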

Math Library
The math module provides mathematical functions such as trigonometric functions, logarithms, and
constants.

Common Functions:

 sqrt(x): Returns the square root of x.


 pow(x, y): Returns x raised to the power of y.
 log(x, base): Returns the logarithm of x with the specified base.
 sin(x), cos(x), tan(x): Trigonometric functions.
 pi: The mathematical constant π (approx 3.14159).
 e: The mathematical constant e (approx 2.71828).
 factorial(x): Returns the factorial of x.

import math

x = 16
y = 2

sqrt_val = math.sqrt(x)
power_val = math.pow(x, y)
log_val = math.log(x, 2)

print(f"Square Root: {sqrt_val}, Power: {power_val}, Logarithm: {log_val}")


OUTPUT :
Square Root: 4.0, Power: 256.0, Logarithm: 4.0

Numpy Library
numpy (Numerical Python) is a powerful library for numerical computing. It is particularly useful for
handling large datasets, multidimensional arrays, and performing operations on them.

Common Functions:
 array(data): Converts a list or other data structure into a NumPy array.
 mean(arr): Returns the mean of a NumPy array.
 std(arr): Returns the standard deviation of a NumPy array.
 var(arr): Returns the variance of a NumPy array.
 sum(arr): Computes the sum of all elements in the array.
 reshape(): Reshapes the array.
 linspace(start, stop, num): Returns num evenly spaced values between start and stop.

import numpy as np

arr = np.array([1, 2, 3, 4, 5])


mean_val = np.mean(arr)
std_dev = np.std(arr)
sum_val = np.sum(arr)

print(f"Mean: {mean_val}, Standard Deviation: {std_dev}, Sum: {sum_val}")

OUTPUT :
Mean: 3.0, Standard Deviation: 1.4142135623730951, Sum: 15

Scipy Library
scipy is a scientific computing library that builds on numpy and provides additional functionality for
optimization, integration, interpolation, and other advanced mathematical and statistical functions.

Common Functions:

 scipy.stats: Contains functions for statistical tests, probability distributions, etc.


o norm.pdf(x) - Probability density function of the normal distribution.
o norm.cdf(x) - Cumulative distribution function of the normal distribution.
o ttest_ind(a, b) - Independent t-test for means of two samples.
o pearsonr(x, y) - Pearson correlation coefficient.
 scipy.optimize: Provides optimization algorithms (e.g., minimizing functions).
 scipy.integrate: Functions for integration.
 scipy.interpolate: Functions for interpolation.

from scipy import stats


import numpy as np

# Example: Statistical testing (e.g., t-test for model evaluation)


data_1 = np.array([1, 2, 3, 4, 5])
data_2 = np.array([2, 3, 4, 5, 6])

t_stat, p_val = stats.ttest_ind(data_1, data_2)


print(f"T-statistic: {t_stat}, P-value: {p_val}")

OUTPUT:
T-statistic: -1.0, P-value: 0.34659350708733416
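The other scipy functions listed above can be tried in the same way; a brief optional sketch (the values in comments are approximate):

from scipy import stats, optimize, integrate
import numpy as np

# Normal distribution: density and cumulative probability at x = 0
print("norm.pdf(0):", stats.norm.pdf(0))   # about 0.3989
print("norm.cdf(0):", stats.norm.cdf(0))   # 0.5

# Pearson correlation coefficient between two small samples
r, p = stats.pearsonr([1, 2, 3, 4, 5], [2, 4, 6, 8, 10])
print("Pearson r:", r)                     # 1.0 (perfect linear relationship)

# Minimize a simple quadratic f(x) = (x - 3)^2; the minimum is at x = 3
result = optimize.minimize(lambda x: (x[0] - 3) ** 2, x0=[0.0])
print("Minimum found at x =", result.x[0])

# Numerically integrate sin(x) from 0 to pi (the exact value is 2)
area, error = integrate.quad(np.sin, 0, np.pi)
print("Integral of sin(x) on [0, pi]:", area)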
3) Study of Python Libraries for ML applications such as Pandas and Matplotlib

Pandas Library
Pandas is one of the most popular Python libraries for data manipulation and analysis. It is built on top of
NumPy and provides efficient, easy-to-use data structures (like DataFrames) that allow you to handle large
datasets for data preprocessing, exploration, and cleaning in ML applications.

Key Features of Pandas:

 DataFrame: A 2-dimensional labeled data structure, similar to a table or spreadsheet.


 Series: A one-dimensional labeled array that can hold any data type.
 Handling Missing Data: Functions like isnull(), dropna(), fillna() to handle missing data.
 Data Selection and Filtering: Allows you to filter and slice data easily.
 Grouping and Aggregation: Supports groupby functionality for split-apply-combine operations.
 Data Alignment: Automatic alignment of data in series and DataFrames.
 Merging and Joining: Functions like merge() and concat() to combine data from multiple sources.
 Time Series Support: Excellent handling of time series data for date-time indexing and analysis.

Common Functions in Pandas:

 pd.read_csv(): Reads data from CSV files into a DataFrame.


 df.head(): Returns the first few rows of a DataFrame.
 df.describe(): Generates descriptive statistics.
 df.dropna(): Removes missing values.
 df.fillna(): Fills missing values with a specified value.
 df.groupby(): Groups data by certain columns for aggregation or analysis.

import pandas as pd
# Example: Loading and analyzing a dataset
data = pd.DataFrame({
    'Age': [22, 25, 27, 30, 22],
    'Salary': [50000, 60000, 65000, 70000, 55000]
})

# Handling missing values (filling missing values with mean)


data['Age'].fillna(data['Age'].mean(), inplace=True)

# Descriptive statistics for the dataset


print(data.describe())

# Correlation between features


correlation = data.corr()
print(f"Correlation Matrix:\n{correlation}")

OUTPUT :
Age Salary
count 5.000000 5.00000
mean 25.200000 60000.00000
std 3.420526 7905.69415
min 22.000000 50000.00000
25% 22.000000 55000.00000
50% 25.000000 60000.00000
75% 27.000000 65000.00000
max 30.000000 70000.00000
Correlation Matrix:
Age Salary
Age 1.000000 0.970725
Salary 0.970725 1.000000

Use Case in ML:


 Data Preprocessing: Cleaning, handling missing values, transforming features.
 Feature Engineering: Creating new features from existing ones, like extracting parts of a date (year, month,
day).
 Data Exploration: Summarizing datasets using statistical functions.
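The groupby() and merge() functions listed above are not used in the example; a minimal sketch with made-up employee data:

import pandas as pd

employees = pd.DataFrame({
    'EmpID': [1, 2, 3, 4],
    'Dept': ['IT', 'IT', 'HR', 'HR'],
    'Salary': [50000, 60000, 45000, 47000]
})
departments = pd.DataFrame({
    'Dept': ['IT', 'HR'],
    'Location': ['Hyderabad', 'Secunderabad']
})

# Grouping and aggregation: average salary per department
print(employees.groupby('Dept')['Salary'].mean())

# Merging two DataFrames on a common column
print(pd.merge(employees, departments, on='Dept'))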

Matplotlib Library
Matplotlib is a powerful plotting library used for data visualization. In machine learning, visualizing data
and model performance (e.g., through plots, graphs, and charts) is crucial for understanding patterns,
identifying issues, and presenting findings.

Key Features of Matplotlib:

 Basic Plotting: Line plots, bar plots, scatter plots, histograms, etc.
 Customization: Extensive options for customizing labels, titles, colors, legends, and axes.
 Multiple Plots: Support for creating subplots and combining multiple graphs.
 Interactive Plots: Integration with interactive environments like Jupyter Notebooks.
 Save Plots: Save plots as images in various formats (e.g., PNG, JPEG, SVG).
 3D Plotting: Capabilities for creating 3D graphs.
 Styling: Use of styles for consistent looks across plots (e.g., dark background, gridlines).

Common Functions in Matplotlib:

 plt.plot(): Basic line plot.


 plt.scatter(): Scatter plot.
 plt.hist(): Histogram plot.
 plt.show(): Displays the plot.
 plt.xlabel(), plt.ylabel(), plt.title(): Add labels and titles.
 plt.legend(): Adds a legend to the plot.
 plt.subplot(): Creates multiple subplots.
import matplotlib.pyplot as plt
import numpy as np

# Create some data


x = np.linspace(0, 10, 100)
y = np.sin(x)

# Plot the data


plt.plot(x, y, label="Sine Wave")

# Customize the plot


plt.xlabel("X-axis")
plt.ylabel("Y-axis")
plt.title("Sine Wave Example")
plt.legend()

# Show the plot


plt.show()

OUTPUT:
Use Case in ML:

 Data Visualization: Understanding the distribution of data using histograms, scatter plots, and box plots.
 Model Evaluation: Visualizing metrics like accuracy, loss curves, confusion matrices, ROC curves, etc.
 Exploratory Data Analysis (EDA): Visualizing relationships between variables, checking for patterns, and
spotting outliers.
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
import seaborn as sns

# Load Iris dataset


iris = load_iris()
data = pd.DataFrame(data=iris.data, columns=iris.feature_names)
data['target'] = iris.target

# Data Cleaning (checking for missing values)


print(data.isnull().sum())
data = data.drop_duplicates()
print(data.isnull().sum())

# Exploratory Data Analysis (EDA)


data.hist(figsize=(10, 8), bins=20)
plt.suptitle("Histograms of Features", fontsize=16)
plt.show()

sns.pairplot(data, hue="target", palette="Set1")


plt.show()

plt.figure(figsize=(8, 6))
sns.boxplot(data=data.drop('target', axis=1))
plt.title("Boxplot of Features")
plt.show()

# Feature Engineering (Splitting data and scaling)


X = data.drop('target', axis=1)
y = data['target']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
random_state=42)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

4) Write a Python program to implement Simple Linear Regression

CODE:
Simple Linear Regression
In Simple Linear Regression, we try to model the relationship between two variables x and y by
fitting a straight line to the data. The equation of the line is:

y = m·x + b

Where:

 y is the dependent variable (target variable).
 x is the independent variable (feature).
 m is the slope (coefficient).
 b is the intercept.

The goal is to find the values of m and b that minimize the difference between the predicted and
actual values (i.e., minimize the error).

We use the least squares method to compute the slope m and the intercept b.

Formula:

 Slope: m = (n·Σxy − Σx·Σy) / (n·Σx² − (Σx)²)
 Intercept: b = (Σy − m·Σx) / n

Where:

 n is the number of data points.
 Σxy is the sum of the products of x and y.
 Σx is the sum of the x-values.
 Σy is the sum of the y-values.
 Σx² is the sum of the squares of the x-values.
import numpy as np
import matplotlib.pyplot as plt

# Sample dataset (input x and output y)


x = np.array([1, 2, 3, 4, 5]) # Independent variable (input)
y = np.array([1, 3, 3, 2, 5]) # Dependent variable (output)

# Step 1: Calculate the necessary sums


n = len(x) # Number of data points
sum_x = np.sum(x)
sum_y = np.sum(y)
sum_xy = np.sum(x * y)
sum_x_squared = np.sum(x ** 2)

# Step 2: Calculate the coefficients (m and b)


m = (n * sum_xy - sum_x * sum_y) / (n * sum_x_squared - sum_x ** 2) # Slope
b = (sum_y - m * sum_x) / n # Intercept

# Output the results


print(f"Slope (m): {m}")
print(f"Intercept (b): {b}")

# Step 3: Make predictions


y_pred = m * x + b

# Step 4: Plotting the data and the regression line


plt.scatter(x, y, color='blue', label='Data points')
plt.plot(x, y_pred, color='red', label='Regression Line')
plt.xlabel('X')
plt.ylabel('Y')
plt.title('Simple Linear Regression')
plt.legend()
plt.show()

# Step 5: Calculate the R-squared value (optional, for model evaluation)


ss_total = np.sum((y - np.mean(y)) ** 2)
ss_residual = np.sum((y - y_pred) ** 2)
r_squared = 1 - (ss_residual / ss_total)

print(f"R-squared value: {r_squared}")


OUTPUT :
R-squared value: 0.5568181818181819

Explanation:
1. Data: We have an array x (independent variable) and an array y (dependent variable).
2. Coefficients: The slope m and the intercept b are computed directly from the least squares formulas above.
3. Prediction: The fitted line y_pred = m * x + b gives the predicted y-values for the given x-values.
4. Matplotlib Plot: We use matplotlib to visualize the data points and the fitted regression line.
5. R-squared: The R-squared value compares the residual sum of squares to the total sum of squares,
measuring how well the line explains the variation in y.
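As a sanity check (an optional sketch, not part of the required program), NumPy's polyfit() computes the same least-squares slope and intercept:

import numpy as np

x = np.array([1, 2, 3, 4, 5])
y = np.array([1, 3, 3, 2, 5])

# A degree-1 polynomial fit returns [slope, intercept] by least squares
m, b = np.polyfit(x, y, 1)
print(f"Slope (m): {m}, Intercept (b): {b}")  # matches the manual computation above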
5 ) Implementation of Multiple Linear Regression for House Price Prediction using
sklearn

What is Multiple Linear Regression?


Multiple Linear Regression (MLR) is an extension of simple linear regression that predicts a dependent
variable (house price, in this case) using more than one independent variable (features like number of
bedrooms, size of the house, etc.). The general formula for MLR is:

y = b0 + b1·x1 + b2·x2 + ⋯ + bn·xn

Where:

 y is the predicted house price.
 b0 is the intercept (constant term).
 b1, b2, …, bn are the coefficients of the features x1, x2, …, xn respectively.

Steps:
1. Prepare the data: This includes multiple features like square footage, number of bedrooms,
location, etc.
2. Train the model: Use sklearn.linear_model.LinearRegression to train the model.
3. Predict house prices: Use the trained model to predict prices based on new data.
4. Evaluate the model: We can use metrics like R² (coefficient of determination) to evaluate the model.

Implementation using scikit-learn:

# Importing the necessary libraries


import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
import matplotlib.pyplot as plt
# Sample dataset for house prices
data = {
    'Square_Feet': [1500, 1800, 2400, 3000, 3500, 4000, 5000],
    'Num_Bedrooms': [3, 4, 3, 5, 4, 4, 5],
    'Num_Bathrooms': [2, 3, 2, 3, 3, 4, 4],
    'Age_of_House': [10, 15, 20, 5, 12, 8, 6],
    'Price': [400000, 500000, 600000, 650000, 700000, 750000, 800000]  # Target variable
}

# Converting the data into a Pandas DataFrame


df = pd.DataFrame(data)

# Feature matrix (X) and target variable (y)


X = df[['Square_Feet', 'Num_Bedrooms', 'Num_Bathrooms', 'Age_of_House']]  # Independent features
y = df['Price']  # Dependent variable (House Price)

# Splitting the data into training and testing sets


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
random_state=42)

# Initializing the Linear Regression model


model = LinearRegression()

# Fitting the model to the training data


model.fit(X_train, y_train)

# Making predictions using the test data


y_pred = model.predict(X_test)

# Model Evaluation
print(f"Intercept (b0): {model.intercept_}")
print(f"Coefficients (b1, b2, b3, b4): {model.coef_}")

# Mean Squared Error (MSE)


mse = mean_squared_error(y_test, y_pred)
print(f"Mean Squared Error: {mse}")

# R-squared score (coefficient of determination)


r2 = r2_score(y_test, y_pred)
print(f"R-squared: {r2}")

# Visualizing the actual vs predicted prices


plt.scatter(y_test, y_pred)
plt.xlabel("Actual Prices")
plt.ylabel("Predicted Prices")
plt.title("Actual vs Predicted House Prices")
plt.show()
OUTPUT :

Intercept (b0): 2029999.9999999949


Coefficients (b1, b2, b3, b4): [ 2.0e+02 -2.3e+05 -2.1e+05 -4.0e+04]
Mean Squared Error: 121999999999.99931
R-squared: -47.79999999999973

Explanation of the Code:

Data Preparation:
We create a dataset with multiple features like Square_Feet, Num_Bedrooms, Num_Bathrooms,
Age_of_House, and the target variable Price (house price).

This dataset is converted into a Pandas DataFrame for easy manipulation.

Train/Test Split:

We split the data into training and testing sets using train_test_split() from sklearn. This is
crucial for evaluating the model's performance on unseen data.

Linear Regression Model:

We create a model using LinearRegression() from sklearn.linear_model and fit it to the


training data (X_train, y_train).

After fitting the model, we predict house prices for the test data ( X_test).

Model Evaluation:

We print the intercept and coefficients of the model to understand how each feature affects the house
price.

The Mean Squared Error (MSE) is computed using mean_squared_error(), which gives us an idea
of how much the predicted prices deviate from the actual prices.

The R-squared value (coefficient of determination) is calculated using r2_score(). This tells us how
well the model explains the variance in the data (a value closer to 1 is better).

Visualization:

We plot the actual house prices (y_test) vs predicted house prices (y_pred) using a scatter plot to
visually evaluate the model's predictions.
Interpretation:
 Intercept (b0): This is the predicted house price when all features are zero. It should be read only as the
baseline term of the fitted equation, not as a realistic price.
 Coefficients: Each coefficient is the estimated change in price for a one-unit increase in that feature, holding
the other features fixed. For instance:
o The coefficient of about 2.0e+02 for Square_Feet means the model adds roughly $200 per additional square foot.
o The large negative coefficients for the other features likely reflect the very small, collinear dataset rather
than real effects.
 R-squared: The value of about -47.8 on the test split means the model performs far worse than simply predicting
the mean price; with only two test samples, the evaluation is not reliable. On a realistic dataset, a value closer
to 1 indicates a better fit.
 Mean Squared Error: This value tells us the average squared difference between predicted and actual prices.
A lower value indicates better performance.

Conclusion:
This Python program implements Multiple Linear Regression using scikit-learn to predict house prices
based on multiple features. You can extend this approach to larger datasets or more complex models for
real-world applications. The evaluation metrics (like R-squared and MSE) help assess the model's accuracy
and how well it fits the data.
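Because the sample dataset has only seven rows, the 20% test split contains just two houses, which is why the R-squared above is so poor. As an optional extension (a sketch, not part of the required program), leave-one-out cross-validation gives a more stable error estimate on such a small dataset:

import pandas as pd
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.linear_model import LinearRegression

# Same dataset as in the program above
df = pd.DataFrame({
    'Square_Feet': [1500, 1800, 2400, 3000, 3500, 4000, 5000],
    'Num_Bedrooms': [3, 4, 3, 5, 4, 4, 5],
    'Num_Bathrooms': [2, 3, 2, 3, 3, 4, 4],
    'Age_of_House': [10, 15, 20, 5, 12, 8, 6],
    'Price': [400000, 500000, 600000, 650000, 700000, 750000, 800000]
})
X = df.drop('Price', axis=1)
y = df['Price']

# Leave-one-out: each house is predicted from the other six
scores = cross_val_score(LinearRegression(), X, y,
                         cv=LeaveOneOut(), scoring='neg_mean_squared_error')
print("Average leave-one-out MSE:", -scores.mean())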
6 ) Implementation of Decision tree using sklearn and its parameter tuning

Decision Tree Classification and Parameter Tuning with scikit-learn

In this implementation, we use the Decision Tree Classifier from the scikit-learn library to classify
Iris flowers based on their measurements, and we also perform parameter tuning using GridSearchCV to
find the best model parameters.

A Decision Tree is a supervised machine learning algorithm that is used for both classification and
regression tasks. It splits the data into subsets based on the most significant features and continues splitting
until the data in each subset are as homogeneous as possible.

Steps:
1. Create a Decision Tree model using DecisionTreeClassifier.
2. Train the model using a dataset (the Iris dataset here).
3. Parameter Tuning using GridSearchCV to optimize hyperparameters like max_depth,
min_samples_split, min_samples_leaf, and criterion.
4. Model Evaluation: Evaluate performance using metrics like accuracy and the classification report.
CODE :
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score, classification_report
from sklearn import tree
import matplotlib.pyplot as plt

# Step 1: Load the Iris dataset


iris = load_iris()
X = iris.data # Features
y = iris.target # Target variable (labels)

# Step 2: Split the data into training and test sets


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
random_state=42)

# Step 3: Train a Decision Tree Classifier


dt = DecisionTreeClassifier(random_state=42)
dt.fit(X_train, y_train)

# Step 4: Make predictions and evaluate the model


y_pred = dt.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

print(f"Accuracy of the Decision Tree model: {accuracy * 100:.2f}%")


print("\nClassification Report:")
print(classification_report(y_test, y_pred))

# Step 5: Visualize the decision tree (optional)


plt.figure(figsize=(12,8))
tree.plot_tree(dt, filled=True, feature_names=iris.feature_names,
class_names=iris.target_names, rounded=True, fontsize=10)
plt.show()

# Step 6: Hyperparameter Tuning with GridSearchCV

# Define the parameter grid


param_grid = {
    'max_depth': [3, 5, 7, 10, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'criterion': ['gini', 'entropy']
}

# Set up GridSearchCV to find the best hyperparameters
grid_search = GridSearchCV(estimator=dt, param_grid=param_grid, cv=5, n_jobs=-1, verbose=1)

# Fit GridSearchCV on the training data


grid_search.fit(X_train, y_train)

# Step 7: Print the best hyperparameters found


print(f"\nBest hyperparameters found: {grid_search.best_params_}")

# Step 8: Evaluate the model with the best hyperparameters


best_dt = grid_search.best_estimator_

# Make predictions using the best model


y_pred_best = best_dt.predict(X_test)

# Evaluate the performance of the tuned model


accuracy_best = accuracy_score(y_test, y_pred_best)

print(f"Accuracy of the tuned Decision Tree model: {accuracy_best * 100:.2f}%")


print("\nClassification Report (Tuned Model):")
print(classification_report(y_test, y_pred_best))

OUTPUT:
Accuracy of the Decision Tree model: 100.00%

Classification Report:
precision recall f1-score support

0 1.00 1.00 1.00 10


1 1.00 1.00 1.00 9
2 1.00 1.00 1.00 11

accuracy 1.00 30
macro avg 1.00 1.00 1.00 30
weighted avg 1.00 1.00 1.00 30
Explanation:

Data Preparation:

We load the Iris dataset, which has four features (sepal length, sepal width, petal length, petal width) and a
target variable with three flower species.

This data is split into training and testing sets using train_test_split.

Training the Model:

We initialize a DecisionTreeClassifier and train it on the training data (X_train, y_train).

After training the model, we predict the species on the test data (X_test) and evaluate the model using
accuracy and the classification report.

GridSearchCV for Parameter Tuning:

We define a set of hyperparameters to tune, such as max_depth, min_samples_split,
min_samples_leaf, and criterion. These parameters control the depth of the tree, the minimum
number of samples required to split a node or form a leaf, and the impurity measure used when splitting.

We use GridSearchCV to perform an exhaustive search over the parameter grid and evaluate the model
with 5-fold cross-validation.

The best hyperparameters are printed, and the model with the best parameters is used for prediction and
evaluation.

Model Evaluation:

After tuning, we evaluate the performance of the best model (selected from the grid search) on the test
set and print its accuracy and classification report.

Visual Output:
 The plotted decision tree shows the splits chosen by the classifier; a well-tuned tree should be shallow
enough to generalize while still separating the three species.

Key Hyperparameters of Decision Tree:


1. max_depth: Controls the maximum depth of the tree. Limiting the depth prevents overfitting.
2. min_samples_split: The minimum number of samples required to split an internal node. A higher value helps
prevent overfitting.
3. min_samples_leaf: The minimum number of samples required to be at a leaf node. Helps smooth the model.
4. max_features: The number of features to consider when looking for the best split. Reducing this can prevent
overfitting.
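To see the effect of one of these hyperparameters in isolation, the following optional sketch (using the same Iris train/test split as above) compares test accuracy for several max_depth values:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train one tree per depth and report its test accuracy
for depth in [1, 2, 3, 5, None]:
    clf = DecisionTreeClassifier(max_depth=depth, random_state=42)
    clf.fit(X_train, y_train)
    acc = accuracy_score(y_test, clf.predict(X_test))
    print(f"max_depth={depth}: accuracy={acc:.3f}")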

Conclusion:
 This implementation shows how to use a Decision Tree Classifier on the Iris dataset, and how to tune
the model's hyperparameters using GridSearchCV for optimal performance.
 By tuning the hyperparameters, you can improve the model's ability to generalize and reduce overfitting.
 The final evaluation metrics (accuracy and the classification report) help us assess the quality of the model.

7 ) Implementation of KNN using sklearn


K-Nearest Neighbors (KNN) Classification Using scikit-learn
K-Nearest Neighbors (KNN) is a non-parametric and lazy learning algorithm that can be used for both
classification and regression tasks. In this example, we implement KNN classification on the Iris dataset.

The basic idea behind KNN classification is:

 For a given input, find the K closest data points (neighbors) in the training set based on a distance metric
(usually Euclidean distance).
 Predict the class by taking a majority vote among the classes of those K nearest neighbors.

Steps to Implement:
1. Prepare the dataset: Define features (X) and the target variable (y).
2. Split the data: Use train_test_split to divide the dataset into training and test sets.
3. Train the KNN model: Use KNeighborsClassifier from scikit-learn.
4. Make predictions: Use the trained model to predict the class labels.
5. Evaluate the model: Measure the model's performance using accuracy and the classification report.
6. Tune the model: Try different values of K (number of neighbors) and find the best one.
CODE :
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score, classification_report
import matplotlib.pyplot as plt

# Step 1: Load the Iris dataset


iris = load_iris()
X = iris.data # Features
y = iris.target # Target variable (labels)

# Step 2: Split the data into training and test sets


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Step 3: Train a K-Nearest Neighbors model


knn = KNeighborsClassifier(n_neighbors=5) # Using k=5
knn.fit(X_train, y_train)

# Step 4: Make predictions and evaluate the model


y_pred = knn.predict(X_test)

# Evaluate accuracy
accuracy = accuracy_score(y_test, y_pred)

# Print the results


print(f"Accuracy of the KNN model (k=5): {accuracy * 100:.2f}%")
print("\nClassification Report:")
print(classification_report(y_test, y_pred))

# Step 5: Tuning the value of k


# Let's test different values of k and find the best one
k_values = range(1, 21)
accuracies = []

for k in k_values:
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    y_pred = knn.predict(X_test)
    accuracies.append(accuracy_score(y_test, y_pred))

# Plotting the accuracies for different k values


plt.plot(k_values, accuracies, marker='o')
plt.xlabel('Number of Neighbors (k)')
plt.ylabel('Accuracy')
plt.title('Accuracy vs. Number of Neighbors (k) for KNN')
plt.show()

# Best value of k
best_k = k_values[np.argmax(accuracies)]
print(f"\nBest value of k: {best_k} with accuracy: {max(accuracies) * 100:.2f}%")

OUTPUT :
Accuracy of the KNN model (k=5): 100.00%

Classification Report:
precision recall f1-score support
0 1.00 1.00 1.00 19
1 1.00 1.00 1.00 13
2 1.00 1.00 1.00 13

accuracy 1.00 45
macro avg 1.00 1.00 1.00 45
weighted avg 1.00 1.00 1.00 45

Best value of k: 1 with accuracy: 100.00%


Explanation:

Data Preparation:

1. The Iris dataset is loaded with its four features (sepal length, sepal width, petal length, petal width).
2. We separate the features (X) from the target variable (y).

Train/Test Split:

1. The dataset is split into training and testing sets using train_test_split() to ensure the model
is evaluated on unseen data.

Training the KNN Model:

1. We initialize the KNeighborsClassifier with n_neighbors=5 (i.e., 5 nearest neighbors).
2. We train the KNN model using the training set (X_train, y_train).

Model Prediction and Evaluation:

1. We use the trained model to predict the class labels on the test data (X_test).
2. We evaluate the model's performance using accuracy and the classification report.

Tuning the Model:

1. We experiment with different values of k (the number of neighbors). For each value of k, we fit the
model and compute the test accuracy.
2. A plot is generated to visualize the relationship between the number of neighbors (k) and the model's
accuracy.
3. We select the value of k that gives the highest accuracy. A perfect score on such a small, easy dataset
is common; on harder data, very small values of k tend to overfit.
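Because KNN is distance-based, feature scaling usually matters on datasets whose features have different scales. An optional sketch combining StandardScaler and KNN in a pipeline, evaluated with 5-fold cross-validation:

from sklearn.datasets import load_iris
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# Scaling happens inside the pipeline, so each CV fold is scaled
# using only its own training data (no leakage into the test fold)
knn_pipeline = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
scores = cross_val_score(knn_pipeline, X, y, cv=5)
print("Cross-validated accuracy:", scores.mean())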
8 ) Implementation of Logistic Regression using sklearn

Logistic Regression Using scikit-learn


Logistic Regression is a statistical model used for binary classification problems. It predicts the probability
that a given input belongs to a certain class (usually 0 or 1). Logistic regression outputs a probability that is
mapped to a binary outcome through a logistic function (sigmoid function).

Logistic regression is used for tasks like:

 Predicting if an email is spam (yes/no).


 Predicting if a customer will buy a product (0/1).
 Binary classification tasks in general.

Steps to Implement Logistic Regression:


1. Prepare the dataset: We need a binary classification dataset.
2. Split the data: Use train_test_split to divide the dataset into training and testing sets.
3. Train the logistic regression model using LogisticRegression from scikit-learn.
4. Evaluate the model using accuracy, confusion matrix, and classification report.
5. Predict the class labels for test data.

CODE :

import numpy as np

import pandas as pd

from sklearn.model_selection import train_test_split

from sklearn.linear_model import LogisticRegression

from sklearn.datasets import load_iris

from sklearn.metrics import accuracy_score, classification_report

import matplotlib.pyplot as plt

# Step 1: Load the Iris dataset

iris = load_iris()

X = iris.data # Features (sepal length, sepal width, petal length, petal width)

y = iris.target # Target (species of the flower)


# Step 2: Split the data into training and test sets

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Step 3: Train a Logistic Regression model

log_reg = LogisticRegression(max_iter=200) # max_iter=200 for convergence

log_reg.fit(X_train, y_train)

# Step 4: Make predictions and evaluate the model

y_pred = log_reg.predict(X_test)

# Evaluate accuracy

accuracy = accuracy_score(y_test, y_pred)

# Print the results

print(f"Accuracy of the Logistic Regression model: {accuracy * 100:.2f}%")

print("\nClassification Report:")

print(classification_report(y_test, y_pred))

# Step 5: Optional - Visualizing decision boundary (for 2D features)

# Reduce the feature space to 2 dimensions for visualization purposes

X_train_2d = X_train[:, :2] # Take only the first two features (sepal length and sepal width)

X_test_2d = X_test[:, :2]

# Re-train the Logistic Regression model on 2D data

log_reg_2d = LogisticRegression(max_iter=200)
log_reg_2d.fit(X_train_2d, y_train)

# Create a mesh grid to plot decision boundaries

x_min, x_max = X_train_2d[:, 0].min() - 1, X_train_2d[:, 0].max() + 1

y_min, y_max = X_train_2d[:, 1].min() - 1, X_train_2d[:, 1].max() + 1

xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.02),

np.arange(y_min, y_max, 0.02))

# Predict class labels for all points in the mesh grid

Z = log_reg_2d.predict(np.c_[xx.ravel(), yy.ravel()])

Z = Z.reshape(xx.shape)

# Plot the decision boundary

plt.contourf(xx, yy, Z, alpha=0.8)

plt.scatter(X_train_2d[:, 0], X_train_2d[:, 1], c=y_train, edgecolors='k', marker='o', s=50, cmap=plt.cm.Set1)

plt.xlabel('Sepal Length')

plt.ylabel('Sepal Width')

plt.title('Logistic Regression - Decision Boundary (2D)')

plt.show()
OUTPUT :

Accuracy of the Logistic Regression model: 100.00%

Classification Report:
precision recall f1-score support

0 1.00 1.00 1.00 19


1 1.00 1.00 1.00 13
2 1.00 1.00 1.00 13

accuracy 1.00 45
macro avg 1.00 1.00 1.00 45
weighted avg 1.00 1.00 1.00 45
Explanation:

Data Preparation:
1. We use the Iris dataset from sklearn.datasets. The dataset contains three classes of flowers, and
scikit-learn's LogisticRegression handles this multi-class problem automatically.
2. The dataset consists of four features: sepal length, sepal width, petal length, and petal width. For the
decision-boundary plot we use only the first two features (sepal length and sepal width).

Model Training:

1. We initialize the Logistic Regression model using LogisticRegression(max_iter=200).
2. We train the model on the training set (X_train, y_train) using fit().

Prediction:

1. After training the model, we use it to predict the target variable for the test set (X_test)
with the predict() method.

Evaluation:

1. We calculate the accuracy using accuracy_score() and print the Classification Report using
classification_report().
2. The Classification Report provides precision, recall, F1-score, and support for each class.

Visualization:

1. We visualize the decision boundary of a logistic regression model retrained on just the first two features
(sepal length and sepal width) for simplicity. The contour regions show which class the model predicts in
each part of the feature space.
The accuracy shows how well the model performs on the test data (1.0 means perfect accuracy).

Key Concepts:
Logistic Function (Sigmoid): Logistic regression uses a sigmoid function to map predicted values
(log odds) to probabilities. The output lies between 0 and 1, and we classify samples based on a
threshold (commonly 0.5).

σ(z) = 1 / (1 + e^(-z))

where z = wᵀx + b is a linear combination of the input features x and weights w, and b is the bias term.

Confusion Matrix: The confusion matrix helps us understand how well the model is distinguishing
between classes. The diagonal elements represent the correct predictions, while off-diagonal
elements represent misclassifications.
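A small illustrative sketch of the sigmoid mapping and the 0.5 threshold (standalone, not tied to the fitted model above):

import numpy as np

def sigmoid(z):
    """Logistic (sigmoid) function: maps any real z to a probability in (0, 1)."""
    return 1 / (1 + np.exp(-z))

# The decision threshold of 0.5 corresponds to z = 0
for z in [-3, -1, 0, 1, 3]:
    p = sigmoid(z)
    print(f"z = {z:+d} -> probability = {p:.3f} -> predicted class = {int(p >= 0.5)}")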

Conclusion:
This implementation of Logistic Regression demonstrates the process of training a logistic regression
model for classification tasks using scikit-learn. We evaluated the model using common classification
metrics (accuracy and the classification report) and visualized the decision boundary for two features.
8) Implementation of Logistic Regression using sklearn (alternative implementation: binary classification)

Logistic Regression Using scikit-learn


Logistic Regression is a statistical method used for binary classification tasks (i.e., problems where the
output variable has two classes). It estimates the probability of a binary response based on one or more
predictor variables. Despite its name, Logistic Regression is used for classification rather than regression
tasks.

Steps to Implement Logistic Regression:


1. Prepare the dataset: We need a binary classification dataset.
2. Split the data: Use train_test_split to divide the dataset into training and testing sets.
3. Train the logistic regression model using LogisticRegression from scikit-learn.
4. Evaluate the model using accuracy, confusion matrix, and classification report.
5. Make predictions using the trained model.
CODE :

# Importing necessary libraries


import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from sklearn.datasets import load_iris
import matplotlib.pyplot as plt
import seaborn as sns

# Load Iris dataset (we will use it for binary classification)


iris = load_iris()

# Converting the dataset into a DataFrame


df = pd.DataFrame(data=iris.data, columns=iris.feature_names)
df['target'] = iris.target

# Convert the problem into a binary classification task (class 0 vs class 1)


df_binary = df[df['target'] != 2] # We will consider only class 0 and class 1

# Features (X) and Target (y)


X = df_binary[iris.feature_names] # Features
y = df_binary['target'] # Target variable (binary)

# Split the data into training and testing sets


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Initialize Logistic Regression model


logreg = LogisticRegression()

# Train the Logistic Regression model


logreg.fit(X_train, y_train)

# Predicting the target variable for the test set


y_pred = logreg.predict(X_test)

# Model Evaluation
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy}")

# Confusion Matrix
conf_matrix = confusion_matrix(y_test, y_pred)
print(f"Confusion Matrix:\n{conf_matrix}")

# Classification Report
class_report = classification_report(y_test, y_pred)
print(f"Classification Report:\n{class_report}")

# Visualizing the Confusion Matrix using Seaborn Heatmap


sns.heatmap(conf_matrix, annot=True, fmt="d", cmap="Blues", xticklabels=['Class 0', 'Class 1'],
yticklabels=['Class 0', 'Class 1'])
plt.title("Confusion Matrix")
plt.ylabel('Actual')
plt.xlabel('Predicted')
plt.show()

# Visualizing the decision boundary for two features (e.g., sepal length and sepal width)
plt.figure(figsize=(10, 6))

# Fit logistic regression model again for only two features for easy visualization
logreg.fit(X_train[['sepal length (cm)', 'sepal width (cm)']], y_train)

# Create a mesh grid for plotting decision boundaries


xx, yy = np.meshgrid(np.arange(X_train['sepal length (cm)'].min() - 1, X_train['sepal length (cm)'].max() + 1, 0.01),
                     np.arange(X_train['sepal width (cm)'].min() - 1, X_train['sepal width (cm)'].max() + 1, 0.01))

# Predict class labels for all points in the mesh grid


Z = logreg.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)

# Plotting the decision boundary


plt.contourf(xx, yy, Z, alpha=0.8)
plt.scatter(X_train['sepal length (cm)'], X_train['sepal width (cm)'], c=y_train, edgecolors='k',
marker='o', s=100)
plt.xlabel('Sepal Length (cm)')
plt.ylabel('Sepal Width (cm)')
plt.title('Logistic Regression Decision Boundary (2 Features)')
plt.show()

OUTPUT :

Accuracy: 1.0
Confusion Matrix:
[[15 0]
[ 0 15]]
Classification Report:
precision recall f1-score support

0 1.00 1.00 1.00 15


1 1.00 1.00 1.00 15

accuracy 1.00 30
macro avg 1.00 1.00 1.00 30
weighted avg 1.00 1.00 1.00 30

Explanation:

Data Preparation:
1. The Iris dataset is loaded from sklearn.datasets. This dataset contains 150 samples of 3
different species of Iris flowers. However, to convert it to a binary classification problem, we filter
the data to only include class 0 and class 1.
2. We separate the feature columns (sepal length, sepal width, petal length, and petal width) into the
feature matrix X and the target column (target with classes 0 and 1) into the variable y.

Model Training:

1. The logistic regression model is initialized using LogisticRegression().


2. We train the model using the fit() function on the training set (X_train, y_train).

Prediction:

1. After training, we use the model to predict the class labels for the test set ( X_test).

Evaluation:


1. Accuracy: We calculate the accuracy of the model by comparing the predicted labels ( y_pred) with
the actual test labels (y_test).
2. Confusion Matrix: We compute and print the confusion matrix, which shows the counts of true
positives, true negatives, false positives, and false negatives.
3. Classification Report: We generate a classification report which includes precision, recall, F1-score,
and support for each class.

Visualization:
1. We visualize the Confusion Matrix using a heatmap with seaborn.
2. For better understanding, we visualize the decision boundary of the logistic regression model by
plotting it using only two features (sepal length and sepal width). This shows how the logistic
regression model separates the two classes.

Key Concepts:

Logistic Function (Sigmoid): Logistic regression uses the logistic function (or sigmoid function) to
convert the output of a linear model into probabilities. The output lies between 0 and 1, and we
classify the sample based on a threshold (commonly 0.5).

 σ(z) = 1 / (1 + e^(-z))

where z = wᵀx + b is a linear combination of the input features x and weights w, and b is the bias term.


Confusion Matrix: The confusion matrix is a powerful tool for evaluating the performance of a
classification model. It shows the number of true positives (correctly predicted class 1), true
negatives (correctly predicted class 0), false positives (incorrectly predicted as class 1), and false
negatives (incorrectly predicted as class 0).

Conclusion:
This implementation of Logistic Regression demonstrates how to train and evaluate a binary classifier
using scikit-learn. We used the Iris dataset to show how to convert a multi-class classification problem
to a binary classification problem, and we visualized the results using confusion matrix heatmaps and
decision boundaries.

Logistic Regression is a simple and interpretable algorithm for binary classification problems, and scikit-
learn makes it easy to implement and evaluate.

9 ) Implementation of K-Means Clustering

K-Means Clustering Using scikit-learn


K-Means Clustering is an unsupervised learning algorithm used to partition a dataset into a set of clusters.
The algorithm assigns each data point to one of K clusters, where each cluster is represented by its centroid
(the mean of the points within the cluster).

Steps to Implement K-Means Clustering:


1. Prepare the dataset: Select or generate a dataset.
2. Choose the number of clusters (K): This is the number of clusters you want to divide the data into.
3. Apply the K-Means algorithm: Use KMeans from scikit-learn to cluster the data.
4. Visualize the clusters: Plot the data points and cluster centroids.

Python Implementation
In this example, we will use the Iris dataset and perform K-Means clustering to group the data into K
clusters. Although the dataset has three classes, we will let the algorithm find clusters without using any
class labels.

CODE :
# Importing necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

# Load Iris dataset
iris = load_iris()

# Convert the dataset into a DataFrame
df = pd.DataFrame(data=iris.data, columns=iris.feature_names)

# Features (X)
X = df

# Apply KMeans clustering to the data (we choose K=3 for three clusters)
kmeans = KMeans(n_clusters=3, random_state=42)

# Fit the model to the data
kmeans.fit(X)

# Get the cluster centroids
centroids = kmeans.cluster_centers_

# Get the cluster labels (which cluster each data point belongs to)
labels = kmeans.labels_

# Add the cluster labels to the DataFrame for easy analysis
df['Cluster'] = labels

# Print out the first few rows with the cluster labels
print(df.head())

# Visualizing the clusters in 2D (PCA for dimensionality reduction)
pca = PCA(n_components=2)
X_pca = pca.fit_transform(df[iris.feature_names])  # use only the four original features

# Plot the clusters
plt.figure(figsize=(8, 6))

# Scatter plot of the data points, colored by their cluster labels
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=labels, cmap='viridis', marker='o')
plt.title('K-Means Clusters of the Iris Data (PCA projection)')
plt.show()

The extended version below standardizes the features first and evaluates the clustering with the silhouette score:

import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score

# Step 1: Load the Iris dataset


iris = load_iris()
X = iris.data # Features (sepal length, sepal width, petal length, petal width)
y = iris.target # Target (species of the flower)

# Step 2: Standardize the features


scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Step 3: Apply K-Means Clustering
kmeans = KMeans(n_clusters=3, random_state=42)  # We assume 3 clusters (since Iris has 3 species)
kmeans.fit(X_scaled)

# Step 4: Evaluate the clustering


labels = kmeans.labels_ # Cluster labels for each sample
centroids = kmeans.cluster_centers_ # Centroids of the clusters

# Step 5: Evaluate clustering with Silhouette Score


sil_score = silhouette_score(X_scaled, labels)
print(f"Silhouette Score: {sil_score:.2f}")

# Step 6: Visualize the clusters (2D visualization for simplicity)


# We'll reduce the dataset to two dimensions (sepal length and sepal width) for easy plotting.
X_2d = X_scaled[:, :2]  # Select only the first two features for visualization

plt.figure(figsize=(8, 6))
plt.scatter(X_2d[:, 0], X_2d[:, 1], c=labels, cmap='viridis', s=50, alpha=0.6)
plt.scatter(centroids[:, 0], centroids[:, 1], c='red', marker='X', s=200,
label='Centroids')
plt.title('K-Means Clustering (2D Visualization)')
plt.xlabel('Sepal Length (scaled)')
plt.ylabel('Sepal Width (scaled)')
plt.legend()
plt.show()

# Optional: Display the cluster centers


print("\nCluster Centroids (in the scaled space):")
print(centroids)
OUTPUT :

Cluster Centroids (in the scaled space):


[[ 0.57100359 -0.37176778 0.69111943 0.66315198]
[-0.81623084 1.31895771 -1.28683379 -1.2197118 ]
[-1.32765367 -0.373138 -1.13723572 -1.11486192]]

Explanation:

Dataset:
1. We use the Iris dataset from sklearn.datasets. The Iris dataset consists of 150 data points, each
with four features: sepal length, sepal width, petal length, and petal width.
2. The dataset has three classes of Iris flowers, but we will use K-Means clustering to group the data
into clusters based on the features without any labels.

K-Means Clustering:

1. We initialize the KMeans model with n_clusters=3 because we assume there are 3 clusters in the
dataset (based on the known structure of the Iris dataset).
2. We fit the model using kmeans.fit(X) where X is the feature matrix (sepal and petal dimensions).

Cluster Centroids:

1. After fitting the model, we can access the cluster centroids (the mean position of all the points in each
cluster) using kmeans.cluster_centers_.

Cluster Labels:

1. Each data point is assigned a cluster label using kmeans.labels_. This gives us the cluster
assignment for each data point in the dataset.
PCA for Dimensionality Reduction:

1. Since the Iris dataset has 4 features, we reduce the dimensionality to 2D using Principal Component
Analysis (PCA). This allows us to visualize the clusters in 2D space.
2. The data points are then plotted using plt.scatter(), and we use different colors to represent the
different clusters.

Visualizing the Clusters:

1. The data points are displayed in a scatter plot, where each point is colored based on its assigned
cluster.
2. The centroids (red "X" markers) are plotted to show the center of each cluster.

Evaluation:

1. We print the silhouette score, which measures how similar each point is to its own cluster compared
with the other clusters. The score ranges from -1 to 1, and a higher value indicates better-separated
clusters. (Inertia, the sum of squared distances from each point to its assigned centroid, is another
common measure and is used in the Elbow Method below.)
Key Concepts:
 K-Means Algorithm:
o Initialization: Randomly select K data points as initial cluster centroids.
o Assignment Step: Assign each data point to the closest centroid.
o Update Step: Recalculate the centroids by averaging the points in each cluster.
o Repeat: Repeat the assignment and update steps until convergence (i.e., centroids don't change
significantly).
 Choosing K (Number of Clusters):

o The number of clusters (K) is a hyperparameter. It can be chosen based on prior knowledge, domain
expertise, or methods like the Elbow Method or Silhouette Score.
o The Elbow Method involves plotting the inertia (sum of squared distances to centroids) for different
values of K and looking for an "elbow" where the inertia decreases at a slower rate. The
corresponding K is often a good choice.
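An optional sketch of the Elbow Method on the standardized Iris features, plotting inertia against K:

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

X = StandardScaler().fit_transform(load_iris().data)

# Inertia (sum of squared distances to the nearest centroid) for K = 1..10
inertias = []
k_values = range(1, 11)
for k in k_values:
    km = KMeans(n_clusters=k, random_state=42, n_init=10)
    km.fit(X)
    inertias.append(km.inertia_)

plt.plot(k_values, inertias, marker='o')
plt.xlabel('Number of clusters (K)')
plt.ylabel('Inertia')
plt.title('Elbow Method for choosing K')
plt.show()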

Conclusion:
This implementation of K-Means Clustering demonstrates how to cluster data into groups based on feature
similarities using the KMeans algorithm from scikit-learn. We visualized the clusters in 2D using PCA,
which helps in understanding the clustering structure. K-Means is a powerful and simple algorithm for
unsupervised learning and can be applied to various types of data for pattern discovery and data
segmentation.
10 ) Performance analysis of Classification Algorithms on a specific dataset (Mini
Project)

Steps Involved in the Project:


1. Dataset Selection: We'll use the Breast Cancer Wisconsin dataset from scikit-learn, a well-known binary classification dataset.
2. Preprocessing the Data: Prepare the data for use with classification models (e.g., splitting into training and
testing sets).
3. Choosing Classification Algorithms: We will use several common classification algorithms, including:
1. Logistic Regression
2. K-Nearest Neighbors (KNN)
3. Decision Tree Classifier
4. Support Vector Machine (SVM)
5. Random Forest Classifier
4. Model Training: Train each of the models on the training data.
5. Model Evaluation: Evaluate each model on the test data using various metrics.
6. Performance Comparison: Compare the performance of all models based on evaluation metrics.
CODE :
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, confusion_matrix, roc_curve, auc)
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import ConfusionMatrixDisplay

# Step 1: Load the Breast Cancer dataset


data = load_breast_cancer()
X = data.data # Features
y = data.target # Target (malignant or benign)

# Step 2: Split the data into training and testing sets (70% train, 30% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
random_state=42)

# Step 3: Standardize the features (important for some models like SVM, KNN)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Step 4: Define and train different classifiers


models = {
    'Logistic Regression': LogisticRegression(),
    'KNN': KNeighborsClassifier(),
    'SVM': SVC(probability=True),
    'Decision Tree': DecisionTreeClassifier(),
    'Random Forest': RandomForestClassifier()
}

# Step 5: Train models and evaluate performance


results = {}

for model_name, model in models.items():
    model.fit(X_train_scaled, y_train)

    # Make predictions
    y_pred = model.predict(X_test_scaled)

    # Evaluate metrics
    accuracy = accuracy_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred)
    recall = recall_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred)

    # Store results
    results[model_name] = {
        'accuracy': accuracy,
        'precision': precision,
        'recall': recall,
        'f1_score': f1,
        'confusion_matrix': confusion_matrix(y_test, y_pred),
        'model': model
    }

# Step 6: Display Results


# Print the performance metrics for each model
print(f"{'Model':<20} {'Accuracy':<10} {'Precision':<10} {'Recall':<10} {'F1 Score':<10}")
print("-" * 60)
for model_name, metrics in results.items():
    print(f"{model_name:<20} {metrics['accuracy']*100:<10.2f} "
          f"{metrics['precision']*100:<10.2f} {metrics['recall']*100:<10.2f} "
          f"{metrics['f1_score']*100:<10.2f}")

# Step 7: Visualize Confusion Matrix and ROC Curve


fig, axes = plt.subplots(2, 3, figsize=(18, 10))
# Confusion Matrix and ROC for each model
for idx, (model_name, metrics) in enumerate(results.items()):
    ax = axes[idx // 3, idx % 3]

    # Confusion Matrix
    cm = metrics['confusion_matrix']
    disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=data.target_names)
    disp.plot(ax=ax, cmap='Blues')
    ax.set_title(f"{model_name} - Confusion Matrix")

    # ROC Curve
    fpr, tpr, thresholds = roc_curve(y_test, metrics['model'].predict_proba(X_test_scaled)[:, 1])
    roc_auc = auc(fpr, tpr)
    ax.plot(fpr, tpr, color='blue', lw=2, label=f'AUC = {roc_auc:.2f}')
    ax.plot([0, 1], [0, 1], color='gray', lw=2, linestyle='--')
    ax.set_xlabel('False Positive Rate')
    ax.set_ylabel('True Positive Rate')
    ax.set_title(f"{model_name} - ROC Curve")
    ax.legend(loc='lower right')

plt.tight_layout()
plt.show()

OUTPUT :

Model Accuracy Precision Recall F1 Score


------------------------------------------------------------
Logistic Regression 98.25 99.07 98.15 98.60
KNN 95.91 96.33 97.22 96.77
SVM 97.66 98.15 98.15 98.15
Decision Tree 92.98 95.28 93.52 94.39
Random Forest 97.08 96.40 99.07 97.72
Explanation of the Code:

Loading the Dataset:

1. The Breast Cancer Wisconsin dataset is loaded using load_breast_cancer() from sklearn.datasets. The 30
numeric features are stored in X, and the binary target (malignant or benign) is stored in y.

Splitting the Data:

1. We split the dataset into training and testing sets using train_test_split(). 70% of the data is
used for training, and 30% is used for testing.
Initializing Models:

1. We initialize five popular classification models:

1. Logistic Regression
2. K-Nearest Neighbors (KNN)
3. Decision Tree Classifier
4. Support Vector Machine (SVM)
5. Random Forest Classifier

Training and Evaluation:

1. For each model, we train it using the fit() method on the training set and predict the test set using
the predict() method.
2. We calculate performance metrics such as accuracy, precision, recall, and F1-score using functions
from sklearn.metrics. The metrics are stored in dictionaries for easy comparison.
3. We also store each model's confusion matrix so that it can be visualized later to better understand the
model's performance.

Performance Comparison:

1. We print a formatted table of accuracy, precision, recall, and F1-score for each model.
2. Confusion matrices and ROC curves are plotted to compare the models visually; a bar chart of the metrics
can also be added, as sketched below.
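Such a bar chart is not part of the code above, but it can be built from the results dictionary produced in Step 5; a brief sketch:

import pandas as pd
import matplotlib.pyplot as plt

# 'results' is the dictionary built in Step 5 above
metrics_df = pd.DataFrame({
    name: {'Accuracy': m['accuracy'], 'Precision': m['precision'],
           'Recall': m['recall'], 'F1 Score': m['f1_score']}
    for name, m in results.items()
}).T

metrics_df.plot(kind='bar', figsize=(10, 6))
plt.title('Comparison of Classification Algorithms')
plt.ylabel('Score')
plt.ylim(0.9, 1.0)   # zoom in, since all models score above 0.9
plt.xticks(rotation=30)
plt.tight_layout()
plt.show()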