
MACHINE LEARNING LAB

List of Experiments
1. Write a Python program to compute Central Tendency Measures: Mean, Median, Mode
and Measures of Dispersion: Variance, Standard Deviation

2. Study of Python Basic Libraries such as Statistics, Math, Numpy and Scipy

3. Study of Python Libraries for ML application such as Pandas and Matplotlib

4. Write a Python program to implement Simple Linear Regression

5. Implementation of Multiple Linear Regression for House Price Prediction using sklearn

6. Implementation of Decision tree using sklearn and its parameter tuning

7. Implementation of KNN using sklearn

8. Implementation of Logistic Regression using sklearn

9. Implementation of K-Means Clustering

10. Performance analysis of Classification Algorithms on a specific dataset (Mini Project)


1. Write a Python program to compute Central Tendency Measures: Mean, Median, Mode
and Measures of Dispersion: Variance, Standard Deviation

Code:
import statistics

# Sample data
data = [4, 8, 6, 5, 3, 2, 8, 9, 2, 5]

# Central Tendency Measures
mean = statistics.mean(data)
median = statistics.median(data)
mode = statistics.mode(data)

# Measures of Dispersion
variance = statistics.variance(data)
std_deviation = statistics.stdev(data)

print(f"Mean: {mean}")
print(f"Median: {median}")
print(f"Mode: {mode}")
print(f"Variance: {variance}")
print(f"Standard Deviation: {std_deviation}")

output:
Alternative Code (using a function):
import statistics

# Function to compute central tendency and dispersion measures
def compute_statistics(data):
    # Central Tendency Measures
    mean = statistics.mean(data)
    median = statistics.median(data)
    mode = statistics.mode(data) if len(set(data)) != len(data) else "No mode"

    # Dispersion Measures
    variance = statistics.variance(data)
    std_deviation = statistics.stdev(data)

    return mean, median, mode, variance, std_deviation

# Sample data (You can input your own data)
data = [10, 20, 20, 30, 40, 40, 40, 50, 60, 70]

# Calculate the statistics
mean, median, mode, variance, std_deviation = compute_statistics(data)

# Output the results
print(f"Data: {data}")
print(f"Mean: {mean}")
print(f"Median: {median}")
print(f"Mode: {mode}")
print(f"Variance: {variance:.2f}")
print(f"Standard Deviation: {std_deviation:.2f}")
output:
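As an optional cross-check (not in the original listing), the same measures can be reproduced with numpy. Note that ddof=1 is needed so that numpy's variance and standard deviation match the sample values returned by the statistics module.

import numpy as np

# Same sample data as above
data = [10, 20, 20, 30, 40, 40, 40, 50, 60, 70]
arr = np.array(data)

# ddof=1 gives the sample variance/std. dev., matching statistics.variance/stdev
print(f"Mean (numpy): {np.mean(arr)}")
print(f"Variance (numpy, ddof=1): {np.var(arr, ddof=1):.2f}")
print(f"Standard Deviation (numpy, ddof=1): {np.std(arr, ddof=1):.2f}")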
2. Study of Python Basic Libraries such as Statistics, Math, Numpy and Scipy
Code:

1. statistics Module

The statistics module provides functions to perform statistical operations. It is part of
Python's standard library.

Key Functions:

• mean(data): Returns the arithmetic mean of the data.
• median(data): Returns the median of the data.
• mode(data): Returns the most common data point.
• variance(data): Returns the variance of the data.
• stdev(data): Returns the standard deviation of the data.

Example:

import statistics
data = [10, 20, 20, 30, 40, 50]
mean = statistics.mean(data)
median = statistics.median(data)
mode = statistics.mode(data)
variance = statistics.variance(data)
stdev = statistics.stdev(data)
print(f"Mean: {mean}")
print(f"Median: {median}")
print(f"Mode: {mode}")
print(f"Variance: {variance}")
print(f"Standard Deviation: {stdev}")

output:
2. math Module

The math module provides mathematical functions. It is also part of Python's standard library.

Key Functions:

• sqrt(x): Returns the square root of x.
• factorial(x): Returns the factorial of x.
• log(x, base): Returns the logarithm of x to the given base.
• sin(x), cos(x), tan(x): Trigonometric functions.
• pi: The constant π.

Example:

import math

number = 16
sqrt_value = math.sqrt(number)
factorial_value = math.factorial(5)
log_value = math.log(100, 10)
pi_value = math.pi

print(f"Square Root: {sqrt_value}")


print(f"Factorial: {factorial_value}")
print(f"Logarithm (base 10): {log_value}")
print(f"Value of pi: {pi_value}")

output:
3. numpy Library

numpy is a powerful library for numerical computations, particularly for handling large arrays
and matrices. It is not included in the standard library and needs to be installed separately.

Key Functions:

• numpy.array(): Creates a numpy array.
• numpy.mean(): Computes the mean of array elements.
• numpy.median(): Computes the median of array elements.
• numpy.std(): Computes the standard deviation.
• numpy.var(): Computes the variance.
• numpy.linalg.inv(): Computes the inverse of a matrix.

Summary

• statistics: Basic statistics functions for small datasets.
• math: Fundamental mathematical functions.
• numpy: Advanced numerical operations and array handling.
• scipy: Additional scientific and technical computing tools.

Example:
import numpy as np
data = np.array([10, 20, 30, 40, 50])
mean = np.mean(data)
median = np.median(data)
std_dev = np.std(data)
variance = np.var(data)

print(f"Mean: {mean}")
print(f"Median: {median}")
print(f"Standard Deviation: {std_dev}")
print(f"Variance: {variance}")
output:
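numpy.linalg.inv() is listed among the key functions above but not shown in the example; a minimal sketch (using an arbitrary 2x2 matrix) is:

import numpy as np

A = np.array([[4.0, 7.0],
              [2.0, 6.0]])

# Inverse of the matrix
A_inv = np.linalg.inv(A)
print("Inverse of A:")
print(A_inv)

# A @ A_inv should be (approximately) the identity matrix
print(np.round(A @ A_inv, 6))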
4. scipy Library

scipy builds on numpy and provides additional functionality for scientific and technical
computing.

Key Functions:

• scipy.stats.describe(): Provides descriptive statistics.
• scipy.stats.norm: Functions related to the normal distribution, such as PDF and CDF.
• scipy.optimize.minimize(): Optimization routines.
• scipy.integrate.quad(): Numerical integration.

Example:
from scipy import stats

data = [10, 20, 30, 40, 50]

# Descriptive statistics
desc = stats.describe(data)

print(f"Mean: {desc.mean}")
print(f"Variance: {desc.variance}")

print(f"Minimum: {desc.minmax[0]}")
print(f"Maximum: {desc.minmax[1]}")

# Normal distribution example: norm.fit returns the fitted mean and standard deviation
mean, std = stats.norm.fit(data)
print(f"Fitted Mean: {mean}")
print(f"Fitted Standard Deviation: {std}")

output:
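scipy.optimize.minimize() and scipy.integrate.quad() from the key functions above can be demonstrated with a small sketch (the integrand and objective function below are arbitrary examples):

from scipy import integrate, optimize

# Numerical integration: integral of x^2 over [0, 3] (exact answer is 9)
area, error = integrate.quad(lambda x: x ** 2, 0, 3)
print(f"Integral of x^2 over [0, 3]: {area:.4f} (estimated error {error:.1e})")

# Optimization: the minimum of (x - 2)^2 is at x = 2
result = optimize.minimize(lambda x: (x[0] - 2) ** 2, x0=[0.0])
print(f"Minimum found at x = {result.x[0]:.4f}")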
3. Study of Python Libraries for ML application such as Pandas and Matplotlib.

Code:
1. pandas Library

First, import the pandas library and create a sample DataFrame.

import pandas as pd
# Manually create data
data = {
'Date': ['2024-01-01', '2024-01-02', '2024-01-03', '2024-01-04', '2024-01-05'],
'Product': ['Widget A', 'Widget B', 'Widget C', 'Widget A', 'Widget B'],
'Category': ['Electronics', 'Electronics', 'Home Goods', 'Home Goods', 'Electronics'],
'Sales': [1500, 2000, 1200, 1800, 2100],
'Profit': [300, 500, 200, 400, 600]
}

# Create DataFrame
df = pd.DataFrame(data)

# Display the DataFrame
print("Initial DataFrame:")
print(df)

output:
Exploring the Data

Examine the basic structure and statistics of the DataFrame.

# Display basic information about the DataFrame
print("\nDataFrame Info:")
df.info()
# Display descriptive statistics
print("\nDescriptive Statistics:")
print(df.describe())

# Display the first few rows
print("\nFirst Few Rows:")
print(df.head())

output:
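A typical next step when preparing data for ML is aggregation. As an optional example (assuming the df created above is still in scope), total Sales and Profit per Category can be computed with groupby():

# Aggregate the sample DataFrame by Category
category_summary = df.groupby('Category')[['Sales', 'Profit']].sum()
print("\nSales and Profit per Category:")
print(category_summary)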
2. matplotlib Library

matplotlib is a comprehensive library for creating static, animated, and interactive
visualizations in Python.

Key Features:

• Basic Plotting Functions:
o plt.plot(): Create line plots.
o plt.scatter(): Create scatter plots.
o plt.bar(): Create bar charts.
o plt.hist(): Create histograms.
o plt.show(): Display the plot.
• Customization:
o plt.title(): Add a title to the plot.
o plt.xlabel(), plt.ylabel(): Label the axes.
o plt.legend(): Add a legend.
o plt.grid(): Add a grid.

Example:

import matplotlib.pyplot as plt

# Sample data
x = [1, 2, 3, 4, 5]
y = [2, 3, 5, 7, 11]

# Line plot
plt.plot(x, y, label='Line Plot', marker='o')
plt.title('Line Plot Example')
plt.xlabel('X Axis')
plt.ylabel('Y Axis')
plt.legend()
plt.grid(True)
plt.show()

# Scatter plot
plt.scatter(x, y, color='red', label='Scatter Plot')
plt.title('Scatter Plot Example')
plt.xlabel('X Axis')
plt.ylabel('Y Axis')
plt.legend()
plt.show()

# Histogram
data = [1, 2, 2, 3, 4, 4, 4, 5, 6]
plt.hist(data, bins=5, edgecolor='black')
plt.title('Histogram Example')
plt.xlabel('Value')
plt.ylabel('Frequency')
plt.show()

output:
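plt.bar() is listed among the plotting functions above but not shown; a short sketch with made-up data:

import matplotlib.pyplot as plt

# Bar chart
categories = ['A', 'B', 'C', 'D']
values = [23, 45, 12, 36]

plt.bar(categories, values, color='green', edgecolor='black')
plt.title('Bar Chart Example')
plt.xlabel('Category')
plt.ylabel('Value')
plt.show()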
4. Write a Python program to implement Simple Linear Regression
Code:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Create sample data
data = {
    'x': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'y': [1.5, 3.5, 3.7, 5.5, 6.5, 7.8, 8.7, 10.5, 11.3, 12.8]
}

# Load data into a DataFrame
df = pd.DataFrame(data)

# Extract x and y values
x = df['x'].values
y = df['y'].values
n = len(x)

# Calculate the necessary sums
sum_x = np.sum(x)
sum_y = np.sum(y)
sum_xy = np.sum(x * y)
sum_x_squared = np.sum(x ** 2)

# Calculate coefficients
beta_1 = (n * sum_xy - sum_x * sum_y) / (n * sum_x_squared - sum_x ** 2)
beta_0 = (sum_y - beta_1 * sum_x) / n

print(f"Coefficients: beta_0 = {beta_0}, beta_1 = {beta_1}")

# Predict y values
y_pred = beta_0 + beta_1 * x

# Display predictions
print("Predicted y values:")
print(y_pred)

# Plot the data points
plt.scatter(x, y, color='blue', label='Data points')

# Plot the regression line
plt.plot(x, y_pred, color='red', label='Regression line')

# Add labels and title
plt.xlabel('x')
plt.ylabel('y')
plt.title('Simple Linear Regression')
plt.legend()

# Show the plot
plt.show()

output:
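As an optional sanity check (not part of the original program), numpy's polyfit should return the same slope and intercept as the closed-form calculation above. This assumes the arrays x and y defined above are still in scope:

import numpy as np

# polyfit with deg=1 returns [slope, intercept]
beta_1_np, beta_0_np = np.polyfit(x, y, deg=1)
print(f"polyfit check: beta_0 = {beta_0_np:.4f}, beta_1 = {beta_1_np:.4f}")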
5. Implementation of Multiple Linear Regression for House Price Prediction
using sklearn.

Code:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Create a sample dataset
data = {
    'SquareFootage': [1500, 2000, 2500, 3000, 3500, 4000, 4500],
    'NumBedrooms': [3, 4, 3, 5, 4, 5, 6],
    'Age': [10, 15, 20, 5, 8, 12, 4],
    'Price': [300000, 400000, 500000, 600000, 700000, 800000, 900000]
}

# Load data into a DataFrame
df = pd.DataFrame(data)

# Define features and target variable
X = df[['SquareFootage', 'NumBedrooms', 'Age']]
y = df['Price']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create and train the model
model = LinearRegression()
model.fit(X_train, y_train)

# Get the coefficients and intercept
coefficients = model.coef_
intercept = model.intercept_

print(f"Coefficients: {coefficients}")
print(f"Intercept: {intercept}")

# Make predictions
y_pred = model.predict(X_test)

# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f"Mean Squared Error: {mse}")
print(f"R-squared: {r2}")

# Plot actual vs. predicted values
plt.scatter(y_test, y_pred, color='blue')
plt.plot([min(y_test), max(y_test)], [min(y_pred), max(y_pred)], color='red', linewidth=2)
plt.xlabel('Actual Prices')
plt.ylabel('Predicted Prices')
plt.title('Actual vs. Predicted Prices')
plt.show()
output:
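Once the model is trained, it can also be used to price a new, hypothetical house. This optional snippet assumes the fitted model and pandas import from the listing above (the house values are made up):

# Predict the price of a new house: 2800 sq. ft., 4 bedrooms, 10 years old
new_house = pd.DataFrame({'SquareFootage': [2800], 'NumBedrooms': [4], 'Age': [10]})
predicted_price = model.predict(new_house)
print(f"Predicted price: {predicted_price[0]:.2f}")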
6. Implementation of Decision tree using sklearn and its parameter
tuning.

Code:
# Import necessary libraries
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import accuracy_score

# Load the Iris dataset
data = load_iris()
X = data.data    # Features
y = data.target  # Target labels

# Split the data into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create a basic Decision Tree Classifier
dt = DecisionTreeClassifier(random_state=42)

# Train the model on the training data
dt.fit(X_train, y_train)

# Make predictions on the test data
y_pred = dt.predict(X_test)

# Calculate the accuracy before tuning
initial_accuracy = accuracy_score(y_test, y_pred)
print(f"Initial Accuracy without tuning: {initial_accuracy:.2f}")

# Parameter tuning using GridSearchCV
# ('auto' is not used for max_features, as recent scikit-learn versions no longer accept it)
param_grid = {
    'criterion': ['gini', 'entropy'],       # Split criterion
    'max_depth': [None, 10, 20, 30, 40],    # Maximum depth of the tree
    'min_samples_split': [2, 5, 10],        # Minimum number of samples required to split an internal node
    'min_samples_leaf': [1, 2, 4],          # Minimum number of samples required to be at a leaf node
    'max_features': [None, 'sqrt', 'log2']  # Number of features to consider when looking for the best split
}

# Perform Grid Search to find the best hyperparameters
grid_search = GridSearchCV(estimator=dt, param_grid=param_grid, cv=5, n_jobs=-1,
                           verbose=1)
grid_search.fit(X_train, y_train)

# Output the best hyperparameters found by Grid Search
best_params = grid_search.best_params_
print(f"Best Hyperparameters: {best_params}")

# Use the best estimator to predict on the test set
best_dt = grid_search.best_estimator_
y_pred_best = best_dt.predict(X_test)

# Calculate the accuracy after tuning
tuned_accuracy = accuracy_score(y_test, y_pred_best)
print(f"Tuned Accuracy: {tuned_accuracy:.2f}")

What the program does:

1. Data Loading: It loads the Iris dataset, a commonly used dataset for classification tasks.
2. Model Creation: It creates a DecisionTreeClassifier and trains it using the training data.
3. Grid Search for Hyperparameter Tuning: It defines a parameter grid and performs
GridSearchCV to find the best hyperparameters for the decision tree.
4. Accuracy Calculation: It calculates and prints the accuracy before and after hyperparameter
tuning.
output:
Explanation:

• The initial accuracy is the performance of the decision tree model before any hyperparameter
tuning.
• The best hyperparameters are determined by GridSearchCV through cross-validation.
• The tuned accuracy shows the performance after applying the best combination of
hyperparameters.
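As an optional extension (not part of the original listing), the tuned tree found by GridSearchCV can be visualized with sklearn's plot_tree, assuming best_dt and data from the code above are available:

import matplotlib.pyplot as plt
from sklearn.tree import plot_tree

# Draw the tuned decision tree with feature and class names from the Iris dataset
plt.figure(figsize=(12, 8))
plot_tree(best_dt, feature_names=data.feature_names,
          class_names=data.target_names, filled=True)
plt.show()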
7. Implementation of KNN using sklearn
Code:
# Import necessary libraries
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Load the Iris dataset
data = load_iris()
X = data.data    # Features
y = data.target  # Target labels
# Split the data into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Create a K-Nearest Neighbors classifier
knn = KNeighborsClassifier(n_neighbors=3)
# Train the model on the training data
knn.fit(X_train, y_train)
# Make predictions on the test data
y_pred = knn.predict(X_test)
# Calculate the accuracy of the model
accuracy = accuracy_score(y_test, y_pred)
# Output the results
print(f"Accuracy of KNN model: {accuracy:.2f}")

Explanation:

1. Data Loading: The Iris dataset is loaded using load_iris(), which is a well-known
dataset containing 150 samples from 3 species of iris flowers.
2. Model Creation: The KNeighborsClassifier is initialized with n_neighbors=3, meaning
it will consider the 3 nearest neighbors to classify a data point.
3. Training: The model is trained using knn.fit(), which uses the training data (X_train,
y_train).
4. Prediction: The model predicts on the test data (X_test).
5. Accuracy: The accuracy score is computed using accuracy_score() to evaluate the
performance.
output:

Explanation of Output:

• The accuracy indicates the percentage of correctly classified instances from the
test set.
• A result of 1.00 (or 100%) means that the model perfectly classified all the test
samples. This can happen when the dataset is small and the classes are well
separated, as they largely are in the Iris dataset.
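An optional follow-up experiment (assuming the train/test split and imports from the code above) is to see how accuracy changes with the number of neighbours k:

# Compare accuracy for several values of k
for k in [1, 3, 5, 7, 9]:
    knn_k = KNeighborsClassifier(n_neighbors=k)
    knn_k.fit(X_train, y_train)
    acc_k = accuracy_score(y_test, knn_k.predict(X_test))
    print(f"k = {k}: accuracy = {acc_k:.2f}")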
8. Implementation of Logistic Regression using sklearn
Code:
# Import necessary libraries
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Load the Iris dataset
data = load_iris()
X = data.data    # Features
y = data.target  # Target labels
# Split the data into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Create a Logistic Regression classifier
logreg = LogisticRegression(max_iter=200)
# Train the model on the training data
logreg.fit(X_train, y_train)
# Make predictions on the test data
y_pred = logreg.predict(X_test)
# Calculate the accuracy of the model
accuracy = accuracy_score(y_test, y_pred)
# Output the results
print(f"Accuracy of Logistic Regression model: {accuracy:.2f}")

Explanation:

1. Data Loading: The Iris dataset is loaded using load_iris() from sklearn.datasets. This
dataset contains 150 samples from three species of iris flowers, and we use it to
predict the species based on the input features (sepal length, sepal width, petal
length, and petal width).
2. Model Creation: We create a LogisticRegression model. The parameter max_iter=200
is used to ensure the algorithm converges within 200 iterations.
3. Training: The model is trained using the training data (X_train, y_train) via
logreg.fit().
4. Prediction: After training, the model predicts the species for the test data (X_test).
5. Accuracy Calculation: The accuracy of the model is computed by comparing the
predicted labels (y_pred) with the true labels (y_test).
output:

Explanation of Output:

• The accuracy represents the proportion of correctly predicted samples out of the
total test samples.
• In this example, a perfect accuracy of 1.00 means the model was able to correctly
classify all the test samples. This result might happen if the data is relatively
simple and well-separable by logistic regression.
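Logistic regression also provides class probabilities through predict_proba(). As an optional sketch (assuming logreg, X_test and data from the code above):

# Probability of each class for the first test sample
probs = logreg.predict_proba(X_test[:1])
print(f"Class probabilities: {probs[0]}")
print(f"Predicted class: {data.target_names[logreg.predict(X_test[:1])[0]]}")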
9. Implementation of K-Means Clustering
Code:
# Import necessary libraries
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

# Load the Iris dataset
data = load_iris()
X = data.data    # Features
y = data.target  # Target labels

# Initialize the KMeans model with 3 clusters (as we know there are 3 species)
kmeans = KMeans(n_clusters=3, random_state=42)

# Fit the model on the dataset
kmeans.fit(X)

# Get the predicted cluster labels
y_pred = kmeans.labels_

# Get the centroids of the clusters
centroids = kmeans.cluster_centers_

# Calculate the inertia (within-cluster sum of squares)
inertia = kmeans.inertia_

# Output the results
print(f"Cluster Centers (Centroids): \n{centroids}")
print(f"Inertia: {inertia:.2f}")

# Visualize the clusters using the first two features (sepal length and sepal width)
plt.figure(figsize=(8,6))
sns.scatterplot(x=X[:, 0], y=X[:, 1], hue=y_pred, palette="Set1", s=100)
plt.scatter(centroids[:, 0], centroids[:, 1], marker="X", color="black", s=200, label="Centroids")
plt.title('K-Means Clustering (Iris Dataset)', fontsize=15)
plt.xlabel('Sepal Length')
plt.ylabel('Sepal Width')
plt.legend()
plt.show()

Explanation:

1. Data Loading: The Iris dataset is loaded using load_iris(), which contains 150
samples with 4 features (sepal length, sepal width, petal length, and petal width).
2. Model Creation: We initialize a KMeans model with n_clusters=3 (since we expect
3 clusters, corresponding to 3 species of Iris).
3. Model Training: The fit() method is used to fit the KMeans model on the data.
4. Cluster Labels: We get the predicted labels (y_pred) that indicate which cluster
each data point belongs to.
5. Centroids: The cluster_centers_ attribute gives the coordinates of the cluster
centroids.
6. Inertia: The inertia_ represents the sum of squared distances of samples to their
closest cluster center, which is a measure of the model's "tightness" or how well
the samples fit their clusters.
7. Visualization: We plot the clusters based on the first two features (sepal length
and sepal width) and mark the centroids on the scatter plot.

output:
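In practice the number of clusters is often chosen with the "elbow method": inertia is computed for several values of k and plotted, and the point where the curve flattens suggests a suitable k. An optional sketch, assuming X and the imports from the code above:

# Elbow method: plot inertia for k = 1..8
inertias = []
k_values = range(1, 9)
for k in k_values:
    km = KMeans(n_clusters=k, random_state=42, n_init=10)
    km.fit(X)
    inertias.append(km.inertia_)

plt.plot(list(k_values), inertias, marker='o')
plt.title('Elbow Method for Choosing k')
plt.xlabel('Number of clusters (k)')
plt.ylabel('Inertia')
plt.show()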

10. Performance analysis of Classification Algorithms on a specific dataset (Mini Project)
Code:
# Import necessary libraries
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Load the Iris dataset
data = load_iris()
X = data.data    # Features
y = data.target  # Target labels

# Split the data into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize the classifiers
models = {
    "Logistic Regression": LogisticRegression(max_iter=200),
    "K-Nearest Neighbors": KNeighborsClassifier(),
    "Decision Tree": DecisionTreeClassifier(),
    "Support Vector Machine": SVC(),
    "Random Forest": RandomForestClassifier()
}

# Dictionary to store performance metrics
performance_metrics = {}

# Train each model, make predictions, and evaluate
for model_name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred, average='weighted')
    recall = recall_score(y_test, y_pred, average='weighted')
    f1 = f1_score(y_test, y_pred, average='weighted')

    performance_metrics[model_name] = {
        "Accuracy": accuracy,
        "Precision": precision,
        "Recall": recall,
        "F1 Score": f1
    }

# Display the performance metrics for each classifier
import pandas as pd
performance_df = pd.DataFrame(performance_metrics).T
print(performance_df)

Explanation:

1. Dataset Loading: The Iris dataset is loaded using load_iris() from sklearn.datasets,
and it's split into features (X) and target labels (y).
2. Model Initialization: We initialize five classification models:
o Logistic Regression
o K-Nearest Neighbors (KNN)
o Decision Tree Classifier
o Support Vector Machine (SVM)
o Random Forest Classifier
3. Model Training & Prediction: Each model is trained using the training data
(X_train, y_train), and predictions are made on the test set (X_test).
4. Evaluation Metrics: The models are evaluated using:
o Accuracy: The proportion of correct predictions.
o Precision: The proportion of positive predictions that are actually positive.
o Recall: The proportion of actual positive instances that were predicted
correctly.
o F1-Score: The harmonic mean of precision and recall.
5. Performance Output: The metrics are stored in a dictionary and converted to a
pandas DataFrame for easy visualization.
Output:

Explanation of Output:

• Accuracy: This metric indicates how many of the predictions made by the model
were correct.
• Precision: Precision is a measure of how many of the predicted positive classes
were actually positive. A higher precision means fewer false positives.
• Recall: Recall measures how many of the actual positive classes were correctly
predicted. A higher recall means fewer false negatives.
• F1-Score: The F1-score provides a balance between precision and recall. It is
useful when you need to balance both false positives and false negatives.
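For a per-class breakdown of these metrics, a confusion matrix and classification report can be printed for any one of the models, for example the Random Forest. This optional snippet assumes the variables from the mini-project code above:

from sklearn.metrics import classification_report, confusion_matrix

# Re-use the Random Forest from the models dictionary
rf = models["Random Forest"]
rf.fit(X_train, y_train)
y_pred_rf = rf.predict(X_test)

print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred_rf))
print("\nClassification Report:")
print(classification_report(y_test, y_pred_rf, target_names=data.target_names))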
