ML Lab Manual
List of Experiments
1. Write a Python program to compute Central Tendency Measures (Mean, Median, Mode) and Measures of Dispersion (Variance, Standard Deviation)
2. Study of Python Basic Libraries such as Statistics, Math, Numpy and Scipy
3. Study of Python Libraries for ML application such as Pandas and Matplotlib
4. Write a Python program to implement Simple Linear Regression
5. Implementation of Multiple Linear Regression for House Price Prediction using sklearn
6. Implementation of Decision tree using sklearn and its parameter tuning
7. Implementation of KNN using sklearn
8. Implementation of Logistic Regression using sklearn
9. Implementation of K-Means Clustering
10. Comparison of classification models using sklearn
1. Write a Python program to compute Central Tendency Measures (Mean, Median, Mode) and Measures of Dispersion (Variance, Standard Deviation)
Code:
import statistics
# Sample data
data = [4, 8, 6, 5, 3, 2, 8, 9, 2, 5]
# Measures of Central Tendency
mean = statistics.mean(data)
median = statistics.median(data)
mode = statistics.mode(data)
# Measures of Dispersion
variance = statistics.variance(data)
std_deviation = statistics.stdev(data)
print(f"Mean: {mean}")
print(f"Median: {median}")
print(f"Mode: {mode}")
print(f"Variance: {variance}")
print(f"Standard Deviation: {std_deviation}")
output:
2nd Code:
import statistics
data = [4, 8, 6, 5, 3, 2, 8, 9, 2, 5]
# Dispersion Measures
variance = statistics.variance(data)
std_deviation = statistics.stdev(data)
print(f"Variance: {variance}")
print(f"Standard Deviation: {std_deviation}")
2. Study of Python Basic Libraries such as Statistics, Math, Numpy and Scipy
1. statistics Module
The statistics module provides functions for computing basic statistical measures of numeric data. It is part of Python's standard library.
Key Functions: mean(), median(), mode(), variance(), stdev()
Example:
import statistics
data = [10, 20, 20, 30, 40, 50]
mean = statistics.mean(data)
median = statistics.median(data)
mode = statistics.mode(data)
variance = statistics.variance(data)
stdev = statistics.stdev(data)
print(f"Mean: {mean}")
print(f"Median: {median}")
print(f"Mode: {mode}")
print(f"Variance: {variance}")
print(f"Standard Deviation: {stdev}")
output:
2. math Module
The math module provides mathematical functions. It is also part of Python's standard library.
Key Functions: sqrt(), factorial(), log(), and constants such as pi and e
Example:
import math
number = 16
sqrt_value = math.sqrt(number)
factorial_value = math.factorial(5)
log_value = math.log(100, 10)
pi_value = math.pi
print(f"Square Root of {number}: {sqrt_value}")
print(f"Factorial of 5: {factorial_value}")
print(f"Log base 10 of 100: {log_value}")
print(f"Value of pi: {pi_value}")
output:
3. numpy Library
numpy is a powerful library for numerical computations, particularly for handling large arrays
and matrices. It is not included in the standard library and needs to be installed separately.
Key Functions: np.array(), np.mean(), np.median(), np.std(), np.var()
Example:
import numpy as np
data = np.array([10, 20, 30, 40, 50])
mean = np.mean(data)
median = np.median(data)
std_dev = np.std(data)
variance = np.var(data)
print(f"Mean: {mean}")
print(f"Median: {median}")
print(f"Standard Deviation: {std_dev}")
print(f"Variance: {variance}")
output:
4. scipy Library
scipy builds on numpy and provides additional functionality for scientific and technical computing.
Key Functions: stats.describe(), stats.mode(), stats.zscore()
Example:
from scipy import stats
# Sample data
data = [10, 20, 30, 40, 50]
# Descriptive statistics
desc = stats.describe(data)
print(f"Mean: {desc.mean}")
print(f"Variance: {desc.variance}")
print(f"Minimum: {desc.minmax[0]}")
print(f"Maximum: {desc.minmax[1]}")
output:
3. Study of Python Libraries for ML application such as Pandas and Matplotlib.
Code:
1. pandas Library
pandas provides the DataFrame structure for creating, manipulating, and analyzing tabular data.
import pandas as pd
# Manually create data
data = {
'Date': ['2024-01-01', '2024-01-02', '2024-01-03', '2024-01-04', '2024-01-05'],
'Product': ['Widget A', 'Widget B', 'Widget C', 'Widget A', 'Widget B'],
'Category': ['Electronics', 'Electronics', 'Home Goods', 'Home Goods', 'Electronics'],
'Sales': [1500, 2000, 1200, 1800, 2100],
'Profit': [300, 500, 200, 400, 600]
}
# Create DataFrame
df = pd.DataFrame(data)
output:
Exploring the Data
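A minimal sketch of the exploration step, using standard DataFrame inspection methods (head(), info(), and describe() are common starting points):
# First rows of the DataFrame
print(df.head())
# Column types and non-null counts
df.info()
# Summary statistics for the numeric columns
print(df.describe())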
output:
2. matplotlib Library
Key Features: line plots, scatter plots, histograms, and plot customization (titles, axis labels, legends, grids)
Example:
import matplotlib.pyplot as plt
# Sample data
x = [1, 2, 3, 4, 5]
y = [2, 3, 5, 7, 11]
# Line plot
plt.plot(x, y, label='Line Plot', marker='o')
plt.title('Line Plot Example')
plt.xlabel('X Axis')
plt.ylabel('Y Axis')
plt.legend()
plt.grid(True)
plt.show()
# Scatter plot
plt.scatter(x, y, color='red', label='Scatter Plot')
plt.title('Scatter Plot Example')
plt.xlabel('X Axis')
plt.ylabel('Y Axis')
plt.legend()
plt.show()
# Histogram
data = [1, 2, 2, 3, 4, 4, 4, 5, 6]
plt.hist(data, bins=5, edgecolor='black')
plt.title('Histogram Example')
plt.xlabel('Value')
plt.ylabel('Frequency')
plt.show()
output:
4. Write a Python program to implement Simple Linear Regression
Code:
import numpy as np
# Sample data (illustrative values)
x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 4, 5, 4, 5])
n = len(x)
# Summations needed for the least-squares formulas
sum_x = np.sum(x)
sum_y = np.sum(y)
sum_xy = np.sum(x * y)
sum_x_squared = np.sum(x ** 2)
# Calculate coefficients
beta_1 = (n * sum_xy - sum_x * sum_y) / (n * sum_x_squared - sum_x ** 2)
beta_0 = (sum_y - beta_1 * sum_x) / n
# Predict y values
y_pred = beta_0 + beta_1 * x
# Display predictions
print("Predicted y values:")
print(y_pred)
output:
5. Implementation of Multiple Linear Regression for House Price Prediction
using sklearn.
Code:
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
# Sample house-price data (feature names and values are illustrative)
data = {
    'Area': [1500, 1800, 2400, 3000, 3500, 4000, 4500, 5000],
    'Bedrooms': [3, 4, 3, 5, 4, 5, 6, 6],
    'Age': [10, 15, 20, 8, 12, 5, 3, 1],
    'Price': [300000, 350000, 400000, 500000, 550000, 600000, 650000, 700000]
}
df = pd.DataFrame(data)
# Features and target
X = df[['Area', 'Bedrooms', 'Age']]
y = df['Price']
# Split the data into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = LinearRegression()
model.fit(X_train, y_train)
coefficients = model.coef_
intercept = model.intercept_
print(f"Coefficients: {coefficients}")
print(f"Intercept: {intercept}")
# Make predictions
y_pred = model.predict(X_test)
r2 = r2_score(y_test, y_pred)
print(f"R-squared: {r2}")
# Plot actual vs. predicted prices
plt.scatter(y_test, y_pred)
plt.xlabel('Actual Prices')
plt.ylabel('Predicted Prices')
plt.show()
output:
6. Implementation of Decision tree using sklearn and its parameter
tuning.
Code:
# Import necessary libraries
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import accuracy_score
# Load the Iris dataset
X, y = load_iris(return_X_y=True)
# Split the data into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
Explanation:
1. Data Loading: It loads the Iris dataset, a commonly used dataset for classification tasks.
2. Model Creation: It creates a DecisionTreeClassifier and trains it using the training data.
3. Grid Search for Hyperparameter Tuning: It defines a parameter grid and performs GridSearchCV to find the best hyperparameters for the decision tree.
4. Accuracy Calculation: It calculates and prints the accuracy before and after hyperparameter tuning (a sketch of these steps follows the list).
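Continuing from the split above, a minimal sketch of the training and tuning steps (the grid values and cv=5 are illustrative choices, not fixed by the experiment):
# Baseline decision tree (before tuning)
dt = DecisionTreeClassifier(random_state=42)
dt.fit(X_train, y_train)
print(f"Accuracy before tuning: {accuracy_score(y_test, dt.predict(X_test))}")
# Parameter grid (illustrative values)
param_grid = {
    'max_depth': [2, 3, 4, 5, None],
    'min_samples_split': [2, 5, 10],
    'criterion': ['gini', 'entropy']
}
# 5-fold cross-validated search over the grid
grid_search = GridSearchCV(DecisionTreeClassifier(random_state=42), param_grid, cv=5)
grid_search.fit(X_train, y_train)
print(f"Best hyperparameters: {grid_search.best_params_}")
# Accuracy of the best estimator found by the search
print(f"Accuracy after tuning: {accuracy_score(y_test, grid_search.best_estimator_.predict(X_test))}")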
output:
Explanation of Output:
• The initial accuracy is the performance of the decision tree model before any hyperparameter
tuning.
• The best hyperparameters are determined by GridSearchCV through cross-validation.
• The tuned accuracy shows the performance after applying the best combination of
hyperparameters.
7. Implementation of KNN using sklearn
Code:
# Import necessary libraries
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
Explanation:
1. Data Loading: The Iris dataset is loaded using load_iris(), which is a well-known
dataset containing 150 samples from 3 species of iris flowers.
2. Model Creation: The KNeighborsClassifier is initialized with n_neighbors=3, meaning
it will consider the 3 nearest neighbors to classify a data point.
3. Training: The model is trained using knn.fit(), which uses the training data (X_train,
y_train).
4. Prediction: The model predicts on the test data (X_test).
5. Accuracy: The accuracy score is computed using accuracy_score() to evaluate the
performance (a sketch of these steps follows the list).
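A minimal sketch of these steps, with the 80/20 split and random_state=42 assumed to match the earlier experiments:
# Load the data and split into training and testing sets
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Create and train the KNN model with 3 neighbors
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)
# Predict on the test data and compute the accuracy
y_pred = knn.predict(X_test)
print(f"Accuracy: {accuracy_score(y_test, y_pred)}")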
output:
Explanation of Output:
• The accuracy indicates the percentage of correctly classified instances from the
test set.
• A result of 1.00 (or 100%) means that the model perfectly classified all the test
samples. This can happen when the dataset is small and its classes are
well-separated, as is the case for the Iris dataset.
8. Implementation of Logistic Regression using sklearn
Code:
# Import necessary libraries
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
Explanation:
1. Data Loading: The Iris dataset is loaded using load_iris() from sklearn.datasets. This
dataset contains 150 samples from three species of iris flowers, and we use it to
predict the species based on the input features (sepal length, sepal width, petal
length, and petal width).
2. Model Creation: We create a LogisticRegression model. The parameter max_iter=200
raises the iteration limit so the solver has enough iterations to converge.
3. Training: The model is trained using the training data (X_train, y_train) via
logreg.fit().
4. Prediction: After training, the model predicts the species for the test data (X_test).
5. Accuracy Calculation: The accuracy of the model is computed by comparing the
predicted labels (y_pred) with the true labels (y_test) (a sketch of these steps follows the list).
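A minimal sketch of these steps, with the split settings assumed to match the earlier experiments:
# Load the data and split into training and testing sets
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Create and train the logistic regression model
logreg = LogisticRegression(max_iter=200)
logreg.fit(X_train, y_train)
# Predict on the test data and compute the accuracy
y_pred = logreg.predict(X_test)
print(f"Accuracy: {accuracy_score(y_test, y_pred)}")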
output:
Explanation of Output:
• The accuracy represents the proportion of correctly predicted samples out of the
total test samples.
• In this example, a perfect accuracy of 1.00 means the model was able to correctly
classify all the test samples. This result might happen if the data is relatively
simple and well-separable by logistic regression.
9. Implementation of K-Means Clustering
Code:
# Import necessary libraries
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
# Load the Iris dataset (features only; the true labels are not used for clustering)
iris = load_iris()
X = iris.data
# Initialize the KMeans model with 3 clusters (as we know there are 3 species)
kmeans = KMeans(n_clusters=3, random_state=42)
# Fit the model and get a cluster label for each sample
y_pred = kmeans.fit_predict(X)
# Coordinates of the cluster centroids
centroids = kmeans.cluster_centers_
# Inertia: sum of squared distances of samples to their closest centroid
print(f"Inertia: {kmeans.inertia_}")
# Visualize the clusters using the first two features (sepal length and sepal width)
plt.figure(figsize=(8,6))
sns.scatterplot(x=X[:, 0], y=X[:, 1], hue=y_pred, palette="Set1", s=100)
plt.scatter(centroids[:, 0], centroids[:, 1], marker="X", color="black", s=200, label="Centroids")
plt.title('K-Means Clustering (Iris Dataset)', fontsize=15)
plt.xlabel('Sepal Length')
plt.ylabel('Sepal Width')
plt.legend()
plt.show()
Explanation:
1. Data Loading: The Iris dataset is loaded using load_iris(), which contains 150
samples with 4 features (sepal length, sepal width, petal length, and petal width).
2. Model Creation: We initialize a KMeans model with n_clusters=3 (since we expect
3 clusters, corresponding to 3 species of Iris).
3. Model Training: The fit() method is used to fit the KMeans model on the data.
4. Cluster Labels: We get the predicted labels (y_pred) that indicate which cluster
each data point belongs to.
5. Centroids: The cluster_centers_ attribute gives the coordinates of the cluster
centroids.
6. Inertia: The inertia_ represents the sum of squared distances of samples to their
closest cluster center, which is a measure of the model's "tightness" or how well
the samples fit their clusters.
7. Visualization: We plot the clusters based on the first two features (sepal length
and sepal width) and mark the centroids on the scatter plot.
output:
10. Comparison of classification models using sklearn
Code:
# Split the data into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Inside the evaluation loop, each model's metrics are stored by name
performance_metrics[model_name] = {
    "Accuracy": accuracy,
    "Precision": precision,
    "Recall": recall,
    "F1 Score": f1
}
Explanation:
1. Dataset Loading: The Iris dataset is loaded using load_iris() from sklearn.datasets,
and it's split into features (X) and target labels (y).
2. Model Initialization: We initialize five classification models:
o Logistic Regression
o K-Nearest Neighbors (KNN)
o Decision Tree Classifier
o Support Vector Machine (SVM)
o Random Forest Classifier
3. Model Training & Prediction: Each model is trained using the training data
(X_train, y_train), and predictions are made on the test set (X_test).
4. Evaluation Metrics: The models are evaluated using:
o Accuracy: The proportion of correct predictions.
o Precision: The proportion of positive predictions that are actually positive.
o Recall: The proportion of actual positive instances that were predicted
correctly.
o F1-Score: The harmonic mean of precision and recall.
5. Performance Output: The metrics are stored in a dictionary and converted to a
pandas DataFrame for easy visualization (a complete sketch of these steps follows the list).
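Putting these steps together, a minimal sketch of the full comparison (model settings such as max_iter=200 and n_neighbors=3 follow the earlier experiments; macro averaging is one reasonable choice for the multi-class precision, recall, and F1 metrics):
# Import necessary libraries
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
# Load and split the data
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# The five models being compared
models = {
    "Logistic Regression": LogisticRegression(max_iter=200),
    "KNN": KNeighborsClassifier(n_neighbors=3),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "SVM": SVC(),
    "Random Forest": RandomForestClassifier(random_state=42)
}
# Train each model, predict on the test set, and record its metrics
performance_metrics = {}
for model_name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    performance_metrics[model_name] = {
        "Accuracy": accuracy_score(y_test, y_pred),
        "Precision": precision_score(y_test, y_pred, average='macro'),
        "Recall": recall_score(y_test, y_pred, average='macro'),
        "F1 Score": f1_score(y_test, y_pred, average='macro')
    }
# Convert to a DataFrame for easy viewing
results_df = pd.DataFrame(performance_metrics).T
print(results_df)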
Output:
Explanation of Output:
• Accuracy: This metric indicates how many of the predictions made by the model
were correct.
• Precision: Precision is a measure of how many of the predicted positive classes
were actually positive. A higher precision means fewer false positives.
• Recall: Recall measures how many of the actual positive classes were correctly
predicted. A higher recall means fewer false negatives.
• F1-Score: The F1-score provides a balance between precision and recall. It is
useful when you need to balance both false positives and false negatives.