Workshop on “Machine Learning Concepts
and Applications”
P.S.R.Engineering College,
Sevalpatti, Sivakasi
Dr. R. Meena Prakash, Associate Professor/PSREC
9/16/2021
Session 2
Python for Machine Learning algorithms and
applications
Dr.R.Meena Prakash
Associate Professor/ECE
P.S.R.Engineering College,
Sevalpatti, Sivakasi
Contents
• Anaconda Installation
• NumPy, SciPy, matplotlib, pandas, OpenCV
• scikit-learn
• K-Nearest Neighbors Classification
• Linear Regression
• SVM Classifier
• K-Means Image Segmentation
• PCA and Logistic Regression Classifier
Anaconda Installation
https://www.anaconda.com/products/individual
• Anaconda describes its Individual Edition as the world’s most
popular Python distribution platform, with over 20 million users
worldwide.
• Over 7,500 data science and machine learning packages
are available. Thousands of open-source packages can be
installed with the conda install command.
NumPy, SciPy, matplotlib, OpenCV and pandas
• NumPy provides highly optimized multidimensional
arrays. These are the basic data structure of most
state-of-the-art algorithms.
• SciPy uses these arrays to provide a set of fast
numerical recipes. SciPy contains modules for
optimization, linear algebra, integration, interpolation,
special functions, FFT, and signal and image processing.
• matplotlib is a feature-rich library for plotting
high-quality graphs in Python.
• OpenCV is a popular library for computer vision.
• pandas is a Python data analysis and manipulation tool;
it is well suited to working with Excel data in Python.
• Scalars (0D tensors)
A tensor that contains only one number is called
a scalar.
• Vectors (1D tensors)
An array of numbers is called a vector, or 1D
tensor. A 1D tensor has exactly one axis.
• Matrices (2D tensors)
An array of vectors is a matrix, or 2D tensor. A
matrix has two axes (often referred to as
rows and columns).
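The three definitions above can be checked directly in NumPy. This is a minimal sketch; the sample values are arbitrary and chosen only for illustration.

```python
import numpy as np

scalar = np.array(12)                  # 0D tensor: a single number, no axes
vector = np.array([12, 3, 6, 14])      # 1D tensor: exactly one axis
matrix = np.array([[5, 78, 2],
                   [6, 79, 3]])        # 2D tensor: rows and columns

# ndim is the number of axes; shape gives the length along each axis
for name, t in [("scalar", scalar), ("vector", vector), ("matrix", matrix)]:
    print(name, "ndim =", t.ndim, "shape =", t.shape)
```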
• >>> import numpy as np
• >>> a = np.array([0,1,2,3,4,5])
• >>> a
• array([0, 1, 2, 3, 4, 5])
• >>> a.ndim
• 1
• >>> a.shape
• (6,)
• >>> b = a.reshape((3,2))
• >>> b
• array([[0, 1],
• [2, 3],
• [4, 5]])
• >>> b.ndim
• 2
• >>> b.shape
• (3, 2)
• c = a.reshape(3, 1, 2)
• print(c)
• print(c.ndim)
• print(c.shape)
• npdata = np.arange(3)
• print(npdata)
• npdata = np.arange(40)
• npdata.shape = (5, 8)
• print(npdata)
Output
• [[[0 1]]
• [[2 3]]
• [[4 5]]]
• 3
• (3, 1, 2)
• [0 1 2]
• [[ 0 1 2 3 4 5 6 7]
• [ 8 9 10 11 12 13 14 15]
• [16 17 18 19 20 21 22 23]
• [24 25 26 27 28 29 30 31]
• [32 33 34 35 36 37 38 39]]
SciPy Packages
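• import numpy as np
• # SciPy has deprecated its re-exports of NumPy functions
• # (genfromtxt, sum, isnan), so NumPy is called directly
• data = np.genfromtxt("web_traffic.tsv", delimiter="\t")
• print(data[:10])
• print(data.shape)
• x = data[:,0] # time in hours
• y = data[:,1] # hits per hour
• (Data: 743 hours of web traffic, stored in a text file that
opens in WordPad; missing values: 8)
• n = np.sum(np.isnan(y)) # count the missing values
• print(n)
• # keep only the rows where y is present (filter x first, using y's mask)
• x = x[~np.isnan(y)]
• y = y[~np.isnan(y)]
• print(x)
• print(y)
• print(x.shape)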
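The missing-value filtering can be tried without the data file. The sketch below uses a small made-up array as a stand-in for the two columns of web_traffic.tsv (the numbers are hypothetical) and builds the mask once so that x and y stay aligned.

```python
import numpy as np

# Hypothetical stand-in for the two columns of the traffic file
x = np.arange(1, 7, dtype=float)                              # hour index
y = np.array([2272.0, np.nan, 1386.0, 1365.0, np.nan, 1488.0])  # hits per hour

valid = ~np.isnan(y)                 # True where a measurement is present
n_missing = int(np.isnan(y).sum())   # number of missing values

# Apply the same mask to both arrays so rows remain paired
x_clean, y_clean = x[valid], y[valid]
print(n_missing, x_clean, y_clean)
```

Computing the mask once avoids any dependence on the order in which x and y are filtered.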
• import matplotlib.pyplot as plt
• # plot the (x,y) points with dots of size 10
• plt.scatter(x, y, s=10)
• plt.title("Web traffic over the last month")
• plt.xlabel("Time")
• plt.ylabel("Hits/hour")
• plt.xticks([w*7*24 for w in range(10)], ['week %i' % w for w in range(10)])
• plt.autoscale(tight=True)
• # draw a light gray, solid grid
• plt.grid(True, linestyle='-', color='0.75')
• plt.show()
Scikit-learn
• scikit-learn is the standard Python library for many
machine learning tasks, including classification
• fit(features, labels): this is the learning step,
which fits the parameters of the model
• predict(features): this method can only be
called after fit and returns a prediction for one
or more inputs
• Install with: conda install scikit-learn
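As an illustration of the fit/predict pattern, the sketch below runs two different estimators through the same two calls. The choice of estimators, their parameters, and the use of the built-in iris data are assumptions made for this example.

```python
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

iris = load_iris()
features, labels = iris.data, iris.target

predictions = {}
# Every scikit-learn classifier exposes the same fit / predict interface
for model in (KNeighborsClassifier(n_neighbors=3), SVC(kernel="linear")):
    model.fit(features, labels)              # learning step
    name = type(model).__name__
    predictions[name] = model.predict(features[:3])   # predict for 3 inputs
    print(name, predictions[name])
```

Because the interface is shared, swapping one classifier for another usually means changing only the line that constructs the model.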
Cross Validation
• Cross-validation is a re-sampling procedure used to
evaluate machine learning models on a limited data
sample.
• Shuffle the dataset randomly.
• Split the dataset into k groups
• For each unique group:
– Take the group as a hold out or test data set
– Take the remaining groups as a training data set
– Fit a model on the training set and evaluate it on the test set
– Retain the evaluation score and discard the model
• Summarize the skill of the model using the sample of
model evaluation scores. Typical choices are k = 5 or k = 10.
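The procedure above can be sketched with scikit-learn's KFold. The dataset (iris), the classifier (1-nearest-neighbor) and the seed are assumptions chosen to keep the example self-contained.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# shuffle=True shuffles the data before splitting it into k = 5 groups
kf = KFold(n_splits=5, shuffle=True, random_state=0)
scores = []
for train_idx, test_idx in kf.split(X):
    model = KNeighborsClassifier(n_neighbors=1)   # fresh model for each fold
    model.fit(X[train_idx], y[train_idx])         # train on the other groups
    scores.append(model.score(X[test_idx], y[test_idx]))  # keep score only

print("fold accuracies:", np.round(scores, 3))
print("mean accuracy: %.3f" % np.mean(scores))
```

The mean of the fold scores is the usual summary; the spread across folds gives a rough idea of how sensitive the model is to the particular split.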
Cross Validation
K-Nearest Neighbor Algorithm
• KNN is a non-parametric algorithm: it makes no
assumption about the underlying data distribution
(such as a Gaussian mixture model, GMM)
• K is the number of nearest neighbors
The steps include
• Calculate the distance from the query point to every training point
• Find the K closest neighbors
• Vote for the label among those neighbors
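The three steps can be written out in a few lines of NumPy. This is a sketch under assumptions: Euclidean distance, a simple majority vote, and a tiny made-up training set; the function name knn_predict is hypothetical.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, query, k=3):
    # Step 1: Euclidean distance from the query to every training point
    distances = np.linalg.norm(X_train - query, axis=1)
    # Step 2: indices of the k closest training points
    nearest = np.argsort(distances)[:k]
    # Step 3: majority vote among the labels of those neighbors
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]

# Two well-separated groups of 2D points (made-up data)
X_train = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                    [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]])
y_train = np.array([0, 0, 0, 1, 1, 1])

pred_a = knn_predict(X_train, y_train, np.array([1.1, 1.0]))
pred_b = knn_predict(X_train, y_train, np.array([5.1, 5.0]))
print(pred_a, pred_b)   # prints: 0 1
```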
K Neighbors Classifier for classification of iris data
• from matplotlib import pyplot as plt
• import numpy as np
• from sklearn.datasets import load_iris
• from sklearn.model_selection import train_test_split
• from sklearn.model_selection import KFold
• data = load_iris()
• features = data.data
• feature_names = data.feature_names
• target = data.target
• target_names = data.target_names
• labels = target_names[target]
• from sklearn.neighbors import KNeighborsClassifier
• (Features are the length and width of sepals
and petals)
• classifier = KNeighborsClassifier(n_neighbors=1)
• X=features
• y=target
• X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
• classifier.fit(X_train, y_train)
• print(classifier.score(X_test, y_test))
• (random_state: seed for the random number generator)
• Output: Accuracy = 0.98
Linear Regression
• import numpy as np
• import matplotlib.pyplot as plt
• import pandas as pd
• dataset = pd.read_csv('Position_Salaries.csv')
• X = dataset.iloc[:, 1:2].values
• y = dataset.iloc[:, 2].values
• from sklearn.tree import DecisionTreeRegressor
• regressor = DecisionTreeRegressor(random_state=0)
• (iloc selects rows and columns of a pandas DataFrame
by integer position)
• regressor.fit(X,y)
• n=np.array([6.5]).reshape(1, 1)
• y_pred = regressor.predict(n)
• plt.scatter(X, y, color = 'red')
• plt.plot(X, regressor.predict(X), color = 'blue')
• plt.title('Regression Model')
• plt.xlabel('Position level')
• plt.ylabel('Salary')
• plt.show()
• # X is 2D, so use X.min() / X.max() to get scalar bounds for arange
• X_grid = np.arange(X.min(), X.max(), 0.01)
• X_grid = X_grid.reshape((len(X_grid), 1))
• plt.scatter(X, y, color = 'red')
• plt.plot(X_grid, regressor.predict(X_grid), color = 'blue')
• plt.title('Example of Decision Tree Regression Model')
• plt.xlabel('Position level')
• plt.ylabel('Salary')
• plt.show()
• (arange returns evenly spaced values within the given
range)
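The code on these slides fits a DecisionTreeRegressor. For comparison, a plain linear regression through the same fit/predict interface might look like the sketch below. It uses synthetic data (a noisy line with slope 3 and intercept 2) rather than Position_Salaries.csv, so all values here are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data: y = 3x + 2 plus a little Gaussian noise (assumed values)
rng = np.random.default_rng(0)
X = np.arange(1, 11, dtype=float).reshape(-1, 1)   # one feature, 10 samples
y = 3.0 * X.ravel() + 2.0 + rng.normal(scale=0.1, size=10)

regressor = LinearRegression()
regressor.fit(X, y)

slope, intercept = regressor.coef_[0], regressor.intercept_
y_pred = regressor.predict(np.array([[6.5]]))      # input must be 2D
print("slope:", slope, "intercept:", intercept, "prediction at 6.5:", y_pred[0])
```

A linear model gives a straight-line fit, whereas a decision tree produces a piecewise-constant curve, which is why the fine X_grid is useful when plotting the tree.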
Output
SVM Classifier
• from sklearn import datasets
• from sklearn.model_selection import train_test_split
• iris = datasets.load_iris()
• X = iris.data # all four features; use iris.data[:, :2] to take only the first two
• y = iris.target
• from sklearn.svm import SVC
• model = SVC(kernel='linear', C=1E10) # C is the penalty parameter of the error term
• X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
• model.fit(X_train, y_train)
• print(model.score(X_test, y_test))
(For 2 features, accuracy = 0.84; for 4 features, accuracy = 0.98)
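The effect of the number of features can be explored with a short loop. This sketch makes one assumption that differs from the slide: it uses a moderate C=1.0 instead of 1E10, because a very large C can make training slow when the classes overlap. The resulting accuracies may therefore differ from the figures quoted above.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

iris = datasets.load_iris()
y = iris.target

scores = {}
for n_features in (2, 4):
    X = iris.data[:, :n_features]          # first 2 features, then all 4
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.33, random_state=42)
    model = SVC(kernel='linear', C=1.0)    # assumption: moderate C, not 1E10
    model.fit(X_train, y_train)
    scores[n_features] = model.score(X_test, y_test)

print(scores)
```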
K-Means Segmentation of image
• import cv2
• import numpy as np
• import matplotlib.pyplot as plt
• image = cv2.imread("image.jpg")
• image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
• pixel_values = image.reshape((-1, 3))
• pixel_values = np.float32(pixel_values)
• print(pixel_values.shape)
• # stop after 100 iterations or once the centers move by less than epsilon = 0.2
• criteria = (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 100, 0.2)
• k=3
• _, labels, centers = cv2.kmeans(pixel_values, k, None, criteria, 10, cv2.KMEANS_RANDOM_CENTERS)
• # convert back to 8 bit values
• centers = np.uint8(centers)
• # flatten the labels array
• labels = labels.flatten()
• # convert all pixels to the color of the centroids
• segmented_image = centers[labels.flatten()]
• # reshape back to the original image dimension
• segmented_image = segmented_image.reshape(image.shape)
• # show the image
• plt.imshow(image)
• plt.show()
• plt.imshow(segmented_image)
• plt.show()
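To show what the clustering step does without needing OpenCV or an image file, here is a NumPy-only sketch of K-means on a synthetic two-color image. It is not the cv2.kmeans implementation: the helper name kmeans, the farthest-point initialisation (used instead of random centers so the result is deterministic) and the test image are all assumptions.

```python
import numpy as np

def kmeans(pixels, k, n_iter=10):
    # Farthest-point initialisation (assumption; the slides use random centers)
    centers = [pixels[0]]
    for _ in range(1, k):
        d = np.min([np.linalg.norm(pixels - c, axis=1) for c in centers], axis=0)
        centers.append(pixels[d.argmax()])
    centers = np.array(centers, dtype=np.float64)

    for _ in range(n_iter):
        # Distance of every pixel to every center: shape (n_pixels, k)
        dists = np.linalg.norm(pixels[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)            # assign to nearest center
        for j in range(k):
            members = pixels[labels == j]
            if len(members):
                centers[j] = members.mean(axis=0)  # move center to cluster mean
    return labels, centers

# Synthetic 20x20 RGB image: reddish left half, bluish right half, plus noise
rng = np.random.default_rng(0)
image = np.zeros((20, 20, 3))
image[:, :10] = (200, 30, 30)
image[:, 10:] = (30, 30, 200)
image = np.clip(image + rng.normal(scale=5, size=image.shape), 0, 255).astype(np.uint8)

pixel_values = image.reshape((-1, 3)).astype(np.float32)
labels, centers = kmeans(pixel_values, k=2)

# Replace each pixel by the color of its cluster center, then restore the shape
segmented_image = np.uint8(centers)[labels].reshape(image.shape)
print(segmented_image.shape, np.unique(labels))
```

The last two steps mirror the slide code: index the centers with the label array, then reshape back to the original image dimensions.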
Output
Principal Component Analysis
• import numpy as np
• import matplotlib.pyplot as plt
• import pandas as pd
• # importing or loading the dataset
• dataset = pd.read_csv('wine.csv')
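• # split the dataset into features X and labels y
• # (1:13 selects columns 1 to 12; if wine.csv holds the label plus
• # 13 attributes, 1:14 would include the last attribute as well)
• X = dataset.iloc[:, 1:13].values
• y = dataset.iloc[:, 0].values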
• from sklearn.model_selection import train_test_split
• X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)
• # performing preprocessing part
• from sklearn.preprocessing import StandardScaler
• sc = StandardScaler()
• X_train = sc.fit_transform(X_train)
• X_test = sc.transform(X_test)
• # Applying PCA function on training
• # and testing set of X component
• from sklearn.decomposition import PCA
• from sklearn.linear_model import LogisticRegression
• pca = PCA(n_components = 2)
• X1_train = pca.fit_transform(X_train)
• X1_test = pca.transform(X_test)
• print(X.shape)
• print(X1_train.shape)
• variance = pca.explained_variance_ratio_
• classifier = LogisticRegression(random_state = 0)
• classifier.fit(X1_train, y_train)
• y_pred = classifier.predict(X1_test)
• print(classifier.score(X1_test, y_test))
• # refit on all standardized features and compare the test accuracy
• classifier.fit(X_train, y_train)
• y1_pred = classifier.predict(X_test)
• print(classifier.score(X_test, y_test))
• print(np.shape(X_train))
• print(np.shape(X1_train))
• plt.figure(figsize=(8,6))
• plt.scatter(X1_train[:,0], X1_train[:,1], s=10, c=y_train, cmap='rainbow')
• plt.xlabel('First principal component')
• plt.ylabel('Second principal component')
• plt.show()
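The explained_variance_ratio_ used above can be reproduced from first principles. The sketch below is an illustration under assumptions: it uses scikit-learn's built-in load_wine data as a stand-in for wine.csv (the two may not be identical) and compares PCA with a direct singular value decomposition of the standardized data.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Built-in wine data, standardized to zero mean and unit variance
X = StandardScaler().fit_transform(load_wine().data)

pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)            # project onto the first two components

# Same ratios by hand: squared singular values of the centered data,
# divided by their total, give the share of variance per component
_, s, _ = np.linalg.svd(X - X.mean(axis=0), full_matrices=False)
ratio_manual = (s ** 2) / np.sum(s ** 2)

print(X.shape, "->", X_2d.shape)
print("sklearn:", pca.explained_variance_ratio_)
print("manual: ", ratio_manual[:2])
```

The sum of the first two ratios indicates how much of the total variance survives the reduction to two dimensions.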
Linear Vs logistic regression
SVM