ML in Python Part-2

This document discusses various techniques for pre-processing data and evaluating machine learning models in Python. It covers: - Standardizing and normalizing numerical data for modeling using scikit-learn. - Resampling methods like k-fold cross-validation to evaluate model accuracy on unseen data. - Metrics for evaluating regression and classification algorithms, including accuracy, log loss, and RMSE. - Spot-checking algorithms like kNN, linear regression, and random forests on sample datasets. - Comparing the performance of different algorithms like logistic regression and LDA to select the best model.

Uploaded by

Usman Ali

Machine Learning in Python 2

Dr. Hafeez
Prepare for Modeling
Pre-processing data
• Raw data might not be in the best shape for modeling
• Pre-processing the data is required
• To best present the inherent structure of the data to the modeling algorithm
• What does Python offer to pre-process data?
scikit-learn offers
• Two standard idioms for transforming data
– Fit and multiple transform
– Combined fit-and-transform
• Techniques to prepare data for modeling
– Standardize numerical data (mean=0 and stdev=1)
• Through centering and scaling (StandardScaler)
– Normalize numerical data (range 0-1)
• Through rescaling to a range (MinMaxScaler)
– Explore advanced feature engineering
• Binarizing (Binarizer)
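The two idioms and the three techniques listed above can be sketched as follows. This is a minimal illustration rather than code from the slides: the toy arrays `X_train` and `X_new` and the threshold of 2.5 are assumptions made up for the example.

```python
import numpy as np
from sklearn.preprocessing import Binarizer, MinMaxScaler, StandardScaler

# Hypothetical toy data (not from the slides): 5 rows, 2 numeric features
X_train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0], [5.0, 50.0]])
X_new = np.array([[3.0, 30.0], [6.0, 60.0]])

# Idiom 1: fit once, then transform as many arrays as needed
scaler = StandardScaler().fit(X_train)
train_std = scaler.transform(X_train)
new_std = scaler.transform(X_new)  # reuses the mean/stdev learned from X_train

# Idiom 2: combined fit-and-transform in a single call
train_std_2 = StandardScaler().fit_transform(X_train)

# Normalize each column to the range 0-1
train_norm = MinMaxScaler(feature_range=(0, 1)).fit_transform(X_train)

# Binarize: values above the (assumed) threshold become 1, the rest 0
train_bin = Binarizer(threshold=2.5).fit_transform(X_train)

print(train_std[:2])
print(train_norm[:2])
print(train_bin[:2])
```

The fit-then-transform idiom is the one to prefer when the same scaling has to be applied later to new data, since the parameters are learned only once.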
Example: Pima Indians diabetes dataset
• Calculate the parameters needed to standardize the data
• Create a standardized copy of the input data
• Standardize data (mean=0, stdev=1)
Code
import numpy
import pandas
from sklearn.preprocessing import StandardScaler
url = "https://goo.gl/bDdBiA"
names = ['preg', 'plas', 'pres', 'skin', 'test',
         'mass', 'pedi', 'age', 'class']
dataframe = pandas.read_csv(url, names=names)
array = dataframe.values
# separate array into input and output components
X = array[:, 0:8]
Y = array[:, 8]
# fit computes each column's mean and stdev; transform applies them
scaler = StandardScaler().fit(X)
rescaledX = scaler.transform(X)
# summarize transformed data
numpy.set_printoptions(precision=3)
print(rescaledX[0:5, :])
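To make the "calculate parameters, then apply them" step concrete, the sketch below reproduces what the scaler does by hand. It assumes a small made-up array `X_toy` instead of the Pima data so that it runs without a download; `mean_` and `scale_` are the attributes the fitted StandardScaler exposes.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy input (not the Pima data): 4 rows, 2 columns
X_toy = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0], [4.0, 400.0]])

# fit() stores the per-column mean and standard deviation
scaler = StandardScaler().fit(X_toy)

# transform() is equivalent to z = (x - mean) / stdev, column by column
manual = (X_toy - scaler.mean_) / scaler.scale_
matches = np.allclose(manual, scaler.transform(X_toy))

print("means:", scaler.mean_)
print("stdevs:", scaler.scale_)
print("manual result matches transform():", matches)
```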
 
Resampling Methods: Algorithm Evaluation

• The data split used to train a machine learning algorithm is called the training dataset
➢ Problem:
• However, accuracy measured on the training data is not a reliable estimate of the model's accuracy on new/unseen data
• Yet the whole idea of creating the model was to enable predictions on new data
➢ Solution:
• Use resampling methods
Resampling Methods: Algorithm Evaluation

• Use statistical methods called resampling methods
• Split your training data into further subsets
• Use some of the subsets for training and
remaining subsets to estimate the accuracy of
the model on unseen data
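The simplest form of this idea is a single train/test split. The sketch below is only an illustration under stated assumptions: the data is synthetic (generated with `make_classification`, not the Pima dataset), and the 33% test size and `max_iter=1000` are arbitrary choices for the example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): 200 rows, 8 input features
X, y = make_classification(n_samples=200, n_features=8, random_state=7)

# Hold back a third of the rows; the model never sees them during training
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=7
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Accuracy on the training rows tends to be optimistic;
# accuracy on the held-back rows estimates performance on unseen data
train_acc = model.score(X_train, y_train)
test_acc = model.score(X_test, y_test)
print("train accuracy: %.3f, test accuracy: %.3f" % (train_acc, test_acc))
```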
Resampling Methods: Algorithm Evaluation

• In a nutshell:
• Split the dataset into training and test sets
• Estimate the accuracy of an ML algorithm using k-fold cross validation
– Splits the training data into k subsets
• Estimate the accuracy of an ML algorithm using leave-one-out cross validation
➢ Next, use scikit-learn to estimate the accuracy of logistic regression on the Pima Indians onset of diabetes dataset using 10-fold cross validation
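Leave-one-out cross validation is listed above but not demonstrated in the slides, so here is a hedged sketch that runs it next to 10-fold cross validation. The data is synthetic and deliberately small (60 rows, an assumption for the example) because leave-one-out trains one model per row.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, LeaveOneOut, cross_val_score

# Synthetic stand-in data (assumption): 60 rows, 8 input features
X, y = make_classification(n_samples=60, n_features=8, random_state=7)
model = LogisticRegression(max_iter=1000)

# k-fold: 10 models, each evaluated on a held-out tenth of the rows
kfold = KFold(n_splits=10, shuffle=True, random_state=7)
kfold_scores = cross_val_score(model, X, y, cv=kfold)

# leave-one-out: one model per row, each evaluated on that single row,
# so every individual score is either 0.0 or 1.0
loo_scores = cross_val_score(model, X, y, cv=LeaveOneOut())

print("10-fold accuracy: %.3f (%.3f)" % (kfold_scores.mean(), kfold_scores.std()))
print("leave-one-out accuracy: %.3f" % loo_scores.mean())
```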
Evaluate using cross validation
from pandas import read_csv
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
url = "https://goo.gl/bDdBiA"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age',
         'class']
dataframe = read_csv(url, names=names)
array = dataframe.values
X = array[:, 0:8]
Y = array[:, 8]
# recent scikit-learn releases raise an error if random_state is set without shuffle=True
kfold = KFold(n_splits=10, shuffle=True, random_state=7)
model = LogisticRegression()
results = cross_val_score(model, X, Y, cv=kfold)
print("Accuracy: %.3f%% (%.3f%%)" %
      (results.mean()*100.0, results.std()*100.0))
Algorithm evaluation metrics
• Evaluation metrics are available for the ML algorithms in the scikit-learn library
– Chosen through the scoring argument of cross_val_score()
– The defaults can be used for both regression and classification problems
• Practice the accuracy and kappa metrics on a classification problem
• Practice generating a confusion matrix and a classification report
• Practice using the RMSE and R-squared metrics on a regression problem
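The classification metrics named above can be tried on a handful of labels before moving to a real dataset. The label vectors `y_true` and `y_pred` below are made up for illustration; the functions come from `sklearn.metrics`.

```python
from sklearn.metrics import (
    accuracy_score,
    classification_report,
    cohen_kappa_score,
    confusion_matrix,
)

# Hypothetical true labels and predictions (not from the slides)
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 1, 0]
y_pred = [0, 0, 1, 0, 1, 1, 0, 1, 1, 0]

# Fraction of predictions that are correct
accuracy = accuracy_score(y_true, y_pred)

# Agreement corrected for the agreement expected by chance
kappa = cohen_kappa_score(y_true, y_pred)

# Rows are true classes, columns are predicted classes
matrix = confusion_matrix(y_true, y_pred)

# Per-class precision, recall and F1 as a formatted text table
report = classification_report(y_true, y_pred)

print("accuracy:", accuracy)
print("kappa:", kappa)
print(matrix)
print(report)
```

Here 8 of the 10 predictions are correct, and since the two classes are balanced the chance agreement is 0.5, which gives kappa = (0.8 - 0.5) / (1 - 0.5) = 0.6.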
Algorithm evaluation metrics
• Calculate the LogLoss metric on the Pima Indians onset of diabetes dataset
from pandas import read_csv
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
url = "https://goo.gl/bDdBiA"
names = ['preg', 'plas', 'pres', 'skin',
         'test', 'mass', 'pedi', 'age', 'class']
dataframe = read_csv(url, names=names)
array = dataframe.values
X = array[:, 0:8]
Y = array[:, 8]
# recent scikit-learn releases raise an error if random_state is set without shuffle=True
kfold = KFold(n_splits=10, shuffle=True, random_state=7)
model = LogisticRegression()
# the scorer reports negated log loss, so values closer to 0 are better
scoring = 'neg_log_loss'
results = cross_val_score(model, X, Y,
                          cv=kfold, scoring=scoring)
print("Logloss: %.3f (%.3f)" % (results.mean(),
                                results.std()))
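The log loss figure is easier to read once the formula behind it is spelled out. The sketch below computes it by hand for four made-up predictions and compares the result with `sklearn.metrics.log_loss`; the label and probability values are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.metrics import log_loss

# Hypothetical true labels and predicted probabilities of class 1
y_true = np.array([1, 0, 1, 1])
p = np.array([0.9, 0.2, 0.6, 0.3])

# Binary log loss: average of -[y*log(p) + (1-y)*log(1-p)]
manual = -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

# Library version; for binary labels a 1-D array is read as P(class=1)
library = log_loss(y_true, p)

print("manual log loss:  %.4f" % manual)
print("library log loss: %.4f" % library)
# cross_val_score with scoring='neg_log_loss' reports this value negated
print("as a neg_log_loss score: %.4f" % -library)
```

Confident wrong predictions are penalized heavily, which is why the fourth row (true class 1, predicted probability 0.3) contributes the most to the total.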
Spot-Check ML Algorithms
• It is difficult to know beforehand which ML algorithm will perform best on the data
• Use trial and error, also called spot-checking
• The scikit-learn library provides tools to compare the estimated accuracy of these algorithms
• Spot-check linear algorithms on a dataset
– Linear regression, logistic regression and linear discriminant analysis
• Spot-check non-linear algorithms on a dataset
– kNN, SVM and CART
• Spot-check sophisticated ensemble algorithms on a dataset
– Random forest and stochastic gradient boosting
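One way to spot-check several of the algorithms named above is to run them through the same cross-validation folds in a loop. This is a sketch under assumptions: the data is synthetic, the short labels in `candidates` are made up for the example, and mostly default hyperparameters are used, so the printed numbers say nothing about the Pima or Boston datasets.

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption): 200 rows, 8 input features
X, y = make_classification(n_samples=200, n_features=8, random_state=7)

# Linear, non-linear and ensemble candidates for a classification problem
candidates = {
    "LR": LogisticRegression(max_iter=1000),
    "LDA": LinearDiscriminantAnalysis(),
    "kNN": KNeighborsClassifier(),
    "SVM": SVC(),
    "CART": DecisionTreeClassifier(random_state=7),
    "RF": RandomForestClassifier(n_estimators=50, random_state=7),
    "GBM": GradientBoostingClassifier(random_state=7),
}

# Same folds for every algorithm so the estimates are comparable
kfold = KFold(n_splits=10, shuffle=True, random_state=7)
mean_scores = {}
for label, estimator in candidates.items():
    scores = cross_val_score(estimator, X, y, cv=kfold, scoring="accuracy")
    mean_scores[label] = scores.mean()
    print("%-4s %.3f (%.3f)" % (label, scores.mean(), scores.std()))
```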
Spot-checking example
• k-Nearest Neighbor algorithm on the Boston House Price dataset
# kNN Regression
from pandas import read_csv
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsRegressor
url = "https://goo.gl/FmJUSM"
names = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX',
         'PTRATIO', 'B', 'LSTAT', 'MEDV']
# the file is whitespace-delimited
dataframe = read_csv(url, sep=r"\s+", names=names)
array = dataframe.values
X = array[:, 0:13]
Y = array[:, 13]
# recent scikit-learn releases raise an error if random_state is set without shuffle=True
kfold = KFold(n_splits=10, shuffle=True, random_state=7)
model = KNeighborsRegressor()
# scores are negated mean squared errors, so values closer to 0 are better
scoring = 'neg_mean_squared_error'
results = cross_val_score(model, X, Y, cv=kfold, scoring=scoring)
print(results.mean())
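The negated mean squared error printed above can be turned into the RMSE and R-squared metrics mentioned earlier. The following sketch assumes synthetic regression data from `make_regression` (13 features, to mirror the shape of the housing inputs) rather than the Boston file itself.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsRegressor

# Synthetic stand-in data (assumption): 200 rows, 13 input features
X, y = make_regression(n_samples=200, n_features=13, noise=10.0, random_state=7)

kfold = KFold(n_splits=10, shuffle=True, random_state=7)
model = KNeighborsRegressor()

# One negated MSE per fold; flip the sign and take the root to get RMSE
neg_mse = cross_val_score(model, X, y, cv=kfold, scoring="neg_mean_squared_error")
rmse_per_fold = np.sqrt(-neg_mse)

# R-squared: 1.0 is a perfect fit, 0.0 is no better than predicting the mean
r2 = cross_val_score(model, X, y, cv=kfold, scoring="r2")

print("RMSE: %.3f (%.3f)" % (rmse_per_fold.mean(), rmse_per_fold.std()))
print("R^2:  %.3f (%.3f)" % (r2.mean(), r2.std()))
```

RMSE is expressed in the same units as the target variable, which usually makes it easier to interpret than the raw mean squared error.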
Model comparison and selection
• Next, compare the estimated performance of the different algorithms and then select the best one
• Compare linear algorithms with each other on a given dataset
• Compare non-linear algorithms with each other on a given dataset
• Create plots of the results to compare the algorithms
Model comparison and selection
• Example shows logistic regression and linear discriminant analysis on the Pima Indians diabetes dataset
# Compare Algorithms
from pandas import read_csv
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
# load dataset
url = "https://goo.gl/bDdBiA"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = read_csv(url, names=names)
array = dataframe.values
X = array[:, 0:8]
Y = array[:, 8]
# prepare models
models = []
models.append(('LR', LogisticRegression()))
models.append(('LDA', LinearDiscriminantAnalysis()))
# evaluate each model in turn
results = []
names = []
scoring = 'accuracy'
for name, model in models:
    # same seed for every model so all are scored on identical folds
    kfold = KFold(n_splits=10, shuffle=True, random_state=7)
    cv_results = cross_val_score(model, X, Y, cv=kfold, scoring=scoring)
    results.append(cv_results)
    names.append(name)
    msg = "%s: %f (%f)" % (name, cv_results.mean(), cv_results.std())
    print(msg)
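The per-fold scores collected in such a loop can be plotted as one box per algorithm. The sketch below is an assumption-laden illustration: it uses synthetic data instead of the Pima file, the non-interactive "Agg" backend so it runs without a display, and a made-up output filename `algorithm_comparison.png`.

```python
import matplotlib

matplotlib.use("Agg")  # render to a file; no display needed
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in data (assumption): 200 rows, 8 input features
X, y = make_classification(n_samples=200, n_features=8, random_state=7)

models = [
    ("LR", LogisticRegression(max_iter=1000)),
    ("LDA", LinearDiscriminantAnalysis()),
]

results = []
names = []
for name, model in models:
    kfold = KFold(n_splits=10, shuffle=True, random_state=7)
    results.append(cross_val_score(model, X, y, cv=kfold, scoring="accuracy"))
    names.append(name)

# One box per algorithm, showing the spread of its 10 fold accuracies
fig, ax = plt.subplots()
ax.boxplot(results)
ax.set_xticklabels(names)
ax.set_title("Algorithm Comparison")
ax.set_ylabel("Accuracy")
fig.savefig("algorithm_comparison.png")  # hypothetical output filename
```

A box plot shows the spread across folds as well as the median, so two algorithms with similar mean accuracy but different stability can be told apart.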
Algorithm Tuning
