Data Preprocessing

The document outlines a comprehensive guide for data preprocessing and implementing various machine learning algorithms including K-NN, Decision Trees, Naive Bayes, Random Forest, and Linear Regression. It details steps such as importing libraries, handling missing values, encoding categorical data, splitting datasets into training and testing sets, and visualizing results. Each algorithm is illustrated with code snippets for training, predicting, and evaluating performance using confusion matrices and visualizations.

Uploaded by

Bharath Shivashankar

Data Preprocessing:-

1. Get Data Set

2. Import important libraries:-

import numpy as np :- for numerical calculations and array manipulation

import matplotlib.pyplot as plt :- for plotting the results

import pandas as pd :- to read and manipulate the data, and for Series operations

3. Import dataset:- (sir's example)

data file: Data.csv (or .xls)

dataset = pd.read_csv('Data.csv')

> create the matrix of all independent variables (sir's example)

x = dataset.iloc[:, :-1].values

> create the vector of the dependent variable (sir's example)
y = dataset.iloc[:, 3].values
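The iloc slicing above can be tried without the CSV file. The sketch below uses a small in-memory DataFrame as a stand-in for Data.csv; the column names and values are hypothetical and only chosen so that the target sits in column index 3.

```python
import pandas as pd

# Hypothetical stand-in for Data.csv: three feature columns, then the target.
dataset = pd.DataFrame({
    "Country": ["France", "Spain", "Germany", "Spain"],
    "Age": [44.0, 27.0, 30.0, 38.0],
    "Salary": [72000.0, 48000.0, 54000.0, 61000.0],
    "Purchased": ["No", "Yes", "No", "No"],
})

# All rows, every column except the last -> independent variables.
x = dataset.iloc[:, :-1].values
# All rows, column index 3 (the last column here) -> dependent variable.
y = dataset.iloc[:, 3].values

print(x.shape)  # (4, 3)
print(y.shape)  # (4,)
```

Because one column holds strings, x comes back as an object array; the later encoding step is what turns it into numbers.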

4. Handling missing values

taking care of missing data:-
> from sklearn.impute import SimpleImputer (sklearn is an ML library for multiple
jobs;
SimpleImputer fills in the missing values. Older scikit-learn releases used
sklearn.preprocessing.Imputer, with a capital I, which has since been removed)

> imputer = SimpleImputer(missing_values=np.nan, strategy='mean') (each missing value is replaced by the mean of its column)
imputer = imputer.fit(x[:,1:3])

> x[:,1:3] = imputer.transform(x[:,1:3])
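A minimal sketch of mean imputation on made-up numbers, assuming a scikit-learn release that provides sklearn.impute.SimpleImputer. The array values are hypothetical; the point is to see which fill value ends up in each gap.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical numeric columns (age, salary) with one gap in each.
values = np.array([
    [44.0, 72000.0],
    [27.0, np.nan],
    [np.nan, 54000.0],
    [37.0, 60000.0],
])

imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
filled = imputer.fit_transform(values)

# statistics_ holds the per-column means computed from the non-missing entries.
print(imputer.statistics_)
print(filled)
```

The missing age is filled with the mean of the three known ages, and the missing salary with the mean of the three known salaries.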

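5. Categorical Data:-
Encoding Categorical Data:
#Encoding the independent variable:-
> from sklearn.preprocessing import LabelEncoder, OneHotEncoder
> from sklearn.compose import ColumnTransformer
(LabelEncoder gives an integer code 0,1,2... to the entities of the same
category; OneHotEncoder turns a categorical column into one 0/1 column per category)

> ct = ColumnTransformer([('encoder', OneHotEncoder(), [0])], remainder='passthrough', sparse_threshold=0)
(here it will encode the first column and pass the other columns through unchanged;
older scikit-learn releases did this with LabelEncoder followed by
OneHotEncoder(categorical_features=[0]), an argument that has since been removed)

> x = ct.fit_transform(x) (to encode the first column of x in terms of 0's and
1's; the other values may be displayed in
exponential notation)
x

> labelencoder_y = LabelEncoder() (encoding y)
y = labelencoder_y.fit_transform(y)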
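The encoding step can be checked on a tiny array. This is a sketch under stated assumptions: the data is hypothetical, and it uses ColumnTransformer with OneHotEncoder, which is the route current scikit-learn releases offer for one-hot encoding a single column.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Hypothetical data: column 0 is categorical, column 1 is numeric.
x = np.array(
    [["France", 44.0], ["Spain", 27.0], ["Germany", 30.0], ["Spain", 38.0]],
    dtype=object,
)
y = np.array(["No", "Yes", "No", "Yes"])

# One-hot encode column 0, keep the rest; sparse_threshold=0 forces a dense result.
ct = ColumnTransformer(
    [("encoder", OneHotEncoder(), [0])],
    remainder="passthrough",
    sparse_threshold=0,
)
x_encoded = ct.fit_transform(x).astype(float)

# LabelEncoder maps the sorted class labels to 0, 1, ...
y_encoded = LabelEncoder().fit_transform(y)

print(x_encoded)
print(y_encoded)
```

The categories are ordered alphabetically (France, Germany, Spain), so the first three output columns are the indicators for those countries and the last column is the untouched numeric value.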
6. Splitting Training and Test Data:-

> from sklearn.model_selection import train_test_split
note:- (model_selection is the scikit-learn module with tools for splitting the whole data set into training and testing
data; train_test_split is a function inside it. Older releases kept it in
sklearn.cross_validation, which has since been removed)

> x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=0.2,random_state = 0)
(splits the whole data set into 80% of the data for training and 20% for testing;
random_state keeps the train and test data consistent between runs, if not set then
every run takes a different set of values)

try to keep the test size in the range of 20-30%: a much smaller test set gives a
noisy estimate of performance, and a much larger one leaves less data for training
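A short sketch of what test_size and random_state do, using made-up arrays of ten rows so the resulting sizes are easy to verify by hand.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 10 samples with 2 features each.
x = np.arange(20).reshape(10, 2)
y = np.arange(10)

x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=0
)
print(x_train.shape, x_test.shape)  # (8, 2) (2, 2)

# Repeating the call with the same random_state reproduces the same split.
_, _, _, y_test_again = train_test_split(x, y, test_size=0.2, random_state=0)
print(np.array_equal(y_test, y_test_again))  # True
```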

7. Feature scaling:- (is used to bring features with very different ranges onto a comparable scale.. like putting two
numbers and the square of their difference in the same graph)
(Note:- StandardScaler rescales each feature to mean 0 and standard deviation 1; the values are not limited to -1 to +1)

> from sklearn.preprocessing import StandardScaler
sc_x = StandardScaler()
x_train = sc_x.fit_transform(x_train) #fit_transform-- use only for training data
x_test = sc_x.transform(x_test)
## x_train always holds independent variables (features)
## StandardScaler is a class that scales each feature using that feature's mean and standard deviation
## fit()- learns the scaling parameters (mean and standard deviation) from the training data (only makes
the machine learn) and makes the object ready
## transform()-- applies the learned parameters to generate the transformed data set..

Note:- fit_transform() should only be applied to the training data; the test data uses transform() so that it is scaled with the training parameters
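To see what standardization actually produces, here is a small sketch on hypothetical age and salary values. It checks the per-column mean and standard deviation after scaling, and shows that a scaled value can exceed 1.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical training and test features: (age, salary).
x_train = np.array([
    [20.0, 20000.0],
    [30.0, 40000.0],
    [40.0, 60000.0],
    [50.0, 200000.0],
])
x_test = np.array([[35.0, 50000.0]])

sc_x = StandardScaler()
x_train_scaled = sc_x.fit_transform(x_train)  # learn mean/std, then scale
x_test_scaled = sc_x.transform(x_test)        # reuse the training mean/std

print(x_train_scaled.mean(axis=0))  # approximately 0 for each column
print(x_train_scaled.std(axis=0))   # approximately 1 for each column
print(x_train_scaled.max())         # greater than 1 for this data
```

The test row is scaled with the training statistics, which is why only transform() is called on it.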

**k-nn algorithm:-
from sklearn.neighbors import KNeighborsClassifier
classifier = KNeighborsClassifier(n_neighbors=5, metric='minkowski', p=2)
classifier.fit(x_train,y_train)
-------till here the machine is fit with the training data and learns from the training data----

## sklearn.neighbors is the module that contains KNeighborsClassifier

## KNeighborsClassifier takes some values=== n_neighbors is the number of neighbours... usually an odd
number, to avoid tied votes between two classes
metric == defines the distance measure being used
p=2 means using euclidean distance
------ for testing and predicting------
y_pred = classifier.predict(x_test) ## predicts only on the x_test values prepared before
y_pred
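A minimal sketch of the classifier on two hypothetical clusters of points. It also compares the distances reported by kneighbors() with a hand-computed Euclidean distance, to illustrate what metric='minkowski' with p=2 means.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical training data: two small clusters, labelled 0 and 1.
x_train = np.array([
    [0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
    [5.0, 5.0], [5.0, 6.0], [6.0, 5.0],
])
y_train = np.array([0, 0, 0, 1, 1, 1])

classifier = KNeighborsClassifier(n_neighbors=3, metric="minkowski", p=2)
classifier.fit(x_train, y_train)

query = np.array([[0.5, 0.5], [5.5, 5.5]])
print(classifier.predict(query))  # [0 1]

# Distances to the 3 nearest neighbours of the first query point...
distances, indices = classifier.kneighbors(query[:1])
# ...match the Euclidean distance computed by hand.
manual = np.sqrt(((x_train[indices[0]] - query[0]) ** 2).sum(axis=1))
print(distances[0], manual)
```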

**making the confusion matrix----

from sklearn.metrics import confusion_matrix ## confusion_matrix is a function

cm = confusion_matrix(y_test,y_pred)
cm
for labels 0 and 1 this gives a 2x2 matrix in the format [[TN,FP],[FN,TP]] (rows = actual class, columns = predicted class)
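The layout of the matrix is easiest to see on a handful of made-up labels. The sketch below unpacks the four cells with ravel() and derives the accuracy from them; the label lists are hypothetical.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted labels for ten test samples.
y_test = [0, 0, 0, 0, 1, 1, 1, 1, 1, 0]
y_pred = [0, 0, 1, 0, 1, 1, 0, 1, 1, 0]

cm = confusion_matrix(y_test, y_pred)
print(cm)
# [[4 1]
#  [1 4]]

# Row-major order of the 2x2 matrix: TN, FP, FN, TP.
tn, fp, fn, tp = cm.ravel()
accuracy = (tn + tp) / cm.sum()
print(accuracy)  # 0.8
```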

*STEP 8:- Visualizing the Training set results

from matplotlib.colors import ListedColormap

x_set,y_set=x_train,y_train

x1,x2=np.meshgrid(np.arange(start=x_set[:,0].min()-1,stop=x_set[:,0].max()+1,step=0.01),
                  np.arange(start=x_set[:,1].min()-1,stop=x_set[:,1].max()+1,step=0.01))

plt.contourf(x1,x2,classifier.predict(np.array([x1.ravel(),x2.ravel()]).T).reshape(x1.shape),
             alpha=0.75,cmap=ListedColormap(('red','green')))

plt.xlim(x1.min(),x1.max())
plt.ylim(x2.min(),x2.max())

for i,j in enumerate(np.unique(y_set)):
    plt.scatter(x_set[y_set==j,0],x_set[y_set==j,1],color=ListedColormap(('red','green'))(i),label=j)
plt.title('K-NN (Training set)')
plt.xlabel('Age')
plt.ylabel('Estimated Salary')
plt.legend()
plt.show()
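The same plotting steps are repeated for every classifier in these notes, so they can be wrapped in a helper. The sketch below is one possible way to do that; the function names decision_grid and plot_decision_regions are hypothetical, and it assumes a fitted classifier trained on exactly two features with at most two classes.

```python
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap


def decision_grid(classifier, x_set, step=0.01):
    """Return mesh coordinates and the class predicted at each grid point."""
    x1, x2 = np.meshgrid(
        np.arange(x_set[:, 0].min() - 1, x_set[:, 0].max() + 1, step),
        np.arange(x_set[:, 1].min() - 1, x_set[:, 1].max() + 1, step),
    )
    # Flatten the mesh into (n_points, 2), predict, then restore the mesh shape.
    grid_points = np.array([x1.ravel(), x2.ravel()]).T
    z = classifier.predict(grid_points).reshape(x1.shape)
    return x1, x2, z


def plot_decision_regions(classifier, x_set, y_set, title, step=0.01):
    """Draw the predicted regions and overlay the points of x_set."""
    colors = ListedColormap(("red", "green"))
    x1, x2, z = decision_grid(classifier, x_set, step)
    plt.contourf(x1, x2, z, alpha=0.75, cmap=colors)
    plt.xlim(x1.min(), x1.max())
    plt.ylim(x2.min(), x2.max())
    for i, j in enumerate(np.unique(y_set)):
        plt.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1],
                    color=colors(i), label=j)
    plt.title(title)
    plt.xlabel("Age")
    plt.ylabel("Estimated Salary")
    plt.legend()
    plt.show()
```

A call such as plot_decision_regions(classifier, x_train, y_train, 'K-NN (Training set)') would then replace the block of plotting statements, and the same call with x_test, y_test gives the test plot.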

*STEP 9:- Visualizing the Test set results

from matplotlib.colors import ListedColormap

x_set,y_set=x_test,y_test
x1,x2=np.meshgrid(np.arange(start=x_set[:,0].min()-1,stop=x_set[:,0].max()+1,step=0.01),
                  np.arange(start=x_set[:,1].min()-1,stop=x_set[:,1].max()+1,step=0.01))

plt.contourf(x1,x2,classifier.predict(np.array([x1.ravel(),x2.ravel()]).T).reshape(x1.shape),
             alpha=0.75,cmap=ListedColormap(('red','green')))

plt.xlim(x1.min(),x1.max())
plt.ylim(x2.min(),x2.max())

for i,j in enumerate(np.unique(y_set)):
    plt.scatter(x_set[y_set==j,0],x_set[y_set==j,1],color=ListedColormap(('red','green'))(i),label=j)
plt.title('K-NN (Test set)')
plt.xlabel('Age')
plt.ylabel('Estimated Salary')
plt.legend()
plt.show()

**decision tree:--

dataset=pd.read_csv('Social_Network_Ads.csv')
x = dataset.iloc[:,[2,3]].values
y = dataset.iloc[:, 4].values

from sklearn.model_selection import train_test_split

x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=0.25,random_state = 0)

from sklearn.preprocessing import StandardScaler

sc_x = StandardScaler()
x_train = sc_x.fit_transform(x_train)
x_test = sc_x.transform(x_test)

from sklearn.tree import DecisionTreeClassifier

classifier = DecisionTreeClassifier(criterion = 'entropy',random_state=0)
classifier.fit(x_train,y_train)

y_pred = classifier.predict(x_test)

from sklearn.metrics import confusion_matrix

cm = confusion_matrix(y_test,y_pred)
cm
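With criterion='entropy' the tree rates a candidate split by how much it reduces the Shannon entropy of the labels. The sketch below works through that formula on made-up labels; it is an illustration of the measure, not scikit-learn's internal implementation, and the helper names entropy and information_gain are hypothetical.

```python
import numpy as np


def entropy(labels):
    """Shannon entropy (in bits) of a 1-D array of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float((p * np.log2(1.0 / p)).sum())


def information_gain(parent, left, right):
    """Entropy of the parent minus the size-weighted entropy of the children."""
    n = len(parent)
    weighted = len(left) / n * entropy(left) + len(right) / n * entropy(right)
    return entropy(parent) - weighted


# Hypothetical node with four samples of each class.
parent = np.array([0, 0, 0, 0, 1, 1, 1, 1])

# A split that separates the classes perfectly removes all the entropy.
print(entropy(parent))                                     # 1.0
print(information_gain(parent, parent[:4], parent[4:]))    # 1.0

# A split that leaves both children half-and-half gains nothing.
mixed_left = np.array([0, 0, 1, 1])
mixed_right = np.array([0, 0, 1, 1])
print(information_gain(parent, mixed_left, mixed_right))   # 0.0
```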

training plot:-
from matplotlib.colors import ListedColormap
x_set,y_set=x_train,y_train

x1,x2=np.meshgrid(np.arange(start=x_set[:,0].min()-1,stop=x_set[:,0].max()+1,step=0.01),
                  np.arange(start=x_set[:,1].min()-1,stop=x_set[:,1].max()+1,step=0.01))

plt.contourf(x1,x2,classifier.predict(np.array([x1.ravel(),x2.ravel()]).T).reshape(x1.shape),
             alpha = 0.75,cmap = ListedColormap(('red','green')))

plt.xlim(x1.min(),x1.max())
plt.ylim(x2.min(),x2.max())

for i,j in enumerate(np.unique(y_set)):
    plt.scatter(x_set[y_set==j,0],x_set[y_set==j,1],color=ListedColormap(('red','green'))(i),label=j)
plt.title('Decision Tree (Training set)')
plt.xlabel('Age')
plt.ylabel('Estimated Salary')
plt.legend()
plt.show()

Test plot :--

from matplotlib.colors import ListedColormap

x_set,y_set=x_test,y_test

x1,x2=np.meshgrid(np.arange(start=x_set[:,0].min()-1,stop=x_set[:,0].max()+1,step=0.01),
                  np.arange(start=x_set[:,1].min()-1,stop=x_set[:,1].max()+1,step=0.01))

plt.contourf(x1,x2,classifier.predict(np.array([x1.ravel(),x2.ravel()]).T).reshape(x1.shape),
             alpha = 0.75,cmap = ListedColormap(('red','green')))

plt.xlim(x1.min(),x1.max())
plt.ylim(x2.min(),x2.max())

for i,j in enumerate(np.unique(y_set)):
    plt.scatter(x_set[y_set==j,0],x_set[y_set==j,1],color=ListedColormap(('red','green'))(i),label=j)
plt.title('Decision Tree (Test set)')
plt.xlabel('Age')
plt.ylabel('Estimated Salary')
plt.legend()
plt.show()
**Naive Bayes:-
dataset=pd.read_csv('Social_Network_Ads.csv')
x = dataset.iloc[:,[2,3]].values
y = dataset.iloc[:, 4].values

from sklearn.model_selection import train_test_split

x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=0.25,random_state = 0)

from sklearn.preprocessing import StandardScaler

sc_x = StandardScaler()
x_train = sc_x.fit_transform(x_train)
x_test = sc_x.transform(x_test)

from sklearn.naive_bayes import GaussianNB

classifier = GaussianNB()
classifier.fit(x_train,y_train)

y_pred = classifier.predict(x_test)

from sklearn.metrics import confusion_matrix

cm = confusion_matrix(y_test,y_pred)
cm
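Besides hard class labels, GaussianNB can also report class probabilities through predict_proba(). The sketch below fits it on two synthetic, well-separated clusters; the data is randomly generated for illustration and is not the Social_Network_Ads data.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Synthetic data: class 0 centred at (-2, -2), class 1 centred at (2, 2).
rng = np.random.default_rng(0)
x_train = np.vstack([
    rng.normal(-2.0, 1.0, size=(50, 2)),
    rng.normal(2.0, 1.0, size=(50, 2)),
])
y_train = np.array([0] * 50 + [1] * 50)

classifier = GaussianNB()
classifier.fit(x_train, y_train)

# Query the two cluster centres.
query = np.array([[-2.0, -2.0], [2.0, 2.0]])
print(classifier.predict(query))  # [0 1]

# One row per query point, one column per class; each row sums to 1.
proba = classifier.predict_proba(query)
print(proba)
```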

training plot:-
from matplotlib.colors import ListedColormap

x_set,y_set=x_train,y_train

x1,x2=np.meshgrid(np.arange(start=x_set[:,0].min()-1,stop=x_set[:,0].max()+1,step=0.01),
                  np.arange(start=x_set[:,1].min()-1,stop=x_set[:,1].max()+1,step=0.01))

plt.contourf(x1,x2,classifier.predict(np.array([x1.ravel(),x2.ravel()]).T).reshape(x1.shape),
             alpha = 0.75,cmap = ListedColormap(('red','green')))

plt.xlim(x1.min(),x1.max())
plt.ylim(x2.min(),x2.max())

for i,j in enumerate(np.unique(y_set)):
    plt.scatter(x_set[y_set==j,0],x_set[y_set==j,1],color=ListedColormap(('red','green'))(i),label=j)
plt.title('Naive Bayes (Training set)')
plt.xlabel('Age')
plt.ylabel('Estimated Salary')
plt.legend()
plt.show()
Test plot :--
from matplotlib.colors import ListedColormap

x_set,y_set=x_test,y_test

x1,x2=np.meshgrid(np.arange(start=x_set[:,0].min()-1,stop=x_set[:,0].max()+1,step=0.01),
                  np.arange(start=x_set[:,1].min()-1,stop=x_set[:,1].max()+1,step=0.01))

plt.contourf(x1,x2,classifier.predict(np.array([x1.ravel(),x2.ravel()]).T).reshape(x1.shape),
             alpha = 0.75,cmap = ListedColormap(('red','green')))

plt.xlim(x1.min(),x1.max())
plt.ylim(x2.min(),x2.max())

for i,j in enumerate(np.unique(y_set)):
    plt.scatter(x_set[y_set==j,0],x_set[y_set==j,1],color=ListedColormap(('red','green'))(i),label=j)
plt.title('Naive Bayes (Test set)')
plt.xlabel('Age')
plt.ylabel('Estimated Salary')
plt.legend()
plt.show()

**Random forest
dataset=pd.read_csv('Social_Network_Ads.csv')
x = dataset.iloc[:,[2,3]].values
y = dataset.iloc[:, 4].values

from sklearn.model_selection import train_test_split

x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=0.25,random_state = 0)

from sklearn.preprocessing import StandardScaler

sc_x = StandardScaler()
x_train = sc_x.fit_transform(x_train)
x_test = sc_x.transform(x_test)

from sklearn.ensemble import RandomForestClassifier

classifier = RandomForestClassifier(n_estimators = 10,criterion =
'entropy',random_state=0)
classifier.fit(x_train,y_train)

y_pred = classifier.predict(x_test)

from sklearn.metrics import confusion_matrix

cm = confusion_matrix(y_test,y_pred)
cm
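n_estimators sets how many decision trees the forest builds. The sketch below checks this on synthetic data from make_classification (not the Social_Network_Ads data) by inspecting the fitted estimators_ list.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic two-feature, two-class data for illustration only.
x, y = make_classification(
    n_samples=200, n_features=2, n_informative=2, n_redundant=0, random_state=0
)
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.25, random_state=0
)

classifier = RandomForestClassifier(
    n_estimators=10, criterion="entropy", random_state=0
)
classifier.fit(x_train, y_train)

# The fitted forest holds one decision tree per estimator.
print(len(classifier.estimators_))  # 10

# Mean accuracy on the held-out test data.
test_accuracy = classifier.score(x_test, y_test)
print(test_accuracy)
```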
training plot:-
from matplotlib.colors import ListedColormap

x_set,y_set=x_train,y_train

x1,x2=np.meshgrid(np.arange(start=x_set[:,0].min()-1,stop=x_set[:,0].max()+1,step=0.01),
                  np.arange(start=x_set[:,1].min()-1,stop=x_set[:,1].max()+1,step=0.01))

plt.contourf(x1,x2,classifier.predict(np.array([x1.ravel(),x2.ravel()]).T).reshape(x1.shape),
             alpha = 0.75,cmap = ListedColormap(('red','green')))

plt.xlim(x1.min(),x1.max())
plt.ylim(x2.min(),x2.max())

for i,j in enumerate(np.unique(y_set)):
    plt.scatter(x_set[y_set==j,0],x_set[y_set==j,1],color=ListedColormap(('red','green'))(i),label=j)
plt.title('Random Forest (Training set)')
plt.xlabel('Age')
plt.ylabel('Estimated Salary')
plt.legend()
plt.show()

Test plot :--
from matplotlib.colors import ListedColormap

x_set,y_set=x_test,y_test

x1,x2=np.meshgrid(np.arange(start=x_set[:,0].min()-1,stop=x_set[:,0].max()+1,step=0.01),
                  np.arange(start=x_set[:,1].min()-1,stop=x_set[:,1].max()+1,step=0.01))

plt.contourf(x1,x2,classifier.predict(np.array([x1.ravel(),x2.ravel()]).T).reshape(x1.shape),
             alpha = 0.75,cmap = ListedColormap(('red','green')))

plt.xlim(x1.min(),x1.max())
plt.ylim(x2.min(),x2.max())

for i,j in enumerate(np.unique(y_set)):
    plt.scatter(x_set[y_set==j,0],x_set[y_set==j,1],color=ListedColormap(('red','green'))(i),label=j)
plt.title('Random Forest (Test set)')
plt.xlabel('Age')
plt.ylabel('Estimated Salary')
plt.legend()
plt.show()

Linear Regression:-
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

dataset=pd.read_csv('Salary_Data.csv')
x = dataset.iloc[:,:-1].values
y = dataset.iloc[:,1].values

from sklearn.model_selection import train_test_split

x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=0.25,random_state = 0)

from sklearn.linear_model import LinearRegression

regressor = LinearRegression()
regressor.fit(x_train,y_train)

y_pred = regressor.predict(x_test)

plt.scatter(x_train,y_train, color='red', label='Training data')
plt.plot(x_train,regressor.predict(x_train),color = 'blue', label='Regression line')
plt.title('Salary vs Experience (Training set)')
plt.xlabel('Years of Experience')
plt.ylabel('Salary')
plt.legend()
plt.show()
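After fitting, the learned line can be read from the regressor's coef_ and intercept_ attributes. The sketch below uses made-up data that lies exactly on a line, so the recovered slope and intercept can be checked by eye; the numbers are hypothetical and not taken from Salary_Data.csv.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data lying exactly on: salary = 30000 + 9000 * years.
years = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
salary = 30000.0 + 9000.0 * years.ravel()

regressor = LinearRegression()
regressor.fit(years, salary)

slope = regressor.coef_[0]
intercept = regressor.intercept_
prediction = regressor.predict(np.array([[6.0]]))[0]

print(slope)       # approximately 9000
print(intercept)   # approximately 30000
print(prediction)  # approximately 84000
```

On real data the points scatter around the line, so the fitted slope and intercept are the least-squares estimates rather than exact values.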
