Data Preprocessing:-
1. Get Data Set
2. Import important libraries:-
import numpy as np :- for numerical calculations and array manipulation
import matplotlib.pyplot as plt :- for pictorial representation of results
import pandas as pd :- to read and manipulate the data, and for Series operations
3. Import dataset:- (sir’s example)
data.csv/xls
dataset = pd.read_csv('Data.csv')
> create matrix of all independent variables (sir’s example)
x = dataset.iloc[:, :-1].values
> create matrix of dependent variables (sir’s example)
y = dataset.iloc[:, 3].values
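To see what the two iloc selections return, here is a small sketch on a made-up table. The column names and values are invented for illustration and are not the contents of Data.csv.

```python
import pandas as pd

# Hypothetical stand-in for Data.csv: three feature columns and one target column
dataset = pd.DataFrame({
    "Country": ["France", "Spain", "Germany"],
    "Age": [44.0, 27.0, 30.0],
    "Salary": [72000.0, 48000.0, 54000.0],
    "Purchased": ["No", "Yes", "No"],
})

x = dataset.iloc[:, :-1].values   # every row, every column except the last
y = dataset.iloc[:, 3].values     # every row, only the column at index 3

print(x.shape)   # (3, 3)
print(y)         # ['No' 'Yes' 'No']
```

Both results are plain NumPy arrays, which is what the scikit-learn classes in the later steps expect.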
4. Handling missing values
taking care of missing data:-
from sklearn.impute import SimpleImputer
# sklearn is an ML library for many jobs; SimpleImputer fills in the missing values
# (remember the capital S and capital I)
# older scikit-learn releases used sklearn.preprocessing.Imputer(..., axis=0), which has since been removed
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')   # replace each NaN with its column mean
imputer = imputer.fit(x[:, 1:3])
x[:, 1:3] = imputer.transform(x[:, 1:3])
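A minimal sketch of mean imputation on a made-up array follows. It uses SimpleImputer from sklearn.impute, which is the class current scikit-learn releases provide for this job. The numbers are invented for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Made-up numeric columns (age, salary) with one missing value in each
values = np.array([[44.0, 72000.0],
                   [27.0, np.nan],
                   [np.nan, 54000.0],
                   [38.0, 61000.0]])

imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
filled = imputer.fit_transform(values)   # learn the column means, then fill the gaps

print(imputer.statistics_)   # the column means, computed while ignoring NaN
print(filled)                # same array with each NaN replaced by its column mean
```

fit() only computes the means; transform() is the step that actually writes them into the gaps.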
5. Categorical Data:-
Encoding Categorical Data:
# Encoding the independent variable:-
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.compose import ColumnTransformer
# LabelEncoder gives one integer to each distinct category
labelencoder_x = LabelEncoder()
x[:, 0] = labelencoder_x.fit_transform(x[:, 0])   # encodes the first column values as 0, 1, 2, ...
# one-hot encode column 0 and pass the other columns through unchanged
# (the old OneHotEncoder(categorical_features=[0]) argument has been removed from scikit-learn)
ct = ColumnTransformer([('onehot', OneHotEncoder(), [0])], remainder='passthrough', sparse_threshold=0)
x = np.array(ct.fit_transform(x), dtype=float)   # the category becomes columns of 0's and 1's;
# the other values may be displayed in exponential form
x
labelencoder_y = LabelEncoder()   # encoding y
y = labelencoder_y.fit_transform(y)
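The difference between the two encoders is easier to see on a single made-up column. This is only a sketch, and the country values are invented for illustration.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

countries = np.array(["France", "Spain", "Germany", "Spain"])   # made-up category column

# LabelEncoder: one integer per category, assigned in sorted order
labels = LabelEncoder().fit_transform(countries)
print(labels)    # [0 2 1 2]

# OneHotEncoder: one 0/1 column per category (the input must be 2-D)
onehot = OneHotEncoder().fit_transform(countries.reshape(-1, 1)).toarray()
print(onehot)
```

The integer labels imply an order (France < Germany < Spain) that does not really exist. One-hot columns avoid that, which is why they are used for the independent variables.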
6. Splitting Training and Test Data:-
from sklearn.model_selection import train_test_split
note:- model_selection is the scikit-learn module with the tools for splitting the whole data set
into training and testing data (in old versions it was called sklearn.cross_validation);
from it we import the train_test_split function for splitting the data
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)
# splits the whole data set into 80% of the data for training and 20% for testing;
# random_state keeps the train and test data consistent between runs; without it,
# every run takes a different set of values
as a rule of thumb keep the test size in the range of 20-30%: a much smaller test set gives an
unreliable estimate of performance, and a much larger one leaves too little data for training
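A short sketch of what test_size and random_state do, using made-up arrays of 10 samples. It imports from sklearn.model_selection, where current scikit-learn releases keep train_test_split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

x = np.arange(20).reshape(10, 2)   # 10 made-up samples with 2 features each
y = np.arange(10)

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)
print(x_train.shape, x_test.shape)   # (8, 2) (2, 2)

# Same random_state -> exactly the same split again
again = train_test_split(x, y, test_size=0.2, random_state=0)
print(np.array_equal(x_test, again[1]))   # True
```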
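7. Feature scaling:- (used to bring features with very different ranges onto a comparable scale,
so that a feature with large values does not dominate, e.g. when differences are squared in a distance)
(Note:- StandardScaler rescales each column to mean 0 and standard deviation 1; for roughly
bell-shaped data most values fall between about -3 and +3, they are not limited to -1 to +1)
from sklearn.preprocessing import StandardScaler
sc_x = StandardScaler()
x_train = sc_x.fit_transform(x_train) # fit_transform -- use only for training data
x_test = sc_x.transform(x_test) # the test data reuses the mean and standard deviation learned from the training data
## x_train = always the independent variables (features); y_train is the dependent variable
## StandardScaler is a class that scales each column as (value - mean) / standard deviation
## fit() -- learns the parameters (mean and standard deviation) from the training data;
## it only makes the object ready
## transform() -- applies the learned parameters to produce the transformed data set
Note:- fit_transform() is available on transformer objects such as StandardScaler; it does fit() and transform() in one call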
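A small sketch of the fit / transform split on made-up age and salary values. It shows that the scaled training columns end up with mean 0 and standard deviation 1, and that a test value can fall well outside -1 to +1.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

x_train = np.array([[20.0, 20000.0], [30.0, 40000.0], [40.0, 60000.0]])   # made-up age, salary
x_test = np.array([[50.0, 80000.0]])

sc_x = StandardScaler()
x_train_scaled = sc_x.fit_transform(x_train)   # learns mean/std from the training data, then scales it
x_test_scaled = sc_x.transform(x_test)         # reuses the training mean/std, learns nothing new

print(sc_x.mean_)                     # per-column means learned from x_train
print(x_train_scaled.mean(axis=0))    # approximately 0 for each column
print(x_train_scaled.std(axis=0))     # approximately 1 for each column
print(x_test_scaled)                  # about 2.45 in both columns
```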
**k-NN algorithm:-
from sklearn.neighbors import KNeighborsClassifier
classifier = KNeighborsClassifier(n_neighbors=5, metric='minkowski', p=2)
classifier.fit(x_train, y_train)
------- till here the machine is fit with the training data and learns from it ----
## in sklearn, neighbors is a module that contains the KNeighborsClassifier class
## KNeighborsClassifier takes some values:
## n_neighbors = number of neighbours (usually an odd number, so that a two-class vote cannot tie)
## metric = defines the distance measure being used
## p=2 means using Euclidean distance (p=1 would be Manhattan distance)
------ for testing and predicting ------
y_pred = classifier.predict(x_test) ## predicts for the x_test values prepared before
y_pred
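To make n_neighbors and p concrete, here is a hand-written sketch of the k-NN idea in plain NumPy. The function knn_predict and the data points are hypothetical, written only for illustration; this is not how scikit-learn implements the classifier internally.

```python
import numpy as np

# Made-up 2-D training points and their classes
x_train = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.0],
                    [6.0, 6.0], [7.0, 6.5], [6.5, 7.0]])
y_train = np.array([0, 0, 0, 1, 1, 1])

def knn_predict(point, k=3, p=2):
    """Hypothetical helper: classify one point by majority vote of its k nearest neighbours."""
    # Minkowski distance: (sum |a - b|^p)^(1/p); p=2 is the Euclidean distance
    distances = (np.abs(x_train - point) ** p).sum(axis=1) ** (1.0 / p)
    nearest = np.argsort(distances)[:k]       # indices of the k closest training points
    votes = np.bincount(y_train[nearest])     # count the classes among those neighbours
    return int(votes.argmax())                # majority class

print(knn_predict(np.array([1.2, 1.1])))   # 0
print(knn_predict(np.array([6.4, 6.6])))   # 1
```

Because the distance is computed from raw differences, a feature with a much larger range would dominate it. That is the reason the features are scaled before k-NN is fitted.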
**making the confusion matrix----
from sklearn.metrics import confusion_matrix ## confusion_matrix is a function
cm = confusion_matrix(y_test, y_pred)
cm
gives out a confusion matrix; for two classes (0 and 1) the layout is [[TN, FP], [FN, TP]]
(rows = actual class, columns = predicted class)
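The layout of the matrix can be checked on a tiny made-up example. For two classes labelled 0 and 1, scikit-learn puts the actual classes on the rows and the predicted classes on the columns, so flattening the matrix gives TN, FP, FN, TP in that order.

```python
from sklearn.metrics import confusion_matrix

y_test = [0, 0, 0, 1, 1, 1, 1]   # made-up true classes
y_pred = [0, 1, 0, 1, 1, 0, 1]   # made-up predictions

cm = confusion_matrix(y_test, y_pred)
tn, fp, fn, tp = cm.ravel()      # row-major order of [[TN, FP], [FN, TP]]
print(cm)
print(tn, fp, fn, tp)            # 2 1 1 3

accuracy = (tn + tp) / cm.sum()  # share of correct predictions
print(accuracy)
```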
*STEP 8:- Visualizing the Training set results
from matplotlib.colors import ListedColormap
x_set, y_set = x_train, y_train
# grid of points covering both features, with a margin of 1 around the data
x1, x2 = np.meshgrid(np.arange(start=x_set[:, 0].min() - 1, stop=x_set[:, 0].max() + 1, step=0.01),
                     np.arange(start=x_set[:, 1].min() - 1, stop=x_set[:, 1].max() + 1, step=0.01))
# colour each grid point by the class the classifier predicts for it
plt.contourf(x1, x2, classifier.predict(np.array([x1.ravel(), x2.ravel()]).T).reshape(x1.shape),
             alpha=0.75, cmap=ListedColormap(('red', 'green')))
plt.xlim(x1.min(), x1.max())
plt.ylim(x2.min(), x2.max())
# draw the actual data points, one colour per class
for i, j in enumerate(np.unique(y_set)):
    plt.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1], color=ListedColormap(('red', 'green'))(i), label=j)
plt.title('K-NN (Training set)')
plt.xlabel('Age')
plt.ylabel('Estimated Salary')
plt.legend()
plt.show()
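The meshgrid / ravel / reshape chain in the plotting code is easier to follow on a tiny grid. The sketch below uses a step of 1 instead of 0.01, and fake_predict is a hypothetical stand-in for classifier.predict, so no trained model is needed.

```python
import numpy as np

# Coarse grid (step 1 instead of 0.01) so the arrays stay small enough to read
x1, x2 = np.meshgrid(np.arange(0, 3, 1), np.arange(0, 2, 1))
print(x1.shape)            # (2, 3): one row per x2 value, one column per x1 value

# Flatten both grids and pair them up: one (x1, x2) row per grid point
grid_points = np.array([x1.ravel(), x2.ravel()]).T
print(grid_points.shape)   # (6, 2)

def fake_predict(points):
    """Hypothetical stand-in for classifier.predict: class 1 when x1 + x2 >= 2."""
    return (points.sum(axis=1) >= 2).astype(int)

# Reshape the flat predictions back to the grid shape, which is what contourf expects
regions = fake_predict(grid_points).reshape(x1.shape)
print(regions)
```

With step=0.01 the same chain produces a dense grid, so the filled contour looks like continuous red and green regions.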
*STEP 9:- Visualizing the Test set results
from matplotlib.colors import ListedColormap
x_set, y_set = x_test, y_test
x1, x2 = np.meshgrid(np.arange(start=x_set[:, 0].min() - 1, stop=x_set[:, 0].max() + 1, step=0.01),
                     np.arange(start=x_set[:, 1].min() - 1, stop=x_set[:, 1].max() + 1, step=0.01))
plt.contourf(x1, x2, classifier.predict(np.array([x1.ravel(), x2.ravel()]).T).reshape(x1.shape),
             alpha=0.75, cmap=ListedColormap(('red', 'green')))
plt.xlim(x1.min(), x1.max())
plt.ylim(x2.min(), x2.max())
for i, j in enumerate(np.unique(y_set)):
    plt.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1], color=ListedColormap(('red', 'green'))(i), label=j)
plt.title('K-NN (Test set)')
plt.xlabel('Age')
plt.ylabel('Estimated Salary')
plt.legend()
plt.show()
**Decision tree:--
dataset=pd.read_csv('Social_Network_Ads.csv')
x = dataset.iloc[:,[2,3]].values
y = dataset.iloc[:, 4].values
from sklearn.model_selection import train_test_split   # old versions: sklearn.cross_validation
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=0.25,random_state = 0)
from sklearn.preprocessing import StandardScaler
sc_x = StandardScaler()
x_train = sc_x.fit_transform(x_train)
x_test = sc_x.transform(x_test)
from sklearn.tree import DecisionTreeClassifier
classifier = DecisionTreeClassifier(criterion = 'entropy',random_state=0)
classifier.fit(x_train,y_train)
y_pred = classifier.predict(x_test)
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test,y_pred)
cm
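The criterion='entropy' setting tells the tree to pick splits by information gain, i.e. by how much a split lowers the entropy of the class labels. Below is a minimal sketch of that calculation on made-up labels. The entropy function is a hypothetical helper written for illustration, not scikit-learn's internal code.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy in bits: -sum(p * log2(p)) over the class proportions."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

parent = [0, 0, 0, 0, 1, 1, 1, 1]          # made-up labels before a split
left, right = [0, 0, 0, 1], [0, 1, 1, 1]   # made-up labels after a candidate split

# Entropy after the split, weighted by the size of each side
weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(parent)

print(entropy(parent))              # 1.0 (a 50/50 mix is the most uncertain case)
print(entropy(parent) - weighted)   # information gain of this split
```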
training plot:-
from matplotlib.colors import ListedColormap
x_set, y_set = x_train, y_train
x1, x2 = np.meshgrid(np.arange(start=x_set[:, 0].min() - 1, stop=x_set[:, 0].max() + 1, step=0.01),
                     np.arange(start=x_set[:, 1].min() - 1, stop=x_set[:, 1].max() + 1, step=0.01))
plt.contourf(x1, x2, classifier.predict(np.array([x1.ravel(), x2.ravel()]).T).reshape(x1.shape),
             alpha=0.75, cmap=ListedColormap(('red', 'green')))
plt.xlim(x1.min(), x1.max())
plt.ylim(x2.min(), x2.max())
for i, j in enumerate(np.unique(y_set)):
    plt.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1], color=ListedColormap(('red', 'green'))(i), label=j)
plt.title('Decision Tree (Training set)')
plt.xlabel('Age')
plt.ylabel('Estimated Salary')
plt.legend()
plt.show()
Test plot :--
from matplotlib.colors import ListedColormap
x_set, y_set = x_test, y_test
x1, x2 = np.meshgrid(np.arange(start=x_set[:, 0].min() - 1, stop=x_set[:, 0].max() + 1, step=0.01),
                     np.arange(start=x_set[:, 1].min() - 1, stop=x_set[:, 1].max() + 1, step=0.01))
plt.contourf(x1, x2, classifier.predict(np.array([x1.ravel(), x2.ravel()]).T).reshape(x1.shape),
             alpha=0.75, cmap=ListedColormap(('red', 'green')))
plt.xlim(x1.min(), x1.max())
plt.ylim(x2.min(), x2.max())
for i, j in enumerate(np.unique(y_set)):
    plt.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1], color=ListedColormap(('red', 'green'))(i), label=j)
plt.title('Decision Tree (Test set)')
plt.xlabel('Age')
plt.ylabel('Estimated Salary')
plt.legend()
plt.show()
**Naive Bayes:--
dataset=pd.read_csv('Social_Network_Ads.csv')
x = dataset.iloc[:,[2,3]].values
y = dataset.iloc[:, 4].values
from sklearn.model_selection import train_test_split   # old versions: sklearn.cross_validation
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=0.25,random_state = 0)
from sklearn.preprocessing import StandardScaler
sc_x = StandardScaler()
x_train = sc_x.fit_transform(x_train)
x_test = sc_x.transform(x_test)
from sklearn.naive_bayes import GaussianNB
classifier = GaussianNB()
classifier.fit(x_train,y_train)
y_pred = classifier.predict(x_test)
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test,y_pred)
cm
training plot:-
from matplotlib.colors import ListedColormap
x_set, y_set = x_train, y_train
x1, x2 = np.meshgrid(np.arange(start=x_set[:, 0].min() - 1, stop=x_set[:, 0].max() + 1, step=0.01),
                     np.arange(start=x_set[:, 1].min() - 1, stop=x_set[:, 1].max() + 1, step=0.01))
plt.contourf(x1, x2, classifier.predict(np.array([x1.ravel(), x2.ravel()]).T).reshape(x1.shape),
             alpha=0.75, cmap=ListedColormap(('red', 'green')))
plt.xlim(x1.min(), x1.max())
plt.ylim(x2.min(), x2.max())
for i, j in enumerate(np.unique(y_set)):
    plt.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1], color=ListedColormap(('red', 'green'))(i), label=j)
plt.title('Naive Bayes (Training set)')
plt.xlabel('Age')
plt.ylabel('Estimated Salary')
plt.legend()
plt.show()
Test plot :--
from matplotlib.colors import ListedColormap
x_set, y_set = x_test, y_test
x1, x2 = np.meshgrid(np.arange(start=x_set[:, 0].min() - 1, stop=x_set[:, 0].max() + 1, step=0.01),
                     np.arange(start=x_set[:, 1].min() - 1, stop=x_set[:, 1].max() + 1, step=0.01))
plt.contourf(x1, x2, classifier.predict(np.array([x1.ravel(), x2.ravel()]).T).reshape(x1.shape),
             alpha=0.75, cmap=ListedColormap(('red', 'green')))
plt.xlim(x1.min(), x1.max())
plt.ylim(x2.min(), x2.max())
for i, j in enumerate(np.unique(y_set)):
    plt.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1], color=ListedColormap(('red', 'green'))(i), label=j)
plt.title('Naive Bayes (Test set)')
plt.xlabel('Age')
plt.ylabel('Estimated Salary')
plt.legend()
plt.show()
**Random forest:--
dataset=pd.read_csv('Social_Network_Ads.csv')
x = dataset.iloc[:,[2,3]].values
y = dataset.iloc[:, 4].values
from sklearn.model_selection import train_test_split   # old versions: sklearn.cross_validation
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=0.25,random_state = 0)
from sklearn.preprocessing import StandardScaler
sc_x = StandardScaler()
x_train = sc_x.fit_transform(x_train)
x_test = sc_x.transform(x_test)
from sklearn.ensemble import RandomForestClassifier
classifier = RandomForestClassifier(n_estimators=10, criterion='entropy', random_state=0)
classifier.fit(x_train,y_train)
y_pred = classifier.predict(x_test)
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test,y_pred)
cm
training plot:-
from matplotlib.colors import ListedColormap
x_set, y_set = x_train, y_train
x1, x2 = np.meshgrid(np.arange(start=x_set[:, 0].min() - 1, stop=x_set[:, 0].max() + 1, step=0.01),
                     np.arange(start=x_set[:, 1].min() - 1, stop=x_set[:, 1].max() + 1, step=0.01))
plt.contourf(x1, x2, classifier.predict(np.array([x1.ravel(), x2.ravel()]).T).reshape(x1.shape),
             alpha=0.75, cmap=ListedColormap(('red', 'green')))
plt.xlim(x1.min(), x1.max())
plt.ylim(x2.min(), x2.max())
for i, j in enumerate(np.unique(y_set)):
    plt.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1], color=ListedColormap(('red', 'green'))(i), label=j)
plt.title('Random Forest (Training set)')
plt.xlabel('Age')
plt.ylabel('Estimated Salary')
plt.legend()
plt.show()
Test plot :--
from matplotlib.colors import ListedColormap
x_set, y_set = x_test, y_test
x1, x2 = np.meshgrid(np.arange(start=x_set[:, 0].min() - 1, stop=x_set[:, 0].max() + 1, step=0.01),
                     np.arange(start=x_set[:, 1].min() - 1, stop=x_set[:, 1].max() + 1, step=0.01))
plt.contourf(x1, x2, classifier.predict(np.array([x1.ravel(), x2.ravel()]).T).reshape(x1.shape),
             alpha=0.75, cmap=ListedColormap(('red', 'green')))
plt.xlim(x1.min(), x1.max())
plt.ylim(x2.min(), x2.max())
for i, j in enumerate(np.unique(y_set)):
    plt.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1], color=ListedColormap(('red', 'green'))(i), label=j)
plt.title('Random Forest (Test set)')
plt.xlabel('Age')
plt.ylabel('Estimated Salary')
plt.legend()
plt.show()
Linear Regression:-
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
dataset=pd.read_csv('Salary_Data.csv')
x = dataset.iloc[:,:-1].values
y = dataset.iloc[:,1].values
from sklearn.model_selection import train_test_split   # old versions: sklearn.cross_validation
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=0.25,random_state = 0)
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(x_train,y_train)
y_pred = regressor.predict(x_test)
plt.scatter(x_train, y_train, color='red', label='Training data')
plt.plot(x_train, regressor.predict(x_train), color='blue', label='Regression line')
plt.title('Salary vs Experience (Training set)')
plt.xlabel('Years of Experience')
plt.ylabel('Salary')
plt.legend()
plt.show()
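What fit() learns for a simple linear regression is just a slope and an intercept. The sketch below shows this on made-up data generated from a known line (salary = 25000 + 9000 * years), so the fitted values can be compared with the true ones. The numbers are invented and do not come from Salary_Data.csv.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up data that follows salary = 25000 + 9000 * years exactly
years = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])   # 2-D: one feature column
salary = 25000.0 + 9000.0 * years.ravel()

regressor = LinearRegression()
regressor.fit(years, salary)

print(regressor.coef_[0], regressor.intercept_)   # fitted slope and intercept
prediction = regressor.predict(np.array([[6.0]]))
print(prediction)                                 # predicted salary for 6 years of experience
```

The blue line in the plot is the same thing drawn out: regressor.predict(x_train) evaluates intercept + slope * x for every training point.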