Chandigarh Group of Colleges
College of Engineering, Landran, Mohali
Machine Learning
Lab File (BTCS619-18)
1. Binning Method:
This method works on sorted data in order to smooth it. The whole data is
divided into segments of equal size, and each segment is handled separately.
All values in a segment can be replaced by the segment mean, or the segment
boundary values can be used; see the sketch after this list.
2. Regression:
Here data can be smoothed by fitting it to a regression function. The
regression used may be linear (having one independent variable) or
multiple (having multiple independent variables).
3. Clustering:
This approach groups similar data into clusters. Outliers either go
undetected or fall outside the clusters.
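A minimal sketch of the binning method (smoothing by bin means), assuming a
small sorted sample; the values below are illustrative, not from the lab dataset:
import numpy as np
# Sorted sample values, split into four equal-size segments
values = np.sort(np.array([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]))
bins = np.array_split(values, 4)
# Smoothing by bin means: every value in a segment is replaced by the segment mean
smoothed = np.concatenate([np.full(len(b), b.mean()) for b in bins])
print(smoothed)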
2. Data Transformation:
This step is taken in order to transform the data into forms suitable for the
mining process. It involves the following methods:
1. Normalization:
It is done in order to scale the data values into a specified range (-1.0 to 1.0
or 0.0 to 1.0); see the sketch after this list.
2. Attribute Selection:
In this strategy, new attributes are constructed from the given set of
attributes to help the mining process.
3. Discretization:
This is done to replace the raw values of a numeric attribute by interval
levels or conceptual levels; see the sketch after this list.
4. Concept Hierarchy Generation:
Here attributes are converted from a lower level to a higher level in the
hierarchy. For example, the attribute “city” can be converted to “country”.
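A minimal sketch of two of the transformations above, normalization and
discretization (the sample values, bins, and labels are illustrative assumptions):
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
# Normalization: scale a toy column of values into the range 0.0 to 1.0
x = np.array([[200.0], [300.0], [400.0], [600.0], [1000.0]])
print(MinMaxScaler(feature_range=(0, 1)).fit_transform(x).ravel())
# Discretization: replace raw ages by interval (conceptual) levels
ages = pd.Series([5, 17, 23, 41, 58, 72])
print(pd.cut(ages, bins=[0, 18, 40, 60, 100], labels=["child", "young", "middle-aged", "senior"]))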
3. Data Reduction:
Data mining is a technique used to handle huge amounts of data, and analysis
becomes harder when working with such volumes. To deal with this, data
reduction techniques are used. They aim to increase storage efficiency and
reduce data storage and analysis costs.
The various steps of data reduction are:
1. Data Cube Aggregation:
Aggregation operation is applied to data for the construction of the data
cube.
2. Attribute Subset Selection:
Only the highly relevant attributes should be used; the rest can be discarded.
To perform attribute selection, one can use the level of significance and the
p-value of the attribute: an attribute whose p-value is greater than the
significance level can be discarded, as in the sketch after this list.
3. Numerosity Reduction:
This enables storing a model of the data instead of the whole data, for
example: regression models.
4. Dimensionality Reduction:
This reduces the size of the data by encoding mechanisms. It can be lossy or
lossless. If the original data can be retrieved after reconstruction from the
compressed data, the reduction is called lossless; otherwise it is called lossy.
The two effective methods of dimensionality reduction are wavelet transforms
and PCA (Principal Component Analysis), sketched below.
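A minimal sketch of p-value-based attribute subset selection, assuming the
built-in iris data as a stand-in dataset and a 0.05 significance level (both
are assumptions):
from sklearn.datasets import load_iris
from sklearn.feature_selection import f_classif
# The ANOVA F-test gives one p-value per attribute; keep attributes with p-value <= 0.05
iris = load_iris()
_, p_values = f_classif(iris.data, iris.target)
print([name for name, p in zip(iris.feature_names, p_values) if p <= 0.05])
A minimal PCA sketch for dimensionality reduction, again using iris as a
convenient example (the choice of two components is an assumption):
from sklearn.decomposition import PCA
# Project the 4-feature iris data onto its 2 principal components
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(iris.data)
print(X_reduced.shape)                     # (150, 2)
print(pca.explained_variance_ratio_)       # variance retained per component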
Program :
import pandas as pd
import numpy as np
data = pd.read_csv("data1.csv")
print("data\n")
print(data.head())
indi = []
for index, item in data.iterrows():
    if(str(item["RM"]) == 'nan'):
        indi.append(index)
print("Index where data is null\n")
print(indi)
# using function
print("Using function")
null_data = pd.isnull(data["RM"])
print(null_data)
print(data[null_data])
# Train-Test Splitting
# using function
print("Using Function")
from sklearn.model_selection import train_test_split
train_set_using_function, test_set_using_function = train_test_split(data, test_size=0.2, random_state=42)
print(f"Rows in train set: {len(train_set_using_function)}\nRows in test set: {len(test_set_using_function)}\n")
Output :
EXPERIMENT-2
AIM: Implement Simple Linear Regression.
Simple Linear Regression is a type of regression algorithm that models the
relationship between a dependent variable and a single independent variable.
The relationship shown by a Simple Linear Regression model is linear (a
sloped straight line), hence it is called Simple Linear Regression.
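The model can be written as:
Y = β0 + β1X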
Here,
• Y is a dependent variable.
• X is an independent variable.
• β0 and β1 are the regression coefficients.
• β0 is the intercept or the bias that fixes the offset to a line.
• β1 is the slope or weight that specifies the factor by which X has an
impact on Y.
Case-01: β1 < 0 (Y decreases as X increases).
Case-02: β1 = 0 (Y does not depend on X).
Case-03: β1 > 0 (Y increases as X increases).
Program :
import pandas
from pandas import DataFrame
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from scipy.stats import pearsonr
data = pandas.read_csv('Linear_regression_basic.csv')
print("Dataset : \n",data)
print(data.describe())
x = DataFrame(data,columns=['x'])
y = DataFrame(data,columns=['y'])
# plt.figure(figsize=(10,10))
plt.title('LINEAR REGRESSION')
plt.xlabel('X axis') #label of X axis
plt.ylabel('Y axis') #label of Y axis
plt.ylim(0,7) #for Y axis limit
plt.xlim(0,8) #for X axis limit
plt.grid() #for grid
# plt.scatter(x,y,alpha=(0.7)) #Visibility of point
plt.scatter(x,y,color='green',s=50) #s for size
plt.show()
print("Data is x : \n",data['x'])
print("Data in y : \n",data['y'])
# Correlation Coefficient
corr,_ = pearsonr(data['x'],data['y'])
print("Correlation Coefficient : ",corr)
regression = LinearRegression()
regression.fit(x,y)
print("Regression Cofficient : ",regression.coef_)
# Intercept
print("Regression Intercept : ",regression.intercept_)
plt.figure(figsize=(10,6))
plt.title('LINEAR REGRESSION')
plt.xlabel('x --->')
plt.ylabel('y --->')
plt.ylim(0,15)
plt.xlim(0,15)
plt.scatter(x,y,alpha=(0.5))
plt.plot(x,regression.predict(x),color = "red",linewidth=2)
plt.show()
print("Accuracy score : ",regression.score(x,y))
print(data)
X = DataFrame(data,columns=['x'])
Y = DataFrame(data,columns=['y'])
plt.figure(figsize=(10,6))
plt.title('LINEAR REGRESSION NEW')
plt.xlabel('x --->')
plt.ylabel('y --->')
plt.ylim(0,15)
plt.xlim(0,15)
plt.scatter(X,Y)
plt.plot(X,regression.predict(X),color = "red",linewidth=2)
plt.show()
Output :
EXPERIMENT-3
AIM: Simulate Multiple Linear Regression.
In multiple linear regression, the dependent variable depends on more than one
independent variable.
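The model can be written as:
Y = β0 + β1X1 + β2X2 + … + βnXn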
Here,
• Y is a dependent variable.
• X1, X2, …., Xn are independent variables.
• β0, β1,…, βn are the regression coefficients.
βj (1<=j<=n) is the slope or weight that specifies the factor by
which Xj has an impact on Y.
Program :
import pandas as pd
import numpy as np
from sklearn import linear_model
data = pd.read_csv("mlr.csv")
print("\n Dataset\n")
print(data.head())
# Train-Test Splitting
from sklearn.model_selection import train_test_split
train_set, test_set = train_test_split(data, test_size=0.2, random_state=42)
reg = linear_model.LinearRegression()
reg.fit(train_set[["area","bedrooms","age"]], train_set["price"])
print("Coefficients : ", reg.coef_)
print("Intercept : ", reg.intercept_)
Output:
EXPERIMENT-4
AIM: Implement Decision Trees .
o Step-1: Begin the tree with the root node, say S, which contains the
complete dataset.
o Step-2: Find the best attribute in the dataset using an Attribute Selection
Measure (ASM).
o Step-3: Divide S into subsets that contain possible values for the best
attribute.
o Step-4: Generate the decision tree node, which contains the best attribute.
o Step-5: Recursively make new decision trees using the subsets of the
dataset created in Step-3. Continue this process until a stage is reached
where you cannot further classify the nodes; the final node is called a
leaf node.
Implementation:
import pandas as pd
df = pd.read_csv("titanic.csv")
print(df.head(5))
print("Shape : ",df.shape)
data = pd.DataFrame(df,columns=["Pclass","Sex","Age","Fare","Survived"])
print(data.head())
input = data.drop('Survived',axis=1)
input.Sex = input.Sex.map({'male':1,'female':2})
target = data.Survived
print("Input Data set : \n")
print(input)
print("Target Data set : \n")
print(target)
print(input.info())
input.Age = input.Age.fillna(input.Age.mean())
# Split into train and test sets before fitting the decision tree
# (the 80/20 split and random_state are illustrative choices)
from sklearn.model_selection import train_test_split
from sklearn import tree
X_train, X_test, y_train, y_test = train_test_split(input, target, test_size=0.2, random_state=42)
model = tree.DecisionTreeClassifier()
model.fit(X_train,y_train)
print("predicted values on test set : \n")
print(model.predict(X_test))
print("Score : ",model.score(X_test,y_test))
print("Confusion Matrix : ")
y_predicted = model.predict(X_test)
cm = confusion_matrix(y_test, y_predicted)
print(cm)
import seaborn as sn
import matplotlib.pyplot as plt
plt.figure(figsize = (10,7))
sn.heatmap(cm, annot=True)
plt.xlabel('Predicted')
plt.ylabel('Truth')
Dataset :
EXPERIMENT-5
AIM: Implement Random Forest algorithm.
A greater number of trees in the forest leads to higher accuracy and
prevents the problem of overfitting.
The working process can be explained in the steps below:
Step-1: Select random K data points from the training set.
Step-2: Build the decision trees associated with the selected data points
(subsets).
Step-3: Choose the number N of decision trees that you want to build.
Step-4: Repeat Step-1 and Step-2 to obtain N trees.
Step-5: For new data points, find the predictions of each decision tree, and
assign the new data points to the category that wins the majority of votes.
Python Code:
# Random Forest Classification
# X (features) and y (labels) are assumed to have been loaded from the lab dataset.
# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 0)
# Feature Scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
# Fitting the Random Forest classifier and evaluating it with a confusion matrix
# (n_estimators and criterion values are illustrative choices)
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
classifier = RandomForestClassifier(n_estimators = 10, criterion = 'entropy', random_state = 0)
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)
print(confusion_matrix(y_test, y_pred))
Output:
[[63 5]
[ 4 28]]
EXPERIMENT-6
AIM: Simulate Naïve Bayes algorithm.
The Naïve Bayes algorithm comprises two words, Naïve and Bayes, which can be
described as follows:
Bayes' Theorem:
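P(A|B) = [P(B|A) · P(A)] / P(B)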
Where,
P(A|B) is the Posterior probability: the probability of hypothesis A given the
observed event B.
P(B|A) is the Likelihood probability: the probability of the evidence given that
hypothesis A is true.
P(A) is the Prior probability: the probability of the hypothesis before observing
the evidence.
P(B) is the Marginal probability: the probability of the evidence.
The working of the Naïve Bayes classifier can be understood with the help of
the program below.
Program :
import numpy as np
# Import LabelEncoder
from sklearn import preprocessing
#Import Gaussian Naive Bayes model
from sklearn.naive_bayes import GaussianNB
# Assumed sample of the weather/play dataset (illustrative values, not the lab's exact file)
wheather = ['Sunny','Sunny','Overcast','Rainy','Rainy','Rainy','Overcast','Sunny','Sunny','Rainy','Sunny','Overcast','Overcast','Rainy']
temp = ['Hot','Hot','Hot','Mild','Cool','Cool','Cool','Mild','Cool','Mild','Mild','Mild','Hot','Mild']
humidity = ['High','High','High','High','Normal','Normal','Normal','High','Normal','Normal','Normal','High','Normal','High']
windy = ['Weak','Strong','Weak','Weak','Weak','Strong','Strong','Weak','Weak','Weak','Strong','Strong','Weak','Strong']
play = ['No','No','Yes','Yes','Yes','No','Yes','No','Yes','Yes','Yes','Yes','Yes','No']
# Encode the string labels into numbers
le = preprocessing.LabelEncoder()
wheather_encoded=np.array(le.fit_transform(wheather))
temp_encoded=np.array(le.fit_transform(temp))
humidity_encoded=np.array(le.fit_transform(humidity))
windy_encoded=np.array(le.fit_transform(windy))
label=le.fit_transform(play)
temp=np.array(temp)
label=np.array(label)
for r in range(0,len(wheather)):
    print ("%10s\t%d\t%10s\t%d\t%10s\t%d\t%10s\t%d\t%10s\t%d"%(wheather[r],wheather_encoded[r],temp[r],temp_encoded[r],humidity[r],humidity_encoded[r],windy[r],windy_encoded[r],play[r],label[r]))
# Combine the encoded features and train the Gaussian Naive Bayes model
features = list(zip(wheather_encoded, temp_encoded, humidity_encoded, windy_encoded))
model = GaussianNB()
model.fit(features, label)
#Predict Output
predicted= model.predict([[2,1,0,0]]) # 0:Overcast, 2:Mild
if(predicted==0): print ("Predicted Value:", predicted,"\tPlay: NO")
if(predicted==1): print ("Predicted Value:", predicted,"\tPlay: YES")
Output:
EXPERIMENT-7
AIM: Implement K-Nearest Neighbors (K-NN), k-means.
K-NN Algorithm:
o K-Nearest Neighbour is one of the simplest Machine Learning algorithms
based on Supervised Learning technique.
o K-NN algorithm assumes the similarity between the new case/data and the
available cases and puts the new case into the category that is most similar
to the available categories.
o K-NN algorithm stores all the available data and classifies a new data
point based on similarity. This means that when new data appears, it
can be easily classified into a well-suited category by using the K-NN
algorithm.
o K-NN algorithm can be used for Regression as well as for Classification
but mostly it is used for the Classification problems.
o K-NN is a non-parametric algorithm, which means it does not make
any assumption on underlying data.
o It is also called a lazy learner algorithm because it does not learn from
the training set immediately; instead it stores the dataset and, at the time of
classification, performs an action on the dataset.
o At the training phase, the KNN algorithm just stores the dataset, and when it
gets new data, it classifies that data into the category that is most
similar to the new data.
K-NN Algorithm:
import pandas as pd
from sklearn.datasets import load_iris
iris = load_iris()
print("IRIS\nFeature Names:\n",iris.feature_names)
print("\nTarget Names:\n",iris.target_names)
df = pd.DataFrame(iris.data,columns=iris.feature_names)
df['target'] = iris.target
print("\nDATASET:\n",df.head())
print("Shape: ",df.shape)
# Fit a K-NN classifier on the iris features (the split and n_neighbors=3 are illustrative choices)
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
X_train, X_test, y_train, y_test = train_test_split(df[iris.feature_names], df['target'], test_size=0.2, random_state=42)
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)
print("K-NN Score: ", knn.score(X_test, y_test))
k-Means Clustering:
It allows us to cluster the data into different groups and is a convenient way
to discover the categories of groups in an unlabeled dataset on its own, without
the need for any training.
The algorithm takes the unlabeled dataset as input, divides the dataset into k
clusters, and repeats the process until it finds the best clusters. The value of
k should be predetermined in this algorithm.
k-Means ALGORITHM:
Step-1: Select the number K to decide the number of clusters.
Step-2: Select K random points or centroids (they can be other than points
from the input dataset).
Step-3: Assign each data point to its closest centroid, which will form the
predefined K clusters.
Step-4: Calculate the variance and place a new centroid for each cluster.
Step-5: Repeat the third step, i.e. reassign each data point to the new
closest centroid of each cluster, until the assignments no longer change.
Program:
from sklearn.cluster import KMeans
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
from matplotlib import pyplot as plt
df = pd.read_csv("income.csv")
print("DATAFRAME:\n",df.head())
plt.scatter(df.Age,df['Income($)'])
plt.xlabel('Age')
plt.ylabel('Income($)')
plt.show()
km = KMeans(n_clusters=3)
y_predicted = km.fit_predict(df[['Age','Income($)']])
print("Y predicted:",y_predicted)
df['cluster']=y_predicted
print("New dataframe:\n",df.head())
print("\nCluster Centers : ",km.cluster_centers_)
df1 = df[df.cluster==0]
df2 = df[df.cluster==1]
df3 = df[df.cluster==2]
plt.scatter(df1.Age,df1['Income($)'],color='green')
plt.scatter(df2.Age,df2['Income($)'],color='red')
plt.scatter(df3.Age,df3['Income($)'],color='black')
plt.scatter(km.cluster_centers_[:,0],km.cluster_centers_[:,1],color='purple',marker='*',label='centroid')
plt.xlabel('Age')
plt.ylabel('Income ($)')
plt.legend()
plt.show()
# Scale Age and Income to the 0-1 range before re-clustering
scaler = MinMaxScaler()
scaler.fit(df[['Income($)']])
df['Income($)'] = scaler.transform(df[['Income($)']])
scaler.fit(df[['Age']])
df['Age'] = scaler.transform(df[['Age']])
print("\nDataframe :\n",df.head())
plt.scatter(df.Age,df['Income($)'])
plt.show()
km = KMeans(n_clusters=3)
y_predicted = km.fit_predict(df[['Age','Income($)']])
print("\nY predicted : ",y_predicted)
df['cluster']=y_predicted
print("\nDataset\n",df.head())
df1 = df[df.cluster==0]
df2 = df[df.cluster==1]
df3 = df[df.cluster==2]
plt.scatter(df1.Age,df1['Income($)'],color='green')
plt.scatter(df2.Age,df2['Income($)'],color='red')
plt.scatter(df3.Age,df3['Income($)'],color='black')
plt.scatter(km.cluster_centers_[:,0],km.cluster_centers_[:,1],color='purple',marker='*',label='centroid')
plt.legend()
plt.show()
Output:
EXPERIMENT-8
AIM: Deploy Support Vector Machine, Apriori algorithm.
Support Vector Machine
The goal of the SVM algorithm is to create the best line or decision boundary
that can segregate n-dimensional space into classes so that we can easily put the
new data point in the correct category in the future. This best decision boundary
is called a hyperplane.
SVM chooses the extreme points/vectors that help in creating the hyperplane.
These extreme cases are called support vectors, and hence the algorithm is
termed a Support Vector Machine.
Program:
import pandas as pd
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
digits = load_digits()
# Split the digits data into train and test sets before fitting the SVM models
# (the 80/20 split and random_state are illustrative choices)
X_train, X_test, y_train, y_test = train_test_split(digits.data, digits.target, test_size=0.2, random_state=42)
print("\nRBF model\n")
from sklearn.svm import SVC
rbf_model = SVC(kernel='rbf')
rbf_model.fit(X_train, y_train)
print("RBF model Score",rbf_model.score(X_test,y_test))
print("\nUsing Linear kernel\n")
linear_model = SVC(kernel='linear')
linear_model.fit(X_train,y_train)
print("linear model Score",linear_model.score(X_test,y_test))
Output:
APRIORI ALGORITHM:
The Apriori algorithm uses frequent itemsets to generate association rules, and
it is designed to work on databases that contain transactions. With the help
of these association rules, it determines how strongly or how weakly two objects
are connected. This algorithm uses a breadth-first search and a hash tree to
calculate the itemset associations efficiently. It is an iterative process for
finding the frequent itemsets in a large dataset.
This algorithm was given by R. Agrawal and Srikant in the year 1994. It is
mainly used for market basket analysis and helps to find products that can
be bought together. It can also be used in the healthcare field to find drug
reactions for patients.
Program:
import numpy as np
import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules
data = pd.read_excel('Online_Retail.xlsx')
print("DataSet:",data.head())
print("Data Columns : ",data.columns)
print("Data Shape : ",data.shape)
def hot_encode(x):
    if(x <= 0):
        return 0
    if(x >= 1):
        return 1
# Build a one-hot basket of transactions from France
# (the column names follow the standard Online Retail dataset layout)
basket_France = (data[data['Country'] == "France"]
                 .groupby(['InvoiceNo', 'Description'])['Quantity']
                 .sum().unstack().reset_index().fillna(0)
                 .set_index('InvoiceNo'))
basket_encoded = basket_France.applymap(hot_encode)
basket_France = basket_encoded
# Mine frequent itemsets and derive association rules
# (min_support and min_threshold values are illustrative choices)
frq_items = apriori(basket_France, min_support=0.05, use_colnames=True)
rules = association_rules(frq_items, metric="lift", min_threshold=1)
rules = rules.sort_values(['confidence', 'lift'], ascending=[False, False])
print(rules.head())
Output:
From the above output, it can be seen that paper cups and plates are
bought together in France.
EXPERIMENT-9
AIM: Implement a simple Neural Network.
The network applies an activation function at every hidden layer. In this
simple neural network Python tutorial, we'll employ the Sigmoid activation
function.
There are several types of neural networks. In this project, we are going to
create a feed-forward (perceptron) neural network. This type of ANN relays data
directly from the front to the back.
Python Code:
from numpy import exp, array, random, dot
class NeuralNet(object):
    def __init__(self):
        # Seed the random number generator for reproducible results
        random.seed(1)
        # Weights for a single neuron (the 3-input shape is assumed for this sketch)
        self.synaptic_weights = 2 * random.random((3, 1)) - 1
    # Sigmoid activation function and its derivative (used to adjust the weights)
    def __sigmoid(self, x):
        return 1 / (1 + exp(-x))
    def __sigmoid_derivative(self, x):
        return x * (1 - x)
    # Train the neural network and adjust the weights each time.
    def train(self, inputs, outputs, training_iterations):
        for iteration in range(training_iterations):
            # Pass the training set through the network.
            output = self.learn(inputs)
            # Adjust the weights by the error scaled by the sigmoid gradient
            error = outputs - output
            self.synaptic_weights += dot(inputs.T, error * self.__sigmoid_derivative(output))
    def learn(self, inputs):
        return self.__sigmoid(dot(inputs, self.synaptic_weights))
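A short usage sketch (the tiny training set below is an assumed toy example,
not part of the lab data):
# Train the network on four 3-bit examples and query it on an unseen input
neural_network = NeuralNet()
inputs = array([[0, 0, 1], [1, 1, 1], [1, 0, 1], [0, 1, 1]])
outputs = array([[0, 1, 1, 0]]).T
neural_network.train(inputs, outputs, 10000)
print("Prediction for [1, 0, 0]:", neural_network.learn(array([1, 0, 0])))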
Output:
EXPERIMENT-10
AIM: Implement the Genetic Algorithm Code.
Genetic Algorithm (GA) is a search-based optimization technique based on the
principles of Genetics and Natural Selection. It is frequently used to find
optimal or near-optimal solutions to difficult problems which otherwise would
take a lifetime to solve. It is frequently used to solve optimization problems, in
research, and in machine learning.
WORKING OF GENETIC ALGORITHM:
1. Initial Population– Initialize the population randomly based on the data.
2. Fitness function– Find the fitness value of each of the chromosomes (a
chromosome is a set of parameters which define a proposed solution to the
problem that the genetic algorithm is trying to solve).
3. Selection– Select the best-fitted chromosomes as parents to pass their genes
on to the next generation and create a new population.
4. Cross-over– Create a new set of chromosomes by combining the parents and
add them to the new population set.
5. Mutation– Perform mutation, which alters one or more gene values in a
chromosome in the newly generated population set. Mutation helps to maintain
diversity. The obtained population will be used in the next generation.
Repeat steps 2-5 for each generation.
Python Code:
import numpy as np
import pandas as pd
import random
import matplotlib.pyplot
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
#import the breast cancer dataset
from sklearn.datasets import load_breast_cancer
cancer=load_breast_cancer()
df = pd.DataFrame(cancer['data'],columns=cancer['feature_names'])
label=cancer["target"]
#splitting the model into training and testing set
X_train, X_test, y_train, y_test = train_test_split(df,
label, test_size=0.30,
random_state=101)
#training a logistics regression model
logmodel = LogisticRegression()
logmodel.fit(X_train,y_train)
predictions = logmodel.predict(X_test)
print("Accuracy = "+ str(accuracy_score(y_test,predictions)))
#defining various steps required for the genetic algorithm
# Create a random initial population of boolean chromosomes (one flag per feature)
def initilization_of_population(size, n_feat):
    population = []
    for i in range(size):
        chromosome = np.ones(n_feat, dtype=bool)
        chromosome[:int(0.3*n_feat)] = False
        np.random.shuffle(chromosome)
        population.append(chromosome)
    return population
# Fitness of a chromosome = accuracy of the model trained on the selected features
def fitness_score(population):
    scores = []
    for chromosome in population:
        logmodel.fit(X_train.iloc[:, chromosome], y_train)
        predictions = logmodel.predict(X_test.iloc[:, chromosome])
        scores.append(accuracy_score(y_test, predictions))
    scores, population = np.array(scores), np.array(population)
    inds = np.argsort(scores)
    return list(scores[inds][::-1]), list(population[inds, :][::-1])
# Keep the best-scoring chromosomes as parents for the next generation
def selection(pop_after_fit, n_parents):
    population_nextgen = []
    for i in range(n_parents):
        population_nextgen.append(pop_after_fit[i])
    return population_nextgen
# Combine genes of neighbouring parents to create new chromosomes
def crossover(pop_after_sel):
    population_nextgen = pop_after_sel
    for i in range(len(pop_after_sel)):
        child = pop_after_sel[i]
        child[3:7] = pop_after_sel[(i+1) % len(pop_after_sel)][3:7]
        population_nextgen.append(child)
    return population_nextgen
# Randomly flip genes with the given mutation rate
def mutation(pop_after_cross, mutation_rate):
    population_nextgen = []
    for i in range(0, len(pop_after_cross)):
        chromosome = pop_after_cross[i]
        for j in range(len(chromosome)):
            if random.random() < mutation_rate:
                chromosome[j] = not chromosome[j]
        population_nextgen.append(chromosome)
    #print(population_nextgen)
    return population_nextgen
# Run the genetic algorithm for n_gen generations and track the best chromosome/score
def generations(size, n_feat, n_parents, mutation_rate, n_gen, X_train,
                X_test, y_train, y_test):
    best_chromo = []
    best_score = []
    population_nextgen = initilization_of_population(size, n_feat)
    for i in range(n_gen):
        scores, pop_after_fit = fitness_score(population_nextgen)
        print(scores[:2])
        pop_after_sel = selection(pop_after_fit, n_parents)
        pop_after_cross = crossover(pop_after_sel)
        population_nextgen = mutation(pop_after_cross, mutation_rate)
        best_chromo.append(pop_after_fit[0])
        best_score.append(scores[0])
    return best_chromo, best_score
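A short usage sketch of the routines above (the population size, number of
parents, mutation rate, and number of generations are assumed illustrative
values; n_feat=30 matches the breast-cancer feature count):
# Evolve feature subsets and refit the logistic regression on the best one found
chromo, score = generations(size=200, n_feat=30, n_parents=100, mutation_rate=0.10,
                            n_gen=5, X_train=X_train, X_test=X_test,
                            y_train=y_train, y_test=y_test)
logmodel.fit(X_train.iloc[:, chromo[-1]], y_train)
predictions = logmodel.predict(X_test.iloc[:, chromo[-1]])
print("Accuracy after genetic feature selection = " + str(accuracy_score(y_test, predictions)))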
Output: