Jntuk R20 ML
Experiment-1:
Implement and demonstrate the FIND-S algorithm for finding the
most specific hypothesis based on a given set of training data
samples. Read the training data from a .CSV file.
Experiment-2:
For a given set of training data examples stored in a .CSV file,
implement and demonstrate the Candidate-Elimination algorithm to
output a description of the set of all hypotheses consistent with
the training examples.
Experiment-3:
Write a program to demonstrate the working of the decision tree
based ID3 algorithm. Use an appropriate data set for building the
decision tree and apply this knowledge to classify a new sample.
Experiment-4:
Exercises to solve the real-world problems using the following
machine learning methods: a) Linear Regression
b) Logistic Regression
c) Binary Classifier
Experiment-5:
Develop a program for Bias, Variance, Remove duplicates, Cross
Validation.
Experiment-6:
Write a program to implement Categorical Encoding, One-hot
Encoding
Experiment-7:
Build an Artificial Neural Network by implementing the Back
propagation algorithm and test the same using appropriate data
sets.
Experiment-8:
Write a program to implement k-Nearest Neighbor algorithm to
classify the iris data set. Print both correct and wrong
predictions.
Experiment-9:
Implement the non-parametric Locally Weighted Regression
algorithm in order to fit data points. Select an appropriate data
set for your experiment and draw graphs.
Experiment-10:
Assuming a set of documents that need to be classified, use the
naïve Bayesian Classifier model to perform this task. Built-in
Java classes/API can be used to write the program. Calculate the
accuracy, precision, and recall for your data set.
Experiment-11:
Apply EM algorithm to cluster a Heart Disease Data Set. Use the
same data set for clustering using k-Means algorithm. Compare the
results of these two algorithms and comment on the quality of
clustering. You can add Java/Python ML library classes/API in the
program.
Experiment-12:
Exploratory Data Analysis for Classification using Pandas or
Matplotlib.
Experiment-13:
Write a Python program to construct a Bayesian network considering
medical data. Use this model to demonstrate the diagnosis of heart
patients using the standard Heart Disease Data Set.
Experiment-14:
Write a program to implement Support Vector Machines and Principal
Component Analysis.
Experiment-15:
Write a program to implement Principal Component Analysis.
EXPERIMENT-1
Implement and demonstrate the FIND-S algorithm for finding the
most specific hypothesis based on a given set of training data
samples. Read the training data from a .CSV file.
DATASET:-
# initialise the hypothesis with the attribute values of the first training row
for j in range(0, num_attributes):
    hypothesis[j] = a[0][j]
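Only a fragment of the FIND-S program survives in this listing. The following is a minimal self-contained sketch of the algorithm, not the original program. It assumes an EnjoySport-style table whose last column is the label; the column names, the rows, and the `find_s` helper are all hypothetical, and the CSV text is read from an in-memory string so the example runs without a file. Unlike the fragment, it starts from the first positive row rather than from row 0.

```python
import csv
import io

# Hypothetical training data; the last column is the class label.
CSV_TEXT = """sky,airtemp,humidity,wind,water,forecast,enjoysport
sunny,warm,normal,strong,warm,same,yes
sunny,warm,high,strong,warm,same,yes
rainy,cold,high,strong,warm,change,no
sunny,warm,high,strong,cool,change,yes
"""

def find_s(rows, positive_label="yes"):
    """Return the most specific hypothesis that covers every positive row."""
    hypothesis = None
    for *attributes, label in rows:
        if label != positive_label:
            continue                       # FIND-S ignores negative examples
        if hypothesis is None:
            hypothesis = list(attributes)  # start from the first positive example
            continue
        for j, value in enumerate(attributes):
            if hypothesis[j] != value:
                hypothesis[j] = "?"        # generalise an attribute that disagrees
    return hypothesis

reader = csv.reader(io.StringIO(CSV_TEXT))
header = next(reader)                      # skip the column names
rows = [row for row in reader if row]
hypothesis = find_s(rows)
print("Most specific hypothesis:", hypothesis)
```

To read a real file instead, replace `io.StringIO(CSV_TEXT)` with an opened file handle.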
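Experiment-3:
Write a program to demonstrate the working of the decision tree
based ID3 algorithm. Use an appropriate data set for building the
decision tree and apply this knowledge to classify a new sample.
DATASET:- data3.csv
PROGRAM:-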
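import csv
import math

# NOTE: load_csv, Node and subtables are called by this program but their
# definitions are missing from the listing. The three versions below are
# reconstructed assumptions that match how they are called.
def load_csv(filename):
    # return (rows, header); every row is a list of strings
    with open(filename, "r") as f:
        rows = [row for row in csv.reader(f) if row]
    headers = rows.pop(0)
    return rows, headers

class Node:
    def __init__(self, attribute):
        self.attribute = attribute  # attribute tested at this node
        self.children = []          # list of (value, child node) pairs
        self.answer = ""            # class label, set only on leaf nodes

def subtables(data, col, delete):
    # group the rows by their value in column col; drop that column if delete is True
    dic = {}
    for row in data:
        new_row = row[:col] + row[col+1:] if delete else row
        dic.setdefault(row[col], []).append(new_row)
    return list(dic.keys()), dic

def entropy(S):
    # entropy of a list of class labels (assumes at most two classes)
    attr = list(set(S))
    if len(attr) == 1:
        return 0
    counts = [0, 0]
    for i in range(2):
        counts[i] = sum([1 for x in S if attr[i] == x]) / (len(S) * 1.0)
    sums = 0
    for cnt in counts:
        sums += -1 * cnt * math.log(cnt, 2)
    return sums

def compute_gain(data, col):
    # information gain obtained by splitting data on column col
    attr, dic = subtables(data, col, delete=False)
    total_size = len(data)
    entropies = [0] * len(attr)
    ratio = [0] * len(attr)
    total_entropy = entropy([row[-1] for row in data])
    for x in range(len(attr)):
        ratio[x] = len(dic[attr[x]]) / (total_size * 1.0)
        entropies[x] = entropy([row[-1] for row in dic[attr[x]]])
        total_entropy -= ratio[x] * entropies[x]
    return total_entropy

def build_tree(data, features):
    lastcol = [row[-1] for row in data]
    if len(set(lastcol)) == 1:
        # every row has the same label: make a leaf
        node = Node("")
        node.answer = lastcol[0]
        return node
    n = len(data[0]) - 1
    gains = [0] * n
    for col in range(n):
        gains[col] = compute_gain(data, col)
    split = gains.index(max(gains))  # attribute with the highest gain
    node = Node(features[split])
    fea = features[:split] + features[split+1:]
    attr, dic = subtables(data, split, delete=True)
    for x in range(len(attr)):
        child = build_tree(dic[attr[x]], fea)
        node.children.append((attr[x], child))
    return node

def print_tree(node, level):
    if node.answer != "":
        print(" " * level, node.answer)
        return
    print(" " * level, node.attribute)
    for value, n in node.children:
        print(" " * (level + 1), value)
        print_tree(n, level + 2)

def classify(node, x_test, features):
    if node.answer != "":
        print(node.answer)
        return
    pos = features.index(node.attribute)
    for value, n in node.children:
        if x_test[pos] == value:
            classify(n, x_test, features)

# Main program: calls the functions defined above
dataset, features = load_csv("data3.csv")
node1 = build_tree(dataset, features)
print("The decision tree for the dataset using ID3 algorithm is")
print_tree(node1, 0)
# load the second dataset to test the model
testdata, features = load_csv("data3_test.csv")
for xtest in testdata:
    print("\n The test instance:", xtest)
    print("The label for test instance:", end="")
    classify(node1, xtest, features)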
OUTPUT:-
The decision tree for the dataset using ID3 algorithm is
 Outlook
  Rain
   Wind
    Strong
     No
    Weak
     Yes
  Overcast
   Yes
  Sunny
   Humidity
    Normal
     Yes
    High
     No
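The root of the tree is the attribute with the highest information gain. As a worked check, the sketch below computes the entropy of the labels and the gain of Outlook. It assumes that data3.csv is the classic 14-row PlayTennis table (Sunny: 2 Yes / 3 No, Overcast: 4 Yes, Rain: 3 Yes / 2 No), which the printed tree suggests but the listing does not confirm. The helper names are illustrative.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(rows):
    """rows is a list of (attribute_value, label) pairs for one attribute."""
    labels = [label for _, label in rows]
    gain = entropy(labels)
    for value in set(v for v, _ in rows):
        subset = [label for v, label in rows if v == value]
        gain -= len(subset) / len(rows) * entropy(subset)
    return gain

# Assumed PlayTennis counts for the Outlook attribute
outlook = ([("Sunny", "Yes")] * 2 + [("Sunny", "No")] * 3 +
           [("Overcast", "Yes")] * 4 +
           [("Rain", "Yes")] * 3 + [("Rain", "No")] * 2)

total_entropy = entropy([label for _, label in outlook])
outlook_gain = information_gain(outlook)
print(f"Entropy(S) = {total_entropy:.3f}")
print(f"Gain(S, Outlook) = {outlook_gain:.3f}")
```

Under that assumption the entropy is about 0.940 bits and the gain of Outlook about 0.247 bits.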
PROGRAM:-
#EXP-4(a)
#pip install numpy pandas matplotlib scikit-learn
import numpy as np
import pandas as pd
import matplotlib.pyplot as mtp
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
data_set = pd.read_csv(r'salary.csv')
#print(data_set)
x = data_set.iloc[:, :-1].values  # features: every column except the last
y = data_set.iloc[:, -1].values   # target: the last column
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=1/3, random_state=0)
regressor = LinearRegression()
regressor.fit(x_train, y_train)
y_pred = regressor.predict(x_test)   # predictions for the test set
x_pred = regressor.predict(x_train)  # fitted values for the training set
# training points with the fitted line
mtp.scatter(x_train, y_train, color="green")
mtp.plot(x_train, x_pred, color="red")
mtp.title("Salary vs Experience (Training Dataset)")
mtp.xlabel("Years of Experience")
mtp.ylabel("Salary (in rupees)")
mtp.show()
# test points against the same fitted line
mtp.scatter(x_test, y_test, color="blue")
mtp.plot(x_train, x_pred, color="red")
mtp.title("Salary vs Experience (Test Dataset)")
mtp.xlabel("Years of Experience")
mtp.ylabel("Salary (in rupees)")
mtp.show()
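For a single feature, the line that `LinearRegression` fits can also be computed directly from the least-squares formulas: slope = Sxy / Sxx and intercept = mean(y) - slope * mean(x). The sketch below is a plain-Python illustration. The `fit_line` helper and the experience/salary pairs are made up for the example and are not taken from salary.csv.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for one feature."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data lying exactly on salary = 5000 * years + 20000
years = [1, 2, 3, 4, 5]
salary = [25000, 30000, 35000, 40000, 45000]

slope, intercept = fit_line(years, salary)
print("slope:", slope, "intercept:", intercept)
print("predicted salary for 6 years:", slope * 6 + intercept)
```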
B. Logistic Regression
DATASET:- user_data.csv
PROGRAM:-
import numpy as nm
import matplotlib.pyplot as mtp
import pandas as pd
#importing datasets
data_set= pd.read_csv('user_data.csv')
#print(data_set)
# use columns 3 and 4 as the features and column 5 as the label
x= data_set.iloc[:, [3,4]].values
y= data_set.iloc[:, 5].values
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test= train_test_split(x, y, test_size= 0.25, random_state=0)
# scale the features; the scaler is fitted on the training data only
from sklearn.preprocessing import StandardScaler
st_x= StandardScaler()
x_train= st_x.fit_transform(x_train)
x_test= st_x.transform(x_test)
from sklearn.linear_model import LogisticRegression
classifier= LogisticRegression(random_state=0)
classifier.fit(x_train, y_train)
y_pred= classifier.predict(x_test)
# confusion matrix: rows are actual classes, columns are predicted classes
from sklearn.metrics import confusion_matrix
cm= confusion_matrix(y_test,y_pred)
print(cm)
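C. Binary Classifier
PROGRAM:-
# NOTE: several lines of this listing were lost (imports, the train/test split,
# the first model fit, the loop header and the df_model construction). Lines
# marked "reconstructed" are assumptions that fill those gaps.
import pandas as pd
import matplotlib.pyplot as plt                   # reconstructed
from sklearn.datasets import load_breast_cancer   # reconstructed
from sklearn.preprocessing import StandardScaler  # reconstructed
dataset = load_breast_cancer(as_frame=True)
dataset['data'].head()
dataset['target'].head()
dataset['target'].value_counts()
X = dataset['data']
y = dataset['target']
from sklearn.model_selection import train_test_split
# reconstructed; test_size=0.25 gives the 143 test samples counted in the output,
# random_state=0 is an assumption
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
ss_train = StandardScaler()
X_train = ss_train.fit_transform(X_train)
# reuse the scaler fitted on the training data instead of fitting a second one on the test data
X_test = ss_train.transform(X_test)
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()          # reconstructed
model.fit(X_train, y_train)           # reconstructed
predictions = model.predict(X_test)   # reconstructed
#print(predictions)
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, predictions)
TN, FP, FN, TP = cm.ravel()           # reconstructed
print('True Positive(TP) =', TP)
print('False Positive(FP) =', FP)
print('True Negative(TN) =', TN)
print('False Negative(FN) =', FN)
print('Accuracy of the binary classifier = {:0.3f}'.format((TP + TN) / (TP + FP + TN + FN)))
models = {}
# Logistic Regression
models['Logistic Regression'] = LogisticRegression()
# Decision Trees
from sklearn.tree import DecisionTreeClassifier
models['Decision Trees'] = DecisionTreeClassifier()
# Random Forest
from sklearn.ensemble import RandomForestClassifier
models['Random Forest'] = RandomForestClassifier()
# Naive Bayes
from sklearn.naive_bayes import GaussianNB
models['Naive Bayes'] = GaussianNB()
# K-Nearest Neighbors
from sklearn.neighbors import KNeighborsClassifier
models['K-Nearest Neighbor'] = KNeighborsClassifier()
# Calculate metrics
from sklearn.metrics import accuracy_score, precision_score, recall_score
accuracy, precision, recall = {}, {}, {}
for key in models.keys():             # reconstructed loop
    models[key].fit(X_train, y_train)
    predictions = models[key].predict(X_test)
    # scikit-learn metrics take (y_true, y_pred) in that order
    accuracy[key] = accuracy_score(y_test, predictions)
    precision[key] = precision_score(y_test, predictions)
    recall[key] = recall_score(y_test, predictions)
# reconstructed: one row per model, one column per metric
df_model = pd.DataFrame({'Accuracy': accuracy, 'Precision': precision, 'Recall': recall})
print(df_model)
ax = df_model.plot.barh()
ax.legend(
    ncol=len(models.keys()),
    bbox_to_anchor=(0, 1),
    loc='lower left',
    prop={'size': 14}
)
plt.tight_layout()
plt.show()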
OUTPUT:-
True Positive(TP) = 86
False Positive(FP) = 2
True Negative(TN) = 51
False Negative(FN) = 4
Accuracy of the binary classifier = 0.958
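The four counts above determine the usual classification metrics. As a quick check, the sketch below recomputes the accuracy from the printed counts and derives precision, recall and F1 from the same numbers. The variable names are illustrative; only the four counts come from the output.

```python
# Counts taken from the output above
tp, fp, tn, fn = 86, 2, 51, 4

accuracy = (tp + tn) / (tp + fp + tn + fn)  # share of all predictions that are right
precision = tp / (tp + fp)                  # share of predicted positives that are right
recall = tp / (tp + fn)                     # share of actual positives that are found
f1 = 2 * precision * recall / (precision + recall)

print(f"Accuracy  = {accuracy:.3f}")
print(f"Precision = {precision:.3f}")
print(f"Recall    = {recall:.3f}")
print(f"F1 score  = {f1:.3f}")
```

The accuracy comes out as 137/143, which rounds to the 0.958 printed by the program.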
Experiment-5
Develop a program for Bias, Variance, Remove duplicates,
Cross Validation
PROGRAM:-
#NO DATASET
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
# generate some sample data
X = np.random.rand(100, 10)
y = np.random.rand(100)
# split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y,
test_size=0.2)
# train a linear regression model on the training data
model = LinearRegression()
model.fit(X_train, y_train)
# calculate the mean squared error on the test data
y_pred = model.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
print(f"Mean squared error: {mse:.3f}")
# calculate the bias and variance
# (rough proxies only: the training error is reported as "bias" and the test
# error as "variance"; this is not the formal bias-variance decomposition)
y_pred_train = model.predict(X_train)
bias = np.mean((y_pred_train - y_train) ** 2)
variance = np.mean((y_pred - y_test) ** 2)
print(f"Bias: {bias:.3f}")
print(f"Variance: {variance:.3f}")
# remove duplicates from the data
X_no_duplicates, indices = np.unique(X, axis=0, return_index=True)
y_no_duplicates = y[indices]
print(f"Number of duplicates removed: {X.shape[0] - X_no_duplicates.shape[0]}")
# perform k-fold cross-validation
from sklearn.model_selection import KFold
kf = KFold(n_splits=5)
mse_scores = []
for train_index, test_index in kf.split(X):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    model = LinearRegression()
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    mse_scores.append(mse)
print(f"Cross-validation mean squared error: {np.mean(mse_scores):.3f}")
OUTPUT:-
Mean squared error: 0.073
Bias: 0.075
Variance: 0.073
Number of duplicates removed: 0
Cross-validation mean squared error: 0.095
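The program above reports the training and test mean squared errors under the names bias and variance. For comparison, here is a minimal sketch of the formal decomposition, which needs many models trained on different samples of the same problem. Everything in it is an assumption made for illustration: the sine target, the noise level, the polynomial models and the `decompose` helper. For each test point it measures the squared gap between the average prediction and the true value (bias squared) and the spread of the predictions across training sets (variance).

```python
import numpy as np

rng = np.random.default_rng(0)

def true_f(x):
    # assumed ground-truth function, used only for this illustration
    return np.sin(2 * np.pi * x)

x_test = np.linspace(0.05, 0.95, 20)

def decompose(degree, n_rounds=200, n_train=30, noise=0.3):
    """Estimate bias^2, variance and their total for a polynomial of given degree."""
    preds = np.empty((n_rounds, x_test.size))
    for r in range(n_rounds):
        # draw a fresh noisy training set and fit a polynomial to it
        x = rng.uniform(0, 1, n_train)
        y = true_f(x) + rng.normal(0, noise, n_train)
        coeffs = np.polyfit(x, y, degree)
        preds[r] = np.polyval(coeffs, x_test)
    mean_pred = preds.mean(axis=0)
    bias_sq = np.mean((mean_pred - true_f(x_test)) ** 2)
    variance = np.mean(preds.var(axis=0))
    error = np.mean((preds - true_f(x_test)) ** 2)  # error against the noise-free target
    return bias_sq, variance, error

results = {degree: decompose(degree) for degree in (1, 5)}
for degree, (bias_sq, variance, error) in results.items():
    print(f"degree {degree}: bias^2={bias_sq:.4f} variance={variance:.4f} total={error:.4f}")
```

In this setup the total error against the noise-free target equals bias squared plus variance, so a straight line (high bias) and a degree-5 polynomial (lower bias) can be compared on both terms.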
Experiment-6
Write a program to implement Categorical Encoding, One-hot
Encoding
PROGRAM:
#NO DATASET
import pandas as pd
# create a sample dataframe with categorical variables
data = {'gender': ['male', 'female', 'male', 'male', 'female']}
df = pd.DataFrame(data)
# perform categorical encoding using pandas' 'astype' method
df['gender_encoded'] = df['gender'].astype('category').cat.codes
print(df)
# perform one-hot encoding using pandas' 'get_dummies' function
# (dtype=int gives the 0/1 columns shown below; pandas 2.x defaults to True/False)
df_onehot = pd.get_dummies(df, columns=['gender'], dtype=int)
print(df_onehot)
OUTPUT:
   gender  gender_encoded
0    male               1
1  female               0
2    male               1
3    male               1
4  female               0
   gender_encoded  gender_female  gender_male
0               1              0            1
1               0              1            0
2               1              0            1
3               1              0            1
4               0              1            0
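To show what the two pandas calls do, the sketch below reproduces both encodings by hand in plain Python. It assumes, as the output suggests, that categories are numbered in sorted order; the variable names are illustrative.

```python
genders = ['male', 'female', 'male', 'male', 'female']

# Categorical (label) encoding: number the categories in sorted order
categories = sorted(set(genders))
code_of = {category: i for i, category in enumerate(categories)}
label_encoded = [code_of[g] for g in genders]

# One-hot encoding: one 0/1 column per category
one_hot = [[1 if g == category else 0 for category in categories] for g in genders]

print(categories)
print(label_encoded)
for row in one_hot:
    print(row)
```

Each one-hot row contains exactly one 1, in the column of that row's category.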
Experiment-7:
Build an Artificial Neural Network by implementing the Back
propagation algorithm and test the same using appropriate data
sets.
PROGRAM:-
#NO DATASET
import numpy as np
X = np.array(([2, 9], [1, 5], [3, 6]), dtype=float)
y = np.array(([92], [86], [89]), dtype=float)
X = X/np.amax(X,axis=0) # scale each input column by its maximum
y = y/100 # scale the targets into the 0-1 range of the sigmoid output
#Sigmoid Function
def sigmoid (x):
    return (1/(1 + np.exp(-x)))
#Derivative of Sigmoid Function
# (x is expected to be a sigmoid output already, so this is s*(1-s))
def derivatives_sigmoid(x):
    return x * (1 - x)
#Variable initialization
epoch=7000 #Setting training iterations
lr=0.1 #Setting learning rate
inputlayer_neurons = 2 #number of features in data set
hiddenlayer_neurons = 3 #number of hidden layer neurons
output_neurons = 1 #number of neurons at output layer
#weight and bias initialization
# np.random.uniform draws values uniformly from [0, 1) with the given shape
wh=np.random.uniform(size=(inputlayer_neurons,hiddenlayer_neurons))
bh=np.random.uniform(size=(1,hiddenlayer_neurons))
wout=np.random.uniform(size=(hiddenlayer_neurons,output_neurons))
bout=np.random.uniform(size=(1,output_neurons))
for i in range(epoch):
    #Forward Propagation
    hinp1=np.dot(X,wh)
    hinp=hinp1 + bh
    hlayer_act = sigmoid(hinp)
    outinp1=np.dot(hlayer_act,wout)
    outinp= outinp1+ bout
    output = sigmoid(outinp)
    #Backpropagation
    EO = y-output
    outgrad = derivatives_sigmoid(output)
    d_output = EO* outgrad
    EH = d_output.dot(wout.T)
    #how much hidden layer wts contributed to error
    hiddengrad = derivatives_sigmoid(hlayer_act)
    d_hiddenlayer = EH * hiddengrad
    # dot product of next layer error and current layer output
    wout += hlayer_act.T.dot(d_output) *lr
    bout += np.sum(d_output, axis=0,keepdims=True) *lr
    wh += X.T.dot(d_hiddenlayer) *lr
    # the hidden bias is left untrained in this listing; uncomment to update it too
    #bh += np.sum(d_hiddenlayer, axis=0,keepdims=True) *lr
print("Input: \n" + str(X))
print("Actual Output: \n" + str(y))
print("Predicted Output: \n",output)
OUTPUT:-
Input:
[[0.66666667 1. ]
[0.33333333 0.55555556]
[1. 0.66666667]]
Actual Output:
[[92.]
[86.]
[89.]]
Predicted Output:
[[0.8855704 ]
[0.87086645]
[0.88298625]]
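The backpropagation step relies on the identity that the derivative of the sigmoid can be written in terms of its own output, s * (1 - s), which is why the program passes layer activations rather than raw inputs to its derivative function. The sketch below checks that identity against a central finite difference at a few arbitrary points. The function names and the step size `h` are illustrative choices.

```python
import math

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

def sigmoid_grad_from_output(s):
    # derivative expressed through the sigmoid output s = sigmoid(x)
    return s * (1 - s)

def numeric_grad(x, h=1e-6):
    # central finite-difference approximation of d(sigmoid)/dx
    return (sigmoid(x + h) - sigmoid(x - h)) / (2 * h)

points = (-2.0, 0.0, 1.5)
gaps = []
for x in points:
    analytic = sigmoid_grad_from_output(sigmoid(x))
    numeric = numeric_grad(x)
    gaps.append(abs(analytic - numeric))
    print(f"x={x:+.1f} analytic={analytic:.6f} numeric={numeric:.6f}")

max_gap = max(gaps)
```

The two columns agree to within the accuracy of the finite difference, and the derivative peaks at 0.25 when x = 0.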
Experiment-8:
Write a program to implement k-Nearest Neighbor algorithm to
classify the iris data set. Print both correct and wrong
predictions
PROGRAM:-
#NO DATASET
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
# Load Iris dataset and split into train/test sets
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(iris.data,
iris.target, test_size=0.2,random_state=42)
# Initialize k-NN classifier
k = 3
knn = KNeighborsClassifier(n_neighbors=k)
# Fit the classifier to the training data
knn.fit(X_train, y_train)
# Predict labels for test set
y_pred = knn.predict(X_test)
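# Print correct and wrong predictions
correct = 0
wrong = 0
for i in range(len(y_test)):
    if y_pred[i] == y_test[i]:
        correct += 1
        print(f"Correct prediction: Actual class {y_test[i]}, Predicted class {y_pred[i]}")
    else:
        wrong += 1
        # the listing is cut off after the counter; the two prints below are an assumed completion
        print(f"Wrong prediction: Actual class {y_test[i]}, Predicted class {y_pred[i]}")
print(f"Correct predictions: {correct}, Wrong predictions: {wrong}")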
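To show what the library classifier does internally, here is a minimal from-scratch sketch of k-nearest-neighbour voting in plain Python. The 2-D points, the labels and the `knn_predict` helper are hypothetical and unrelated to the iris data. The last test case is given a deliberately wrong "actual" label so that both kinds of message are printed.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Predict the label of query by majority vote among its k nearest points."""
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]

# Hypothetical training data: two well-separated groups
train_points = [(1.0, 1.1), (1.2, 0.9), (0.8, 1.0), (5.0, 5.2), (5.1, 4.8), (4.9, 5.0)]
train_labels = ["a", "a", "a", "b", "b", "b"]

# (query point, actual label); the last label is wrong on purpose
test_cases = [((1.1, 1.0), "a"), ((5.0, 5.0), "b"), ((4.0, 4.0), "a")]

for query, actual in test_cases:
    predicted = knn_predict(train_points, train_labels, query)
    status = "Correct" if predicted == actual else "Wrong"
    print(f"{status} prediction: Actual class {actual}, Predicted class {predicted}")
```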
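Experiment-10:
Assuming a set of documents that need to be classified, use the
naïve Bayesian Classifier model to perform this task. Built-in
Java classes/API can be used to write the program. Calculate the
accuracy, precision, and recall for your data set.
PROGRAM:-
# NOTE: the opening lines of this listing are missing. Lines marked "assumed"
# are a reconstruction; the file name and column names are placeholders.
import pandas as pd
msg = pd.read_csv('document.csv', names=['message', 'label'])  # assumed
print("Total Instances of Dataset:", msg.shape[0])             # assumed
msg['labelnum'] = msg.label.map({'pos': 1, 'neg': 0})          # assumed
X = msg.message                                                 # assumed
y = msg.labelnum
from sklearn.model_selection import train_test_split
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y)
# turn each document into a vector of word counts
from sklearn.feature_extraction.text import CountVectorizer
count_v = CountVectorizer()
Xtrain_dm = count_v.fit_transform(Xtrain)
Xtest_dm = count_v.transform(Xtest)
df = pd.DataFrame(Xtrain_dm.toarray(), columns=count_v.get_feature_names_out())
print(df[0:5])
from sklearn.naive_bayes import MultinomialNB
clf = MultinomialNB()
clf.fit(Xtrain_dm, ytrain)
pred = clf.predict(Xtest_dm)
# pair each test document with its own prediction (pred was computed on Xtest)
for doc, p in zip(Xtest, pred):
    p = 'pos' if p == 1 else 'neg'
    print("%s -> %s" % (doc, p))
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score, recall_score
print('Accuracy Metrics: \n')
print('Accuracy: ', accuracy_score(ytest, pred))
print('Recall: ', recall_score(ytest, pred))
print('Precision: ', precision_score(ytest, pred))
print('Confusion Matrix: \n', confusion_matrix(ytest, pred))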
OUTPUT:-
Total Instances of Dataset: 18
   about  an  awesome  bad  beers  best  boss  can  dance  deal  ...  these
0      0   0        0    0      0     0     0    0      0     0  ...      0
1      0   0        0    0      0     1     0    0      0     0  ...      0
2      0   0        0    1      0     0     0    0      0     0  ...      0
3      0   0        0    0      0     0     0    0      1     0  ...      0
4      0   0        0    0      0     0     0    0      0     0  ...      0
[5 rows x 43 columns]
I went to my enemy's house today -> pos
This is my best work -> pos
That is a bad locality to stay -> neg
I love to dance -> pos
I do not like this restaurant -> pos
Accuracy Metrics:
Accuracy: 0.8
Recall: 1.0
Precision: 0.75
Confusion Matrix:
[[1 1]
[0 3]]
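For readers who want to see the model behind `MultinomialNB`, the sketch below implements a multinomial naive Bayes text classifier in plain Python. It uses word counts per class, Laplace (add-one) smoothing and log probabilities. The four training sentences and the helper names `train_nb` and `predict_nb` are hypothetical; they are not the data set or code used above.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels):
    """Count documents per class and words per class."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    vocab = set()
    for doc, label in zip(docs, labels):
        words = doc.lower().split()
        word_counts[label].update(words)
        vocab.update(words)
    return class_counts, word_counts, vocab

def predict_nb(model, doc):
    """Return the class with the highest log posterior for doc."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_counts.items():
        score = math.log(n_docs / total_docs)          # log prior
        total_words = sum(word_counts[label].values())
        for word in doc.lower().split():
            if word not in vocab:
                continue                               # skip words never seen in training
            # Laplace-smoothed log likelihood of the word given the class
            score += math.log((word_counts[label][word] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical training documents
docs = ["I love this sandwich", "This is an amazing place",
        "I do not like this restaurant", "I am tired of this stuff"]
labels = ["pos", "pos", "neg", "neg"]

model = train_nb(docs, labels)
for sentence in ["I love this place", "I am tired"]:
    print(f"{sentence} -> {predict_nb(model, sentence)}")
```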
Experiment-11:
Apply EM algorithm to cluster a Heart Disease Data Set. Use the
same data set for clustering using k-Means algorithm. Compare the
results of these two algorithms and comment on the quality of
clustering. You can add Java/Python ML library classes/API in the
program.
PROGRAM:-
import pandas as pd
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import StandardScaler
# A four-row sample of the Heart Disease Data Set, entered inline
data = pd.DataFrame({
"Age": [40, 49, 37, 48],
"Sex": ["M", "F", "M", "F"],
"ChestPainType": ["ATA", "NAP", "ATA", "ASY"],
"RestingBP": [140, 160, 130, 138],
"Cholesterol": [289, 180, 283, 214],
"FastingBS": [0, 0, 0, 0],
"RestingECG": ["Normal", "Normal", "ST", "Normal"],
"MaxHR": [172, 156, 98, 108],
"ExerciseAngina": ["N", "N", "N", "Y"],
"Oldpeak": [0, 1, 0, 1.5],
"ST_Slope": ["Up", "Flat", "Up", "Flat"],
"HeartDisease": [0, 1, 0, 1]
})
# Preprocess the data
# Handle missing values
data = data.dropna()
# Encode categorical features
le = LabelEncoder()
data["Sex"] = le.fit_transform(data["Sex"])
data["ChestPainType"] = le.fit_transform(data["ChestPainType"])
data["RestingECG"] = le.fit_transform(data["RestingECG"])
data["ExerciseAngina"] = le.fit_transform(data["ExerciseAngina"])
data["ST_Slope"] = le.fit_transform(data["ST_Slope"])
# Scale the data
scaler = StandardScaler()
data = scaler.fit_transform(data)
# Apply the EM algorithm (a Gaussian mixture model is fitted with EM)
gmm = GaussianMixture(n_components=2, random_state=0)
gmm.fit(data)
em_labels = gmm.predict(data)
# Apply the k-Means algorithm
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
kmeans.fit(data)
kmeans_labels = kmeans.predict(data)
# Evaluate the quality of the clustering results
print("Silhouette score for EM algorithm:",
silhouette_score(data,em_labels))
print("Silhouette score for k-Means algorithm:",
silhouette_score(data,kmeans_labels))
OUTPUT:-
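The program uses library classes for both algorithms. As an illustration of what k-Means does at each iteration, the sketch below implements its two alternating steps in plain Python: assign every point to its nearest centroid, then move each centroid to the mean of its points. EM for a Gaussian mixture follows the same alternating pattern but with soft (probabilistic) assignments. The six 2-D points, the initial centroids and the `kmeans` helper are hypothetical.

```python
import math

def kmeans(points, centroids, n_iter=10):
    """Lloyd's algorithm with hard assignments; returns (labels, centroids)."""
    centroids = list(centroids)
    for _ in range(n_iter):
        # assignment step: index of the nearest centroid for every point
        labels = [min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))
                  for p in points]
        # update step: move each centroid to the mean of its members
        new_centroids = []
        for c in range(len(centroids)):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                new_centroids.append(tuple(sum(axis) / len(members) for axis in zip(*members)))
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break                                   # converged
        centroids = new_centroids
    return labels, centroids

# Hypothetical data: two well-separated groups
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
labels, centroids = kmeans(points, centroids=[points[0], points[3]])
print("labels:", labels)
print("centroids:", centroids)
```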
Experiment-12:
Exploratory Data Analysis for Classification using Pandas or
Matplotlib.
PROGRAM:-
import pandas as pd
import matplotlib.pyplot as plt
# Load the data into a Pandas dataframe
data = pd.read_csv('dataset.csv')
# Get a summary of the data
print(data.describe())
# Plot histograms of the numerical features
data.hist(bins=10, figsize=(20,15))
plt.show()
# Plot a scatter matrix of the numerical features
from pandas.plotting import scatter_matrix
scatter_matrix(data, figsize=(20,15))
plt.show()
# Plot a bar chart of the loan purposes
data['loan_purpose'].value_counts().plot(kind='bar')
plt.show()
# Plot a pie chart of the labels
data['label\t\t'].value_counts().plot(kind='pie',
autopct='%1.1f%%')
plt.show()
OUTPUT:-
       is_first_loan  total_credit_card_limit  \
count      29.000000                29.000000
mean        0.517241              4658.620690
std         0.508548              1864.234282
min         0.000000              2500.000000
25%         0.000000              3000.000000
50%         1.000000              4100.000000
75%         1.000000              5900.000000
max         1.000000              7900.000000
       avg_percentage_credit_card_limit_used_last_year  saving_amount  \
count                                        29.000000      29.000000
mean                                          0.665862    1551.172414
std                                           0.213366     865.010201
min                                           0.220000      88.000000
25%                                           0.520000    1058.000000
50%                                           0.690000    1310.000000
75%                                           0.860000    1958.000000
max                                           0.950000    3866.000000
       dependent_number  label\t\t
count         29.000000  29.000000
mean           3.758621   0.344828
std            2.898955   0.483725
min            0.000000   0.000000
25%            1.000000   0.000000
50%            3.000000   0.000000
75%            6.000000   1.000000
max            8.000000   1.000000
Experiment-13:
Write a Python program to construct a Bayesian network considering
medical data. Use this model to demonstrate the diagnosis of heart
patients using the standard Heart Disease Data Set.
DATASET:-
   ca  thal  Heartdisease
0   0     6             0
1   3     3             2
2   2     7             1
3   0     3             0
4   2     3             3
Output:-
PCA(n_components=2)
Accuracy: 0.98
Experiment-15:
Write a program to implement Principal Component Analysis
PROGRAM:-
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.datasets import load_iris
# Load the Iris dataset as an example
iris = load_iris()
X = iris.data
y = iris.target
# Instantiate the PCA object with the number of components
pca = PCA(n_components=2)
# Fit and transform the data using PCA
X_pca = pca.fit_transform(X)
# Create a new dataframe with the PCA results and the target variable
df = pd.DataFrame(data=X_pca, columns=['PC1', 'PC2'])
df['target'] = y
# Plot the PCA results, one colour per class
import matplotlib.pyplot as plt
plt.figure(figsize=(8,6))
targets = [0, 1, 2]
colors = ['r', 'g', 'b']
for target, color in zip(targets, colors):
    indicesToKeep = df['target'] == target
    plt.scatter(df.loc[indicesToKeep, 'PC1'],
                df.loc[indicesToKeep, 'PC2'],
                c=color,
                s=50)
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.legend(targets)
plt.show()
OUTPUT:-
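The program relies on the `PCA` class from scikit-learn. As an illustration of the underlying computation, the sketch below performs PCA with NumPy alone: centre the data, compute the covariance matrix, take its eigenvectors and project onto those with the largest eigenvalues. The synthetic data and the `pca` helper are assumptions made for this example. scikit-learn itself uses an SVD-based routine, so the signs of the components can differ.

```python
import numpy as np

def pca(X, n_components):
    """Project X onto its top principal components; also return variance ratios."""
    X_centered = X - X.mean(axis=0)
    cov = np.cov(X_centered, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigh suits symmetric matrices
    order = np.argsort(eigvals)[::-1]        # largest eigenvalue first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    components = eigvecs[:, :n_components]
    explained_ratio = eigvals[:n_components] / eigvals.sum()
    return X_centered @ components, explained_ratio

# Synthetic data: the second column is almost a multiple of the first,
# so most of the variance lies along a single direction
rng = np.random.default_rng(0)
t = rng.normal(size=200)
X = np.column_stack([
    t,
    2 * t + rng.normal(scale=0.1, size=200),
    rng.normal(scale=0.1, size=200),
])

X_pca, ratio = pca(X, n_components=2)
print("projected shape:", X_pca.shape)
print("explained variance ratio:", ratio)
```

The projected columns are uncorrelated by construction, and the first ratio is close to 1 for this data because the three columns are dominated by one direction.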