
EXPERIMENT-7 IDS

October 2, 2023

[ ]: KNN CLASSIFIER
URK21CS2056
[ ]: AIM:
To implement KNN classification for the given dataset and analyse the
performance of the classifier with different K values.

[ ]: DESCRIPTION:
The K-nearest neighbours (KNN) algorithm is a type of supervised ML
algorithm which can be used for both classification and regression
predictive problems. However, it is mainly used for classification
predictive problems in industry.

The following two properties define KNN well:

- Lazy learning algorithm: KNN is a lazy learning algorithm because it
does not have a specialized training phase; it uses all of the training
data at classification time.
- Non-parametric learning algorithm: KNN is also a non-parametric
learning algorithm because it does not assume anything about the
underlying data.

Working of KNN Algorithm:

The K-nearest neighbours (KNN) algorithm uses 'feature similarity' to
predict the values of new data points, which means that a new data
point is assigned a value based on how closely it matches the points
in the training set.

We can understand its working with the help of the following steps
(a minimal sketch follows the list):

Step 1: For implementing any algorithm, we need a dataset. So, during
the first step of KNN, load the training as well as the test data.
Step 2: Choose the value of K, i.e. the number of nearest data points.
K can be any integer.
Step 3: For each point in the test data, do the following:
3.1: Calculate the distance between the test point and each row of
the training data using any one method, namely Euclidean,
Manhattan or Hamming distance. The most commonly used method
is Euclidean distance.
3.2: Based on the distance values, sort them in ascending order.
3.3: Choose the top K rows from the sorted array.
3.4: Assign a class to the test point based on the most frequent
class of these rows.
Step 4: End
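
As an illustration, here is a minimal from-scratch sketch of these steps
(the arrays X_train, y_train and the query point x_new are assumptions
for illustration only; Euclidean distance is used):

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    # Step 3.1: Euclidean distance from x_new to every training row
    distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    # Steps 3.2-3.3: indices of the K nearest training rows
    nearest = np.argsort(distances)[:k]
    # Step 3.4: majority vote among the K neighbours
    return Counter(y_train[nearest]).most_common(1)[0][0]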

Performance Metrics for Classification Problems:

Various performance metrics that can be used to evaluate predictions
for classification problems are given below:

Confusion Matrix:
It is the easiest way to measure the performance of a
classification problem where the output can be of two or more
classes.
A confusion matrix is nothing but a table with two dimensions,
viz. "Actual" and "Predicted", and both dimensions have
"True Positives (TP)", "True Negatives (TN)",
"False Positives (FP)" and "False Negatives (FN)".

True Positives (TP): the case when both the actual class and the
predicted class of a data point are 1.
True Negatives (TN): the case when both the actual class and the
predicted class of a data point are 0.
False Positives (FP): the case when the actual class of a data
point is 0 and the predicted class is 1.
False Negatives (FN): the case when the actual class of a data
point is 1 and the predicted class is 0.

Classification Accuracy:
The accuracy_score function of sklearn.metrics is used to
compute the accuracy of the classification model.

Accuracy = (TP+TN)/(TP+FP+FN+TN)

Precision:
Precision, originally used in document retrieval, may be defined
as the fraction of predicted positives that are actually
positive.

Precision = TP/(TP+FP)

Recall or Sensitivity:
Recall may be defined as the fraction of actual positives that
are correctly returned by the classification model.
It can be calculated from the confusion matrix:

Recall = TP/(TP+FN)

Specificity:
Specificity may be defined as the fraction of actual negatives
that are correctly identified by the classification model.

Specificity = TN/(TN+FP)

F1 Score:
This score gives us the harmonic mean of precision and recall,
with equal relative contribution from precision and recall.

F1 = 2 * (precision * recall) / (precision + recall)

AUC (Area Under ROC Curve):

AUC (Area Under Curve)-ROC (Receiver Operating Characteristic)
is a performance metric for classification problems, based on
varying threshold values. The roc_auc_score function of
sklearn.metrics is used to compute AUC-ROC.
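
As a quick illustration, here is a minimal sketch computing all of the
above metrics with sklearn.metrics; the y_true, y_pred and y_prob
values below are made up purely for illustration:

from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score, roc_auc_score,
                             confusion_matrix)

y_true = [1, 0, 1, 1, 0, 1, 0, 0]                   # hypothetical labels
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]                   # hypothetical predictions
y_prob = [0.9, 0.2, 0.8, 0.4, 0.1, 0.7, 0.6, 0.3]   # hypothetical scores

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(accuracy_score(y_true, y_pred))    # (TP+TN)/(TP+FP+FN+TN)
print(precision_score(y_true, y_pred))   # TP/(TP+FP)
print(recall_score(y_true, y_pred))      # TP/(TP+FN)
print(tn / (tn + fp))                    # specificity (no direct sklearn helper)
print(f1_score(y_true, y_pred))          # harmonic mean of precision and recall
print(roc_auc_score(y_true, y_prob))     # AUC-ROC from predicted scores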

[ ]: a. Develop a KNN classification model for the Cancer dataset using
scikit-learn.
Use the columns: 'radius_mean', 'texture_mean', 'perimeter_mean',
'area_mean', 'smoothness_mean', 'compactness_mean', 'concavity_mean',
'concave points_mean', 'symmetry_mean', 'fractal_dimension_mean' as the
independent variables.


[1]: import pandas as pd
df = pd.read_csv('Cancer.csv')

[2]: import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report, roc_curve, auc
import matplotlib.pyplot as plt

[3]: df.dropna(inplace = True)

x = df.iloc[:, 1:-1].values
print(x[0:10])

[[1.799e+01 1.038e+01 1.228e+02 1.001e+03 1.184e-01 2.776e-01 3.001e-01
  1.471e-01 2.419e-01 7.871e-02]
 [2.057e+01 1.777e+01 1.329e+02 1.326e+03 8.474e-02 7.864e-02 8.690e-02
  7.017e-02 1.812e-01 5.667e-02]
 [1.969e+01 2.125e+01 1.300e+02 1.203e+03 1.096e-01 1.599e-01 1.974e-01
  1.279e-01 2.069e-01 5.999e-02]
 [1.142e+01 2.038e+01 7.758e+01 3.861e+02 1.425e-01 2.839e-01 2.414e-01
  1.052e-01 2.597e-01 9.744e-02]
 [2.029e+01 1.434e+01 1.351e+02 1.297e+03 1.003e-01 1.328e-01 1.980e-01
  1.043e-01 1.809e-01 5.883e-02]
 [1.245e+01 1.570e+01 8.257e+01 4.771e+02 1.278e-01 1.700e-01 1.578e-01
  8.089e-02 2.087e-01 7.613e-02]
 [1.825e+01 1.998e+01 1.196e+02 1.040e+03 9.463e-02 1.090e-01 1.127e-01
  7.400e-02 1.794e-01 5.742e-02]
 [1.371e+01 2.083e+01 9.020e+01 5.779e+02 1.189e-01 1.645e-01 9.366e-02
  5.985e-02 2.196e-01 7.451e-02]
 [1.300e+01 2.182e+01 8.750e+01 5.198e+02 1.273e-01 1.932e-01 1.859e-01
  9.353e-02 2.350e-01 7.389e-02]
 [1.246e+01 2.404e+01 8.397e+01 4.759e+02 1.186e-01 2.396e-01 2.273e-01
  8.543e-02 2.030e-01 8.243e-02]]
[ ]: b. Use the target variable as 'diagnosis' (Malignant – M, Benign – B)

[4]: y = df['diagnosis']
y

[4]: 0      M
1      M
2      M
3      M
4      M
      ..
564    M
565    M
566    M
567    M
568    B
Name: diagnosis, Length: 569, dtype: object

[ ]: c. Encode the categorical value of the target column to numerical value

[5]: df['diagnosis'] = df['diagnosis'].replace({'M':1, 'B':0})
df['diagnosis']

[5]: 0      1
1      1
2      1
3      1
4      1
      ..
564    1
565    1
566    1
567    1
568    0
Name: diagnosis, Length: 569, dtype: int64

[6]: from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
y = le.fit_transform(y)
y

[6]: array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1,
1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 1, 0, 1, 1,
0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 0, 0, 1, 1, 0, 1, 1, 1,
0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0,
0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 1, 0, 0, 0, 1, 1, 0, 1, 0, 1,
1, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,
0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0,
0, 1, 0, 0, 1, 1, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 1, 1,
1, 1, 0, 1, 1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 1, 1,
0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0,
0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 1,
1, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1,
1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1,
0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0,
0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0,
0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1,
0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0,
0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0,
0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0])

[ ]: d. Divide the data into training (75%) and testing set (25%)

[7]: x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.25)
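# note: no random_state is given here, so the split (and every metric
# computed below) will vary between runs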

[ ]: e. Perform the classification with K=3

[8]: knn = KNeighborsClassifier(n_neighbors = 3)
knn.fit(x_train, y_train)
y_pred = knn.predict(x_test)
print(y_pred)
print(y_test)
[0 1 0 1 1 1 0 0 0 1 0 0 0 0 1 1 0 0 1 1 0 0 0 0 1 0 0 1 0 0 0 0 0 1 0 1 0
0 0 1 0 1 0 1 0 1 0 0 0 0 1 0 0 1 1 1 0 0 1 1 1 0 0 1 0 0 0 0 0 0 0 0 0 0
1 0 0 1 0 0 0 0 1 0 0 1 0 0 1 0 0 1 0 0 1 0 1 0 1 1 1 0 0 0 1 0 0 1 1 0 0
1 1 0 1 0 1 0 1 0 1 1 1 1 0 1 1 1 0 0 0 1 0 0 0 0 0 0 0 1 0 0 1]
[0 1 0 1 1 1 1 0 1 1 0 1 0 0 1 1 0 0 1 1 0 0 0 0 1 0 0 0 0 1 0 0 0 1 0 1 0
0 0 1 0 1 0 1 1 1 0 0 0 0 1 0 0 1 1 1 0 0 1 1 1 0 1 1 0 1 0 0 0 0 0 0 1 0
1 0 0 1 0 0 1 0 1 0 0 0 0 0 1 0 0 1 0 0 0 0 1 0 0 1 1 0 0 0 1 0 0 1 1 0 0
1 1 0 1 0 0 0 1 0 1 1 1 0 0 1 1 1 0 0 0 1 0 0 0 0 0 0 0 1 0 0 1]

[ ]: f. Analyse the performance of the classifier with various performance
measures such as confusion matrix, accuracy, recall, precision,
specificity, f-score, Receiver Operating Characteristic (ROC) curve and
Area Under Curve (AUC) score.

[9]: conf_matrix = confusion_matrix(y_test, y_pred)
accuracy = accuracy_score(y_test, y_pred)
report = classification_report(y_test, y_pred)

[10]: tn, fp, fn, tp = conf_matrix.ravel()
specificity = tn / (tn + fp)
recall = tp / (tp + fn)
precision = tp / (tp + fp)
f1_score = 2 * precision * recall / (precision + recall)

[11]: y_pred_proba = knn.predict_proba(x_test)
fpr, tpr, thresholds = roc_curve(y_test, y_pred_proba[:, 1])
auc_score = auc(fpr, tpr)
plt.plot(fpr, tpr, label='ROC curve (area = %0.2f)' % auc_score)
plt.plot([0, 1], [0, 1], 'k--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC) Curve')
plt.legend(loc="lower right")
plt.show()

[ ]: g. Perform feature scaling on independent variables and analyse the performance

[12]: print("Confusion Matrix:\n", conf_matrix)
print("Accuracy: ", accuracy)
print("Specificity: ", specificity)
print("Recall: ", recall)
print("Precision: ", precision)
print("F1-score: ", f1_score)
print("Classification Report:\n", report)
print("AUC score: ", auc_score)
print("True positive: ", tp)
print("True negative: ", tn)
Confusion Matrix:
[[80 6]
[ 9 48]]
Accuracy: 0.8951048951048951
Specificity: 0.9302325581395349
Recall: 0.8421052631578947
Precision: 0.8888888888888888
F1-score: 0.8648648648648649
Classification Report:
              precision    recall  f1-score   support

           0       0.90      0.93      0.91        86
           1       0.89      0.84      0.86        57

    accuracy                           0.90       143
   macro avg       0.89      0.89      0.89       143
weighted avg       0.89      0.90      0.89       143

AUC score: 0.9190126478988168
True positive: 48
True negative: 80
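
Note that the metrics above are computed on unscaled features;
StandardScaler is imported in cell [2] but never applied. A minimal
sketch of the scaling step that task (g) asks for (fit the scaler on the
training split only, then transform both splits) could be:

scaler = StandardScaler()
x_train_scaled = scaler.fit_transform(x_train)   # learn mean/std from training data only
x_test_scaled = scaler.transform(x_test)         # reuse the training statistics

knn_scaled = KNeighborsClassifier(n_neighbors = 3)
knn_scaled.fit(x_train_scaled, y_train)
print(knn_scaled.score(x_test_scaled, y_test))   # accuracy on scaled features

Because KNN is distance-based, features with large ranges (such as
area_mean) otherwise dominate the Euclidean distance, so scaling
typically changes the results noticeably.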
[ ]: h. Change the value of K in KNN with 3, 5, 7, 9, 11 and tabulate the
various TP, TN, accuracy, f-score and AUC score obtained.

[17]: knn = KNeighborsClassifier(n_neighbors = 3)
knn.fit(x_train, y_train)
y_pred = knn.predict(x_test)
conf_matrix = confusion_matrix(y_test, y_pred)
# note: y_pred_proba computed in cell [11] (for K=3) is reused here and
# in the cells below, so auc_score does not change with K
fpr, tpr, thresholds = roc_curve(y_test, y_pred_proba[:, 1])
auc_score = auc(fpr, tpr)
accuracy = accuracy_score(y_test, y_pred)
report = classification_report(y_test, y_pred)
tn, fp, fn, tp = conf_matrix.ravel()
specificity = tn / (tn + fp)
recall = tp / (tp + fn)
precision = tp / (tp + fp)
f1_score = 2 * precision * recall / (precision + recall)
print(tp, tn)
print(accuracy)
print(f1_score)
print(auc_score)
48 80
0.8951048951048951
0.8648648648648649
0.9190126478988168
[13]: knn = KNeighborsClassifier(n_neighbors = 5)
knn.fit(x_train, y_train)
y_pred = knn.predict(x_test)
conf_matrix = confusion_matrix(y_test, y_pred)
fpr, tpr, thresholds = roc_curve(y_test, y_pred_proba[:, 1])
auc_score = auc(fpr, tpr)
accuracy = accuracy_score(y_test, y_pred)
report = classification_report(y_test, y_pred)
tn, fp, fn, tp = conf_matrix.ravel()
specificity = tn / (tn + fp)
recall = tp / (tp + fn)
precision = tp / (tp + fp)
f1_score = 2 * precision * recall / (precision + recall)
print(tp, tn)
print(accuracy)
print(f1_score)
print(auc_score)

46 82
0.8951048951048951
0.8598130841121495
0.9190126478988168
[14]: knn = KNeighborsClassifier(n_neighbors = 7)
knn.fit(x_train, y_train)
y_pred = knn.predict(x_test)
conf_matrix = confusion_matrix(y_test, y_pred)
fpr, tpr, thresholds = roc_curve(y_test, y_pred_proba[:, 1])
auc_score = auc(fpr, tpr)
accuracy = accuracy_score(y_test, y_pred)
report = classification_report(y_test, y_pred)
tn, fp, fn, tp = conf_matrix.ravel()
specificity = tn / (tn + fp)
recall = tp / (tp + fn)
precision = tp / (tp + fp)
f1_score = 2 * precision * recall / (precision + recall)
print(tp, tn)
print(accuracy)
print(f1_score)
print(auc_score)

47 83
0.9090909090909091
0.8785046728971964
0.9190126478988168
[15]: knn = KNeighborsClassifier(n_neighbors = 9)
knn.fit(x_train, y_train)
y_pred = knn.predict(x_test)
conf_matrix = confusion_matrix(y_test, y_pred)
fpr, tpr, thresholds = roc_curve(y_test, y_pred_proba[:, 1])
auc_score = auc(fpr, tpr)
accuracy = accuracy_score(y_test, y_pred)
report = classification_report(y_test, y_pred)
tn, fp, fn, tp = conf_matrix.ravel()
specificity = tn / (tn + fp)
recall = tp / (tp + fn)
precision = tp / (tp + fp)
f1_score = 2 * precision * recall / (precision + recall)
print(tp, tn)
print(accuracy)
print(f1_score)
print(auc_score)

45 82
0.8881118881118881
0.8490566037735849
0.9190126478988168
[16]: knn = KNeighborsClassifier(n_neighbors = 11)
knn.fit(x_train, y_train)
y_pred = knn.predict(x_test)
conf_matrix = confusion_matrix(y_test, y_pred)
fpr, tpr, thresholds = roc_curve(y_test, y_pred_proba[:, 1])
auc_score = auc(fpr, tpr)
accuracy = accuracy_score(y_test, y_pred)
report = classification_report(y_test, y_pred)
tn, fp, fn, tp = conf_matrix.ravel()
specificity = tn / (tn + fp)
recall = tp / (tp + fn)
precision = tp / (tp + fp)
f1_score = 2 * precision * recall / (precision + recall)
print(tp, tn)
print(accuracy)
print(f1_score)
print(auc_score)

45 83
0.8951048951048951
0.8571428571428572
0.9190126478988168
[ ]: | K value | TP | TN | Accuracy           | F-score            | AUC-score          |
|---------|----|----|--------------------|--------------------|--------------------|
| 3       | 48 | 80 | 0.8951048951048951 | 0.8648648648648649 | 0.9190126478988168 |
| 5       | 46 | 82 | 0.8951048951048951 | 0.8598130841121495 | 0.9190126478988168 |
| 7       | 47 | 83 | 0.9090909090909091 | 0.8785046728971964 | 0.9190126478988168 |
| 9       | 45 | 82 | 0.8881118881118881 | 0.8490566037735849 | 0.9190126478988168 |
| 11      | 45 | 83 | 0.8951048951048951 | 0.8571428571428572 | 0.9190126478988168 |
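
The AUC-score column is identical for every K because y_pred_proba is
never recomputed after cell [11], so each cell's ROC curve is built from
the K=3 probabilities. A compact sketch of a loop that recomputes the
probabilities per K (reusing the variables and imports above) could be:

for k in [3, 5, 7, 9, 11]:
    knn = KNeighborsClassifier(n_neighbors = k)
    knn.fit(x_train, y_train)
    y_pred = knn.predict(x_test)
    proba = knn.predict_proba(x_test)[:, 1]          # recomputed for every K
    tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
    fpr, tpr, _ = roc_curve(y_test, proba)
    f1 = 2 * tp / (2 * tp + fp + fn)                 # equivalent F1 formula
    print(k, tp, tn, accuracy_score(y_test, y_pred), f1, auc(fpr, tpr))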

[ ]: i. Analyse for which K value the classification algorithm provides
better performance.

[46]: from sklearn.datasets import load_iris

# a separate analysis on the Iris dataset: plot training and testing
# accuracy against the K values used above
irisData = load_iris()

X = irisData.data
y = irisData.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2,
                                                    random_state=42)

neighbors = [3, 5, 7, 9, 11]
train_accuracy = np.empty(len(neighbors))
test_accuracy = np.empty(len(neighbors))

for i, k in enumerate(neighbors):
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    train_accuracy[i] = knn.score(X_train, y_train)
    test_accuracy[i] = knn.score(X_test, y_test)

plt.plot(neighbors, test_accuracy, label = 'Testing dataset Accuracy')
plt.plot(neighbors, train_accuracy, label = 'Training dataset Accuracy')
plt.legend()
plt.xlabel('n_neighbors')
plt.ylabel('Accuracy')
plt.show()

[ ]: Therefore, for the value K=7, the classification algorithm provides
the best performance.

[ ]: RESULT:
Hence, we were able to implement KNN classification and analyse its
performance metrics, and the output has been verified.
