IRIS data set (Multivariate Gaussian Classifier, PCA, Python)

Michael Galarnyk

IRIS data set (Multivariate Gaussian Classifier, PCA, Python)

IRIS data set analysis using python (Multivariate Gaussian Classifier, PCA, Python) Download the IRIS data set from: https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data This is a data set of 150 points in R4, with three classes; refer to the website for more details of the features and classes. (a) Use a PCA projection to 2d to visualize the entire data set. You should plot different classes using different colors/shapes. Do the classes seem well-separated from each other? (b) Now build a classifier for this data set, based on a generative model.

IRIS data set March 22, 2016 IRIS data set Download the IRIS data set from: https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data This is a data set of 150 points in R4, with three classes; refer to the website for more details of the features and classes. (a) Use a PCA projection to 2d to visualize the entire data set. You should plot different classes using different colors/shapes. Do the classes seem well-separated from each other? (b) Now build a classifier for this data set, based on a generative model. In [24]: %matplotlib inline import pandas as pd import matplotlib.pyplot as plt import numpy as np import random from sklearn.decomposition import PCA from scipy.stats import multivariate_normal 0.1 Importing iris dataset In [25]: !wget https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data -O iris.txt --2016-03-13 22:30:19-- https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data Resolving archive.ics.uci.edu (archive.ics.uci.edu)... 128.195.10.249 Connecting to archive.ics.uci.edu (archive.ics.uci.edu)|128.195.10.249|:443... connected. HTTP request sent, awaiting response... 200 OK Length: 4551 (4.4K) [text/plain] Saving to: ‘iris.txt’ iris.txt 100%[=====================>] 4.44K --.-KB/s in 0s 2016-03-13 22:30:19 (72.3 MB/s) - ‘iris.txt’ saved [4551/4551] 0.2 (a) Read Data and Perform PCA to 2 components In [35]: iris_df = pd.read_csv(’iris.txt’, names=[’sepal_len’,’sepal_wid’,’petal_len’,’petal wid’,’class pca = PCA(n_components = 2) pca.fit(iris_df.ix[:,0:4].values) X = pca.transform(iris_df.ix[:,0:4].values) iris_df[’P1’] = X[:,0] iris_df[’P2’] = X[:,1] iris_df.head(5) Out[35]: 0 sepal len sepal wid petal len petal wid 5.1 3.5 1.4 0.2 1 class P1 P2 Iris-setosa -2.684207 -0.326607 1 2 3 4 4.9 4.7 4.6 5.0 3.0 3.2 3.1 3.6 1.4 1.3 1.5 1.4 0.2 0.2 0.2 0.2 Iris-setosa Iris-setosa Iris-setosa Iris-setosa In [27]: classes = iris_df[’class’].unique() colors = [(0,.8,1),(1,.3,.2),(0,.7, .1)] plt.figure(figsize = (8,8)) plt.xlabel(’P1’, fontsize = 15) plt.ylabel(’P2’, fontsize = 15) plt.title(’2 component PCA’, fontsize = 20) for cl, color in zip(classes,colors): P1 = iris_df[iris_df[’class’] == cl][’P1’].values P2 = iris_df[iris_df[’class’] == cl][’P2’].values plt.scatter(P2, P1, c = color, s = 50) plt.legend(classes) plt.grid() 2 -2.715391 0.169557 -2.889820 0.137346 -2.746437 0.311124 -2.728593 -0.333925 The three classes appear to be well separated! iris-virginica and iris-versicolor could be better separated, but it is good! 0.3 (b) Train with Multivariate Gaussian Classifier In [28]: df_train = iris_df[iris_df[’class’] == classes[0]][0:35] for c in classes[1:]: df_train = pd.concat([df_train, iris_df[iris_df [’class’] == c][0:35]]) df_test = iris_df[iris_df[’class’] == classes[0]][35:] for c in classes[1:]: df_test = pd.concat([df_test, iris_df[iris_df[’class’] == c][35:]]) In [29]: def classify(sample_df, valid_df): prob = [] for i in xrange(3): cond = sample_df[’class’] == classes[i] mean = np.mean(sample_df[cond].ix[:,0:4].values, axis = 0) cov = np.cov(np.transpose(sample_df[cond].ix[:,0:4].values)) func = multivariate_normal(mean =mean, cov=cov) prob.append(func.logpdf(valid_df.ix[:,0:4])) max_prob = np.argmax(prob, axis = 0) tf_number_error = [classes[i] != j for i,j in zip(max_prob, valid_df[’class’])] error_percent = np.sum(tf_number_error)/len(valid_df) return error_percent 0.4 Train the dataset In [30]: def find_error(sample_df, valid_df): prob, label, flower = [], [], [] for i in xrange(3): cond = sample_df[’class’] == classes[i] mean = np.mean(sample_df[cond].ix[:,0:4].values, axis = 0) cov = np.cov(np.transpose(sample_df[cond].ix[:,0:4].values)) func = multivariate_normal(mean, cov) prob.append(func.logpdf(valid_df.ix[:,0:4])) max_prob = np.argmax(prob, axis = 0) prob = np.matrix(prob) for i,j in zip(max_prob, valid_df[’class’]): if classes[i] != j: flower.append(j) label.append(classes[i]) return [flower, label] In [31]: #Create dataframe of flowers and predictions [flower, label] = find_error(df_train, df_train) 3 error_df1 = pd.DataFrame({’prediction’: label, ’flower’: flower}) error_df1.head() Out[31]: 0 1 0.5 flower Iris-versicolor Iris-virginica prediction Iris-virginica Iris-versicolor Test Dataset In [33]: error_percent = classify(df_train, df_test) print ’Error: ’, error_percent Error: 0 100% accuracy with 0 percent error on our test dataset. 4

Log In

IRIS data set (Multivariate Gaussian Classifier, PCA, Python)

Related papers

Related papers

Related topics