Data Mining Portfolio
I will begin by creating the utility matrix and centering it by subtracting each user's mean rating from that user's ratings.
In [3]: utility.head()
Out[3]:
movie_id 1 2 3 4 5 6 7 8 9 10 ... 1673 1674 1675 1676 1677 1678
user_id
1 5.0 3.0 4.0 3.0 3.0 5.0 4.0 1.0 5.0 3.0 ... NaN NaN NaN NaN NaN NaN
2 4.0 NaN NaN NaN NaN NaN NaN NaN NaN 2.0 ... NaN NaN NaN NaN NaN NaN
3 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN
4 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN
5 4.0 3.0 NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN
In [4]: user_means = utility.mean(axis=1)
Out[4]: user_id
1 3.610294
2 3.709677
3 2.796296
4 4.333333
5 2.874286
dtype: float64
In [5]: utility_centered.head()
Out[5]:
1 2 3 4 5 6 7 8 9 10
user_id
1 1.389706 -0.709677 1.203704 -1.333333 0.125714 1.364929 0.034739 -2.79661 0.727273 -1.206522
2 0.389706 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.00000 0.000000 -2.206522
3 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.00000 0.000000 0.000000
4 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.00000 0.000000 0.000000
5 0.389706 -0.709677 0.000000 0.000000 0.000000 0.000000 0.000000 0.00000 0.000000 0.000000
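For reference, here is a minimal sketch of how a matrix like this could be built and centered, assuming a MovieLens-style ratings DataFrame with user_id, movie_id, and rating columns (the `ratings` variable and its column names are assumptions, since the loading code is not shown above):
# Sketch only: assumes a ratings DataFrame with user_id, movie_id, rating columns.
utility = ratings.pivot(index='user_id', columns='movie_id', values='rating')
user_means = utility.mean(axis=1)                    # each user's average rating
utility_centered = utility.sub(user_means, axis=0)   # subtract each user's own mean, row-wise
utility_centered = utility_centered.fillna(0)        # unrated movies become 0 for the similarity step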
Next, I will calculate the cosine similarity between user 1 and every other user, appending each result to a list that is then sorted to find the ten users whose ratings are closest to user 1's.
In [6]: lst = []
        feat1 = utility_centered.loc[1:1]  # user 1's centered ratings as a one-row frame
        for x in range(2, utility_centered.shape[0] + 1):
            feat2 = utility_centered.loc[x:x]
            lst.append([metrics.pairwise.cosine_similarity(feat1, feat2)[0][0], x])
In [7]: user_means[1]
Out[7]: 3.6102941176470589
In [8]: lst = sorted(lst, reverse=True)
lst = lst[0:10]
lst
In [9]: top_10 = []
        for x in lst:
            top_10.append(x[1])
        top_10 = sorted(top_10)
        top_10
Out[9]: [44, 267, 276, 457, 592, 606, 643, 738, 757, 916]
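As an aside, the same top-10 list could be obtained without the explicit loop. A hedged, vectorized sketch (not part of the original run, and assuming the user ids form the row index in the order shown above, with user 1 in the first row):
import numpy as np
sims = metrics.pairwise.cosine_similarity(utility_centered.loc[1:1], utility_centered)[0]
sims[0] = -np.inf                             # exclude user 1 itself (the first row)
top_positions = np.argsort(sims)[::-1][:10]   # positions of the ten largest similarities
sorted(utility_centered.index[top_positions])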
Now, with the ten users most similar to user 1 by cosine similarity being users 44, 267, 276, 457, 592, 606, 643, 738, 757, and 916, we can look at what each of them rated movie 508 [The People vs. Larry Flynt (1996)], take the mean of those mean-centered ratings, and add it to user 1's average rating to obtain our predicted rating for user 1 (the process Leskovec describes in Mining of Massive Datasets, available at http://infolab.stanford.edu/~ullman/mmds/book.pdf). Note that some of these ten users did not rate movie 508 and are therefore not counted in the mean (for example, user 44 did not rate the movie at all, while user 276 did, so only 276 counts towards the mean; in total, four of the ten users actually rated movie 508).
In [10]: total = 0
         count = 0
         for x in top_10:
             val = utility_centered.loc[x, 508]
             if val:  # unrated movies are 0 in the centered matrix and get skipped
                 total += val
                 count += 1
         total / count
Out[10]: 0.6724137931034484
In [11]: user_means[1] + total / count
Out[11]: 4.2827079107505073
With all that work finally done, we can conclude that user 1 would likely rate The People vs. Larry Flynt about a 4.28 out of 5. One thing is for sure: an algorithm using this approach would recommend the movie to user 1, since multiple similar users watched this selection and rated it highly.
Using Decision Trees to Draw Conclusions from Two Distributions
We will begin by randomly drawing two samples from Normal distributions, each with N = 1000: x1 is centered at mean 5 and x2 at mean -5, both with standard deviation 2.
In [12]: N = 1000
x1 = np.random.normal(5, 2, N)
c1 = np.repeat('0', N)
x2 = np.random.normal(-5, 2, N)
c2 = np.repeat('1', N)
In [13]: plt.hist(x1, bins=100)
[Histogram of x1: 100 bins spanning roughly -1.2 to 11.2, with the mass concentrated around the mean of 5.]
In [14]: plt.hist(x2, bins=100)
[Histogram of x2: 100 bins spanning roughly -13.2 to 2.0, with the mass concentrated around the mean of -5.]
In [15]: print("x1 mean: {}, x2 mean: {}".format(x1.mean(), x2.mean()))
Out[16]: (summary statistics of the 2,000 combined x1 and x2 values)
value
count 2000.000000
mean 0.005631
std 5.391157
min -13.214446
25% -5.019632
50% 0.490347
75% 5.038763
max 11.244996
We will now take an 80% sample of the combined data and use it to train a model to differentiate between the two distributions (we expect positive numbers to be grouped with x1 and negative numbers with x2). We will then use the remaining 20% of the data to test the model; the results of this test are below.
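The modelling cell itself did not survive the export. A minimal sketch of what it could look like, assuming a scikit-learn DecisionTreeClassifier and the variable names (clf, expected, predicted) used in the cells below; the split call and its parameters are assumptions:
import numpy as np
from sklearn import tree, metrics
from sklearn.model_selection import train_test_split

X = np.concatenate([x1, x2]).reshape(-1, 1)   # one feature: the sampled value
y = np.concatenate([c1, c2])                  # class labels: '0' for x1, '1' for x2

# hold out 20% of the data for testing
X_train, X_test, y_train, expected = train_test_split(X, y, test_size=0.2)

clf = tree.DecisionTreeClassifier()
clf.fit(X_train, y_train)
predicted = clf.predict(X_test)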
print(metrics.classification_report(expected, predicted))
Out[19]: -0.10302744805812836   (the threshold of the fitted tree's root split; see the discussion below)
In [20]: from sklearn import tree
tree.export_graphviz(clf, out_file='tree.dot')
import pydot
(graph,) = pydot.graph_from_dot_file('tree.dot')
graph.write_png('tree.png')
Out[20]: True
[Rendered decision tree (tree.png); its root split threshold is discussed below.]
By combining two Normal distributions centered on opposite sides of the y-axis, we would expect the decision threshold between the two classes to be 0 on average: a negative number should fall into the negative distribution N(-5, 2) and a positive value into the positive distribution N(5, 2). In this randomized run, however, the tree's split actually tests whether the value is <= -0.103 or > -0.103, a threshold very close to 0 but not quite 0. In other runs, the threshold has come out nearly as large as 0.5 in magnitude. Because we generate the distributions randomly and then randomly sample from them to build the decision tree, the learned threshold is some positive or negative value scattered around 0 that decides which of the two Normal distributions a value is assigned to.
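The cell that produced the -0.103 value shown as Out[19] is not visible in the export; a minimal sketch of how the root-split threshold of a fitted scikit-learn tree can be inspected (assuming the clf variable from In [20]):
clf.tree_.threshold[0]   # threshold of the root node's split, e.g. -0.103 in this run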
In [23]: dataset = datasets.load_iris()
dataset.keys()
df = pd.DataFrame(dataset.data, columns=dataset.feature_names)
df['class'] = dataset.target_names[dataset.target]
df.head()
Out[23]:
sepal length (cm) sepal width (cm) petal length (cm) petal width (cm) class
In [24]: X[:,0].var()   # variance of the first principal component (X holds the PCA-transformed iris data)
Out[24]: 4.1966751631979804
In [27]: plt.scatter(X[:,0], df['sepal width (cm)'], c=dataset.target)
[Scatter plot: PC1 vs. sepal width, colored by class]
Out[28]: -0.38999337904750564   (correlation of PC1 with sepal width)
Out[30]: 0.89754488494076157
In [31]: plt.scatter(X[:,0], df['petal length (cm)'], c=dataset.target)
[Scatter plot: PC1 vs. petal length, colored by class]
Out[32]: 0.99785405063545973   (correlation of PC1 with petal length)
Out[34]: 0.96648418315379248
From the four scatter plots above, it is clear that the plot comparing the first principal component (PC1) with petal length produces the tightest, most correlated pattern. Interpreting cosine distance as a measure of the angle between the two axes, this pairing comes closest to minimizing that distance, especially compared with a pairing such as PC1 vs. sepal width. Inspection of Problem 3 above shows that PC1 captures the largest share of the variance among the components, just as petal length does among the original four features. Computing the correlations confirms this: PC1 vs. petal length gives a correlation of about 0.998, nearly 1, the strongest linear relationship a correlation can express, while PC1 vs. sepal width gives the correlation closest to 0. This confirms that we have identified the correct pair of features.
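The cells behind the correlation values quoted above are not shown in the export; a hedged sketch of how they could be computed, assuming X is the PCA-transformed data and df the iris DataFrame from In [23]:
import numpy as np
np.corrcoef(X[:, 0], df['petal length (cm)'])[0, 1]   # ~0.998, the tightest relationship
np.corrcoef(X[:, 0], df['sepal width (cm)'])[0, 1]    # ~-0.39, the correlation closest to 0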
In [35]: pca = decomposition.PCA(n_components=4)
pca.fit(df.iloc[:,0:4])
X = pca.transform(df.iloc[:,0:4])
pca.explained_variance_
In [36]: pca.explained_variance_.sum()
Out[36]: 4.569291275167787
In [37]: df.var().sum()
Out[37]: 4.5692912751677826
In [38]: pca.explained_variance_ratio_.cumsum()
As alluded to in Problem 3, when we compute the PCA with the same number of components as there are original features, we still capture the full variance of the original features. The advantage of PCA over the original features, however, is that we can capture a large majority of the variability with fewer features. We see this in the cumulative explained variance ratios: one component captures about 92.4% of the total variance, two capture about 97.8%, three capture nearly 99.5%, and all four capture the total variability. Whereas the best single original attribute gets us only around 66%, a single feature after PCA captures drastically more variability. Reaching 95% of the total variability with the original features would require three of them, but after PCA it takes only two components to get well above the 95% mark. Hence, the PCA succeeded in reducing our dimensionality while keeping most of the variability we have to work with.
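As a side note (not part of the original notebook), scikit-learn's PCA can also be asked directly for the number of components needed to reach a target variance fraction; a minimal sketch:
from sklearn import decomposition
pca_95 = decomposition.PCA(n_components=0.95)   # keep enough components for 95% of the variance
pca_95.fit(df.iloc[:, 0:4])
pca_95.n_components_                            # expected to be 2, per the cumulative ratios above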
%matplotlib inline
In [40]: dataset = load_boston()
print(dataset.DESCR)
Boston House Prices dataset
===========================
Notes
------
Data Set Characteristics:
This dataset was taken from the StatLib library which is maintained at Carnegie Mellon University.
The Boston house-price data has been used in many machine learning papers that address regression problems.
**References**
We will begin by scaling the dataset and then find the number of clusters that maximizes the model's overall silhouette score.
Out[41]:
CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX
0 -0.417713 0.284830 -1.287909 -0.272599 -0.144217 0.413672 -0.120013 0.140214 -0.982843 -0.666608
1 -0.415269 -0.487722 -0.593381 -0.272599 -0.740262 0.194274 0.367166 0.557160 -0.867883 -0.987329
2 -0.415272 -0.487722 -0.593381 -0.272599 -0.740262 1.282714 -0.265812 0.557160 -0.867883 -0.987329
3 -0.414680 -0.487722 -1.306878 -0.272599 -0.835284 1.016303 -0.809889 1.077737 -0.752922 -1.106115
4 -0.410409 -0.487722 -1.306878 -0.272599 -0.835284 1.228577 -0.511180 1.077737 -0.752922 -1.106115
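The scaling and silhouette-scoring cells are not fully visible in the export; a hedged sketch of the approach described above (the construction of df from the Boston data and the exact KMeans settings are assumptions):
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
import pandas as pd

df = pd.DataFrame(dataset.data, columns=dataset.feature_names)             # Boston features
df_scaled = pd.DataFrame(StandardScaler().fit_transform(df), columns=df.columns)

for k in range(2, 7):                                                      # silhouette needs k >= 2
    labels = KMeans(n_clusters=k, init='k-means++').fit_predict(df_scaled)
    print(k, silhouette_score(df_scaled, labels))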
In [43]: # just to double check, here's an elbow analysis mostly
         # taken from http://www.awesomestats.in/python-cluster-validation/
         X = df_scaled[list(df_scaled.columns)]
         cluster_range = range(1, 7)
         cluster_errors = []
         # the fitting loop did not survive the export; a standard reconstruction:
         for num_clusters in cluster_range:
             model = KMeans(num_clusters)
             model.fit(X)
             cluster_errors.append(model.inertia_)
         clusters_df = pd.DataFrame({"num_clusters": cluster_range, "cluster_errors": cluster_errors})
         plt.figure(figsize=(12,6))
         plt.plot(clusters_df.num_clusters, clusters_df.cluster_errors, marker = "o")
In [44]: # clearly from the silhouette scores above, the optimal value of k is 2 for the scaled data
         clust_model = KMeans(n_clusters = 2, init='k-means++')
         clust_labels = clust_model.fit_predict(df_scaled)
         print(clust_labels)
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 0 0 1 0 0 0 0 0 1 0 1 1 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0]
0.359977342374
With the optimal number of clusters being k = 2, we will continue by comparing the mean feature values in each cluster as well as the centroid coordinates.
In [47]: df_scaled['CLUST'] = clust_labels
C1 = df_scaled.loc[df_scaled['CLUST'] == 1]
C1.mean()
Cluster coordinates:
[[-0.38803894 0.26239167 -0.62036759 0.00291182 -0.58467512 0.24331476
-0.43510819 0.45722226 -0.58380115 -0.63145993 -0.28580826 0.32645106
-0.44642061]
[ 0.72127012 -0.48772236 1.15311264 -0.00541237 1.086769 -0.45226302
0.80876041 -0.8498651 1.0851445 1.1737306 0.53124811 -0.60679321
0.82978746]]
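For completeness, a short sketch (not from the original notebook) that lines the per-cluster feature means up against the fitted centroids; for k-means the two should essentially coincide:
import pandas as pd

features = df_scaled.drop(columns='CLUST')
cluster_means = features.groupby(df_scaled['CLUST']).mean()                # mean of each feature per cluster
centroids = pd.DataFrame(clust_model.cluster_centers_, columns=features.columns)
print(cluster_means)
print(centroids)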
In [49]: sns.clustermap(df_scaled, method='average', metric='euclidean', z_score=None, standard_scale=None)
//anaconda/lib/python3.5/site-packages/matplotlib/cbook/deprecation.py:106: MatplotlibDeprecationWarning: The axisbg attribute was deprecated in version 2.0. Use facecolor instead.
  warnings.warn(message, mplDeprecation, stacklevel=1)