Unit-6 Clustering Techniques

Download as ppt, pdf, or txt
Download as ppt, pdf, or txt
You are on page 1of 110

Unit-6 Clustering

Techniques

1
Hierarchical Clustering
• Hierarchical clustering, as the name suggests is
an algorithm that builds hierarchy of clusters.
• This algorithm starts with all the data points
assigned to a cluster of their own. Then two
nearest clusters are merged into the same
cluster. In the end, this algorithm terminates
when there is only a single cluster left.
• The results of hierarchical clustering can be
shown using dendrogram. The dendrogram
can be interpreted as:
2
3
• At the bottom, we start with 25 data points,
each assigned to separate clusters. Two closest
clusters are then merged till we have just one
cluster at the top. The height in the dendrogram
at which two clusters are merged represents the
distance between two clusters in the data space.
• The decision of the no. of clusters that can best
depict different groups can be chosen by
observing the dendrogram. The best choice of
the no. of clusters is the no. of vertical lines in
the dendrogram cut by a horizontal line that can
transverse the maximum distance vertically
without intersecting a cluster.
4
• In the above example, the best choice of no. of
clusters will be 4 as the red horizontal line in the
dendrogram below covers maximum vertical
distance AB.

5
• This algorithm has been implemented above using
bottom up approach. It is also possible to follow top-
down approach starting with all data points assigned in
the same cluster and recursively performing splits till
each data point is assigned a separate cluster.
• The decision of merging two clusters is taken on the basis of
closeness of these clusters. There are multiple metrics for
deciding the closeness of two clusters :
– Euclidean distance: ||a-b||2 = √(Σ(ai-bi))
– Squared Euclidean distance: ||a-b||22 = Σ((ai-bi)2)
– Manhattan distance: ||a-b||1 = Σ|ai-bi|
– Maximum distance:||a-b||INFINITY = maxi|ai-bi|
– Mahalanobis distance: √((a-b)T S-1 (-b))   {where, s :
covariance matrix} 6
• Difference between K Means and
Hierarchical clustering
• Hierarchical clustering can’t handle big data
well but K Means clustering can. This is
because the time complexity of K Means is
linear i.e. O(n) while that of hierarchical
clustering is quadratic i.e. O(n2).
• In K Means clustering, since we start with
random choice of clusters, the results
produced by running the algorithm multiple
times might differ. While results are
reproducible in Hierarchical clustering. 
7
• K Means is found to work well when the shape
of the clusters is hyper spherical (like circle in
2D, sphere in 3D).
• K Means clustering requires prior knowledge
of K i.e. no. of clusters you want to divide your
data into. But, you can stop at whatever
number of clusters you find appropriate in
hierarchical clustering by interpreting the
dendrogram

8
• Applications of Clustering
• Clustering has a large no. of applications
spread across various domains. Some of the
most popular applications of clustering are:
• Recommendation engines
• Market segmentation
• Social network analysis
• Search result grouping
• Medical imaging
• Image segmentation
• Anomaly detection
9
• Hierarchical clustering is based on the general
concept of finding a hierarchy of partial
clusters, built using either a bottom-up or a
top-down approach. More formally, they are
called:
• Agglomerative clustering: The process starts
from the bottom (each initial cluster is made
up of a single element) and proceeds by
merging the clusters until a stop criterion is
reached. In general, the target has a
sufficiently small number of clusters at the
end of the process.
10
• Divisive clustering: In this case, the initial state is
a single cluster with all samples and the process
proceeds by splitting the intermediate cluster
until all elements are separated. At this point, the
process continues with an aggregation criterion
based on the dissimilarity between elements.

11
• Divisive method-In divisive or top-down
clustering method we assign all of the
observations to a single cluster and then
partition the cluster to two least similar
clusters.
• Finally, we proceed recursively on each cluster
until there is one cluster for each observation.
• There is evidence that divisive algorithms
produce more accurate hierarchies than
agglomerative  algorithms in some
circumstances but is conceptually more
complex.
12
What is Euclidean Distance?
• According to the Euclidean distance formula,
the distance between two points in the plane
with coordinates (x, y) and (a, b) is given by

13
• The horizontal distance between the points is 4 and
the vertical distance is 3. Let's introduce one more
point (-2, -1). With this small addition we get a right-
angled triangle with legs 3 and 4. By the Pythagorean
theorem, the square of the hypotenuse
is (hypotenuse)² = 3² + 4². Which gives the length of
the hypotenuse as 5, same as the distance between
the two points according to the distance formula.

14
• Agglomerative method
• In agglomerative or bottom-up
clustering method we assign each observation
to its own cluster.
• Then, compute the similarity (e.g., distance)
between each of the clusters and join the two
most similar clusters.
• Finally, repeat steps 2 and 3 until there is only
a single cluster left. The related algorithm is
shown below.

15
16
• Before any clustering is performed, it is required
to determine the proximity matrix containing
the distance between each point using a
distance function.
• Then, the matrix is updated to display the
distance between each cluster. The following
three methods differ in how the distance
between each cluster is measured.
• Advantages
• It can produce an ordering of the objects, which
may be informative for data display.
• Smaller clusters are generated, which may be
helpful for discovery. 17
• Disadvantages
• No provision can be made for a relocation of
objects that may have been 'incorrectly'
grouped at an early stage. The result should
be examined closely to ensure it makes sense.
• Use of different distance metrics for
measuring distances between clusters may
generate different results. Performing
multiple experiments and comparing the
results is recommended to support the
veracity of the original results.

18
• Single Linkage-In single linkage hierarchical
clustering, the distance between two clusters
is defined as the shortest distance between
two points in each cluster. For example, the
distance between clusters “r” and “s” to the
left is equal to the length of the arrow
between their two closest points.

19
• Complete Linkage- In complete linkage
hierarchical clustering, the distance between
two clusters is defined as the longest distance
between two points in each cluster. For
example, the distance between clusters “r” and
“s” to the left is equal to the length of the arrow
between their two furthest points.

20
• Average Linkage-In average linkage hierarchical
clustering, the distance between two clusters is
defined as the average distance between each
point in one cluster to every point in the other
cluster. For example, the distance between
clusters “r” and “s” to the left is equal to the
average length each arrow between connecting
the points of one cluster to the other.

21
• Ward's linkage: In this method, all clusters are
considered and the algorithm computes the
sum of squared distances within the clusters
and merges them to minimize it. From a
statistical viewpoint, the process of
agglomeration leads to a reduction in the
variance of each resulting cluster. The
measure is:

• The Ward's linkage supports only the


Euclidean distance.
22
• Dendrograms
• To better understand the agglomeration
process, it's useful to introduce a graphical
method called a dendrogram, which shows in
a static way how the aggregations are
performed, starting from the bottom (where
all samples are separated) till the top (where
the linkage is complete). Unfortunately, scikit-
learn doesn't support them. However, SciPy
(which is a mandatory requirement for it)
provides some useful built-in functions.
• Let's start by creating a dummy dataset: 
23
from sklearn.datasets import make_blobs
>>> nb_samples = 25
>>> X, Y = make_blobs(n_samples=nb_samples,
n_features=2, centers=3, cluster_std=1.5)

• To avoid excessive complexity in the resulting


plot, the number of samples has been kept
very low. In the following figure, there's a
representation of the dataset:

24
25
• Now we can compute the dendrogram. The first step is
computing a distance matrix:
from scipy.spatial.distance import pdist
>>> Xdist = pdist(X, metric='euclidean')
• We have chosen a Euclidean metric, which is the most
suitable in this case. At this point, it's necessary to
decide which linkage we want. Let's take Ward;
however, all known methods are supported:
from scipy.cluster.hierarchy import linkage
>>> Xl = linkage(Xdist, method='ward')
• Now, it's possible to create and visualize a dendrogram:
from scipy.cluster.hierarchy import dendrogram
>>> Xd = dendrogram(Xl)
26
• In the x axis, there are the samples (numbered progressively), while the y
axis represents the distance. Every arch connects two clusters that are
merged together by the algorithm. For example, 23 and 24 are single
elements merged together. The element 13 is then aggregated to the
resulting cluster, and so the process continues.

27
• As you can see, if we decide to cut the graph at
the distance of 10, we get two separate
clusters: the first one from 15 to 24 and the
other one from 0 to 20. Looking at the previous
dataset plot, all the points with Y < 10 are
considered to be part of the first cluster, while
the others belong to the second cluster.
• If we increase the distance, the linkage
becomes very aggressive (particularly in this
example with only a few samples) and with
values greater than 27, only one cluster is
generated (even if the internal variance is quite
high!). 28
• Agglomerative clustering in scikit-learn
• Let's consider a more complex dummy dataset with 8
centers: 
>>> nb_samples = 3000
>>> X, _ = make_blobs(n_samples=nb_samples,
n_features=2, centers=8, cluster_std=2.0)
• A graphical representation is shown in the following
figure:

29
• Let's start with a complete linkage (AgglomerativeClustering uses the
method fit_predict() to train the model and transform the original
dataset):
• from sklearn.cluster import AgglomerativeClustering
>>> ac = AgglomerativeClustering(n_clusters=8, linkage='complete')
>>> Y = ac.fit_predict(X)
• A plot of the result (using both different markers and colors) is shown in
the following figure:

The result is totally bad

30
• Let's now consider the average linkage:
>>> ac = AgglomerativeClustering(n_clusters=8,
linkage='average') In this case, the clusters are better defined,
even if some of them could have become
>>> Y = ac.fit_predict(X) really small. It can also be useful to try
other metrics
• The result is shown in the following screenshot:

31
• Ward's linkage, that can be used only with a
Euclidean metric (also the default one):
>>> ac = AgglomerativeClustering(n_clusters=8)
>>> Y = ac.fit_predict(X)
• The resulting plot is shown in the following figure:

32
What is KNN?
• In four years of my career into analytics I have built
more than 80% of classification models and just 15-
20% regression models. These ratios can be more or
less generalized throughout the industry. The reason
of a bias towards classification models is that most
analytical problem involves making a decision.
• For instance will a customer attrite or not, should we
target customer X for digital campaigns, whether
customer has a high potential or not etc. These
analysis are more insightful and directly links to an
implementation roadmap.

33
• When do we use KNN algorithm?
• KNN can be used for both classification and
regression predictive problems. However, it is
more widely used in classification problems in
the industry.
• To evaluate any technique we generally look
at 3 important aspects:
• 1. Ease to interpret output
• 2. Calculation time
• 3. Predictive Power

34
• Let us take a few examples to  place KNN in
the scale :

• KNN algorithm fairs across all parameters of


considerations. It is commonly used for its
easy of interpretation and low calculation
time.
35
• How does the KNN algorithm work?
• Let’s take a simple case to understand this
algorithm. Following is a spread of red circles
(RC) and green squares (GS) :

You intend to find out the class of the blue star (BS) .
BS can either be RC or GS and nothing else. The “K” is
KNN algorithm is the nearest neighbors we wish to
take vote from. Let’s say K = 3.
36
• Hence, we will now make a circle with BS as
center just as big as to enclose only three
datapoints on the plane. Refer to following
diagram for more details:

37
• The three closest points to BS is all RC. Hence,
with good confidence level we can say that
the BS should belong to the class RC.
• Here, the choice became very obvious as all
three votes from the closest neighbor went to
RC. The choice of the parameter K is very
crucial in this algorithm.
• Next we will understand what are the factors
to be considered to conclude the best K.

38
• How do we choose the factor K?
• First let us try to understand what exactly
does K influence in the algorithm.
• If we see the last example, given that all the 6
training observation remain constant, with a
given K value we can make boundaries of each
class. These boundaries will segregate RC from
GS.
• The same way, let’s try to see the effect of
value “K” on the class boundaries. Following
are the different boundaries separating the
two classes with different values of K.
39
40
• If you watch carefully, you can see that the
boundary becomes smoother with increasing
value of K. With K increasing to infinity it
finally becomes all blue or all red depending
on the total majority. 
• The training error rate and the validation error
rate are two parameters we need to access on
different K-value.
• Following is the curve for the training error
rate with varying value of K :

41
• As you can see, the error rate at K=1 is always zero for the
training sample. This is because the closest point to any training
data point is itself. Hence the prediction is always accurate with
K=1. If validation error curve would have been similar, our
choice of K would have been 1. Following is the validation error
curve with varying value of K:
42
43
• We can implement a KNN model by following the below steps:
• Load the data
• Initialise the value of k
• For getting the predicted class, iterate from 1 to total
number of training data points
– Calculate the distance between test data and each row of
training data. Here we will use Euclidean distance as our
distance metric since it’s the most popular method. The
other metrics that can be used are Chebyshev, cosine, etc.
– Sort the calculated distances in ascending order based on
distance values
– Get top k rows from the sorted array
– Get the most frequent class of these rows
– Return the predicted class

44
• Connectivity constraints
• scikit-learn also allows specifying a connectivity
matrix, which can be used as a constraint when
finding the clusters to merge. In this way, clusters
which are far from each other (non- adjacent in the
connectivity matrix) are skipped.
• A very common method for creating such a matrix
involves using the k-nearest neighbors graph
function (implemented as kneighbors_graph()), that
is based on the number of neighbors a sample has
(according to a specific metric). In the following
example, we consider a circular dummy dataset
(often used in the official documentation also):
45
• from sklearn.datasets import make_circles
>>> nb_samples = 3000
>>> X, _ = make_circles(n_samples=nb_samples,
noise=0.05) 
• A graphical representation is shown in the following
figure:

46
• We start with unstructured agglomerative clustering
based on average linkage and impose 20 clusters:
>>> ac = AgglomerativeClustering(n_clusters=20,
linkage='average')
>>> ac.fit(X)
• In this case, we have used the method fit() because
the class AgglomerativeClustering, after being
trained, exposes the labels (cluster number) through
the instance variable labels_ and it's easier to use
this variable when the number of clusters is very
high.
• A graphical plot of the result is shown in the
following figure:
47
48
• Now we can try to impose a constraint with different
values for k:
• from sklearn.neighbors import kneighbors_graph
>>> acc = []
>>> k = [50, 100, 200, 500]
>>> for i in range(4):
>>> kng = kneighbors_graph(X, k[i])
>>> ac1 = AgglomerativeClustering(n_clusters=20,
connectivity=kng, linkage='average')
>>> ac1.fit(X)
>>> acc.append(ac1)
• The resulting plots are shown in the following
screenshot: 49
50
• Recommendation Systems
• The basic concepts are users, items, and
ratings (or an implicit feedback about the
products, like the fact of having bought them).
Every model must work with known data (like
in a supervised scenario), to be able to suggest
the most suitable items or to predict the
ratings for all the items not evaluated yet.
• We're going to discuss two different kinds of
strategies:
• User or content based
• Collaborative filtering 51
• The first approach is based on the information
we have about users or products and its target
is to associate a new user with an existing
group of peers to suggest all the items
positively rated by the other members, or to
cluster the products according to their
features and propose a subset of items similar
to the one taken into account.
• The second approach, which is a little bit more
sophisticated, works with explicit ratings and
its purpose is to predict this value for every
item and every user.
52
• Even if collaborative filtering needs more
computational power as, nowadays, the great
availability of cheap resources, allows using this
algorithm with millions of users and products to
provide the most accurate recommendations in
real-time.
• The model can also be retrained or updated
every day.
• Naive user-based systems
• In this first scenario, we assume that we have
a set of users represented by feature vectors:

53
• we have a set of items:
• Let's assume also that there is a relation which
associates each user with a subset of items
(bought or positively reviewed), items for
which an explicit action or feedback has been
performed:

• In a user-based system, the users are periodically


clustered (normally using a k-nearest neighbors
approach), and therefore, considering a generic user
u (also new), we can immediately determine the ball
containing all the users who are similar (therefore
neighbors) to our sample: 54
• At this point, we can create the set of suggested items using
the relation previously introduced:

• In other words, the set contains all the unique products


positively rated or bought by the neighborhood. I've used the
adjective naive because there's a similar alternative that
• User-based system implementation with scikit- learn
• For our purposes, we need to create a dummy
dataset of users and products:
•  

55
import numpy as np
>>> nb_users = 1000
>>> users = np.zeros(shape=(nb_users, 4)) 
>>> for i in range(nb_users):
>>> users[i, 0] = np.random.randint(0, 4)
>>> users[i, 1] = np.random.randint(0, 2)
>>> users[i, 2] = np.random.randint(0, 5)
>>> users[i, 3] = np.random.randint(0, 5)
We assume that we have 1,000 users with four
features represented by integer numbers bounded
between 0 and 4 or 5. It doesn't matter what they
mean; their role is to characterize a user and allow for
clustering of the set.
56
• For the products, we also need to create the
association:
>>> nb_product = 20
>>> user_products = np.random.randint(0,
nb_product, size=(nb_users, 5)) 
• We assume that we have 20 different items (from 1 to 20; 0
means that a user didn't buy anything) and an association
matrix where each user is linked to a number of products
bounded between 0 and 5 (maximum). For example:

57
• At this point, we need to cluster the users using the
NearestNeighbors implementation provided by scikit-learn:
• from sklearn.neighbors import NearestNeighbors
>>> nn = NearestNeighbors(n_neighbors=20, radius=2.0)
>>> nn.fit(users)
• NearestNeighbors(algorithm='auto', leaf_size=30,
metric='minkowski', metric_params=None, n_jobs=1,
n_neighbors=20, p=2, radius=2.0) 
• We have selected to have 20 neighbors and a Euclidean
radius equal to 2. This parameter is used when we want to
query the model to know which items are contained in the
ball whose center is a sample and with a fixed radius. In our
case, we are going to query the model to get all the
neighbors of a test user:
•  
58
>>> test_user = np.array([2, 0, 3, 2])
>>> d, neighbors = nn.kneighbors(test_user.reshape(1,
-1))
>>> print(neighbors)
array([[933,67, 901, 208,23, 720, 121, 156, 167,
60, 337, 549, 93,563, 326, 944, 163, 436,
174,22]], dtype=int64) 
• Now we need to build the recommendation list using
the association matrix:

59
>>> suggested_products = []
>>> for n in neighbors:
>>> for products in user_products[n]:
>>> for product in products:
>>> if product != 0 and product not in
suggested_products:
>>> suggested_products.append(product) 
>>> print(suggested_products)
[14, 5, 13, 4, 8, 9, 16, 18, 10, 7, 1, 19, 12, 11, 6, 17, 15, 3,
2]
• For each neighbor, we retrieve the products he/she bought and perform a
union, avoiding the inclusion of items with zero value (meaning no product)
and double elements. The result is a list (not sorted) of suggestions that can
be obtained almost in real time for many different systems.
60
• Content-based systems
• This is probably the simplest method and it's based only
on the products, modeled as feature vectors:

• Just like the users, the features can also be categorical (indeed, for
products it's easier), for example, the genre of a book or a movie, and
they can be used together with numerical values (like price, length,
number of positive reviews, and so on) after encoding them.
• Then a clustering strategy is adopted, even if the most
used is k-nearest neighbors as it allows controlling the
size of each neighborhood to determine, given a sample
product, the quality and the number of suggestions.
•  
•  
61
• Using scikit-learn, first of all we create a dummy product
dataset:
>>> nb_items = 1000
>>> items = np.zeros(shape=(nb_items, 4))

• In this case, we have 1000 samples with four integer features


bounded between 0 and 100. Then we proceed, as in the
previous example, towards clustering them:
>>> nn = NearestNeighbors(n_neighbors=10, radius=5.0)
>>> nn.fit(items)
•   At this point, it's possible to query our model with the method
radius_neighbors(),
62
• which allows us to restrict our research only to a limited
subset. The default radius (set through the parameter radius)
is 5.0, but we can change it dynamically:
>>> test_product = np.array([15, 60, 28, 73])
>>> d, suggestions =
nn.radius_neighbors(test_product.reshape(1, -1), radius=20)
>>> print(suggestions)
[array([657, 784, 839, 342, 446, 196], dtype=int64)]
>>> d, suggestions =
nn.radius_neighbors(test_product.reshape(1, -1), radius=30)
>>> print(suggestions)
[ array([844, 340, 657, 943, 461, 799, 715, 863, 979, 784, 54, 148,
806,
• 465, 585, 710, 839, 695, 342, 881, 864, 446, 196, 73, 663, 580,
216], dtype=int64)]
63
• When clustering with k-nearest neighbors, it's
important to consider the metric adopted for
determining the distance between the
samples.
• The default for scikit-learn is the Minkowski
distance, which is a generalization of Euclidean
and Manhattan distance, and is defined as:

• The parameter p controls the type of distance


and the default value is 2, so that the resulting
metric is a classical Euclidean distance.
64
• Other distances are offered by SciPy (in the package
scipy.spatial.distance) and include, for example, the
Hamming and Jaccard distances.
• The former is defined as the disagree proportion
between two vectors (if they are binary this is the
normalized number of different bits). For example:
• from scipy.spatial.distance import hamming
>>> a = np.array([0, 1, 0, 0, 1, 0, 1, 1, 0, 0])
>>> b = np.array([1, 1, 0, 0, 0, 1, 1, 1, 1, 0])
>>> d = hamming(a, b)
>>> print(d) 0.40000000000000002

65
• It means there's a disagree proportion of 40 percent,
or, considering that both vectors are binary, there 4
different bits (out of 10). This measure can be useful
when it's necessary to emphasize the
presence/absence of a particular feature.
• The Jaccard distance is defined as:

• It's particularly useful to measure the dissimilarity


between two different sets (A and B) of items. If our
feature vectors are binary, it's immediate to apply
this distance using Boolean logic. Using the previous
test values, we get:
66
• from scipy.spatial.distance import jaccard 
>>> d = jaccard(a, b)
>>> print(d) 0.5714285714285714 
This measure is bounded between 0 (equal vectors) and 1 (total
dissimilarity). 
• As for the Hamming distance, it can be very useful when it's
necessary to compare items where their representation is made
up of binary states (like present/absent, yes/no, and so forth ). If
you want to adopt a different metric for k-nearest neighbors, it's possible to
specify it directly using the metric parameter:
>>> nn = NearestNeighbors(n_neighbors=10, radius=5.0,
metric='hamming')
>>> nn.fit(items) 
>>> nn = NearestNeighbors(n_neighbors=10, radius=5.0,
metric='jaccard')
>>> nn.fit(items)
67
• Model-free (or memory-based) collaborative
filtering
• As with the user-based approach, let's consider
having two sets of elements: users and items.
However, in this case, we don't assume that they
have explicit features. Instead, we try to model a
user-item matrix based on the preferences of each
user (rows) for each item (columns). For example:
•  

68
• In this case, the ratings are bounded between 1
and 5 (0 means no rating), and our goal is to
cluster the users according to their rating vector
(which is, indeed, an internal representation
based on a particular kind of feature). This allows
producing recommendations even when there
are no explicit pieces of information about the
user.
• However, it has a drawback, called cold-startup,
which means that when a new user has no
ratings, it's impossible to find the right
neighborhood, because he/she can belong to
virtually any cluster.
69
• In order to build the model, we first need to define the user-item
matrix as a Python dictionary with the structure:
• { user_1: { item1: rating, item2: rating, ... }, ..., user_n: ...
}
• A missing value in a user internal dictionary means no rating. In
our example, we consider 5 users with 5 items:
• from scikits.crab.models import
MatrixPreferenceDataModel
>>> user_item_matrix = {
1: {1: 2, 2: 5, 3: 3},
2: {1: 5, 4: 2},
3: {2: 3, 4: 5, 3: 2},
4: {3: 5, 5: 1},
5: {1: 3, 2: 3, 4: 1, 5: 3} }  
70
>>> model =
MatrixPreferenceDataModel(user_item_matrix)
• Once the user-item matrix has been defined, we
need to pick a metric and therefore, a distance
function d(ui, uj), to build a similarity matrix:

• from scikits.crab.similarities import UserSimilarity


from scikits.crab.metrics import
euclidean_distances
• >>> similarity_matrix = UserSimilarity(model,
euclidean_distances)
71
• At this point, it's possible to build the recommendation system
(based on the k-nearest neighbors clustering method) and test it:
• from scikits.crab.recommenders.knn import
UserBasedRecommender
>>> recommender = UserBasedRecommender(model,
similarity_matrix, with_preference=True)>>>
print(recommender.recommend(2))
• [(2, 3.6180339887498949), (5, 3.0), (3,
2.5527864045000417)]
• So the recommender suggests the following predicted rating for
user 2:
• Item 2: 3.6 (which can be rounded to 4.0)
• Item 5: 3
• Item 3: 2.5 (which can be rounded to 3.0)

72
• Model-based collaborative filtering
• This is currently one of the most advanced
approaches and is an extension of what was already
seen in the previous section. The starting point is
always a rating-based user-item matrix:

• However, in this case, we assume the presence of


latent factors for both the users and the items. In
other words, we define a generic user as:

73
• A generic item is defined as:

• We don't know the value of each vector component


(for this reason they are called latent), but we
assume that a ranking is obtained as:

• So we can say that a ranking is obtained from a


latent space of rank k, where k is the number of
latent variables we want to consider in our model. In
general, there are rules to determine the right value
for k, so the best approach is to check different
values and test the model with a subset of known
ratings.
74
• However, there's still a big problem to solve:
finding the latent variables. There are several
strategies, but before discussing them, it's
important to understand the dimensionality of
our problem.
• If we have 1000 users and 500 products, M has
500,000 elements. If we decide to have rank
equal to 10, it means that we need to find
5000000 variables constrained by the known
ratings. As you can imagine, this problem can
easily become impossible to solve with standard
approaches and parallel solutions must be
employed.
75
• Singular Value Decomposition strategy
• Singular value decomposition is a method of
decomposing a matrix into three other matrices: fig-
Equation-1

• Where:
• A is an m × n matrix
• U is an m × n orthogonal matrix
• S is an n × n diagonal matrix
• V is an n × n orthogonal matrix
• The reason why the last matrix is transposed will
become clear later on in the exposition. Also, the term,
“orthogonal,” 
76
• For the moment, we will assume that m ≥ n. What
happens when this isn’t true is quite interesting and
is one of the keys, in my opinion, to understanding
singular value decomposition.
• Now using summation notation. 

• Note how we’ve collapsed the diagonal matrix, S,


into a vector, thus simplifying the expression into a
single summation. The variables, {sᵢ}, are called
singular values and are normally arranged from
largest to smallest:
77
• The columns of U are called left singular
vectors, while those of V are called right singular
vectors.
• We know that U and V are orthogonal, that is:

• Where I is the identity matrix. Only the diagonals of


the identity matrix are 1, with all other values being
0. Note that because U is not square we cannot say
that U Transpose(U)=I, so U is only orthogonal in one
direction.
• Using the orthogonality property, we can rearrange
(1) into the following pair of eigenvalue equations:
78
• Numerical procedure
• Since Transpose(A)A is the same size or smaller
than A Transpose(A), a typical procedure is to plug Equation
(3) into an eigenvalue calculator to find V and S² and then
find U by projecting A onto V:

• Note that the method is completely


symmetric; U and V change places when A is transposed

79
• let's create an example using SciPy. The first thing to
do is to create a dummy user-item matrix:

80
• We're assuming that we have 20 users and 10
products. The ratings are bounded between 1 and 5,
and 0 means no rating. Now we can decompose M:
• from scipy.linalg import svd import numpy as np
>>> U, s, V = svd(M, full_matrices=True)
>>> S = np.diag(s)
>>> print(U.shape) (20L, 20L)
>>> print(S.shape) (10L, 10L)
>>> print(V.shape) (10L, 10L)
Now let's consider only the first eight singular values,
which will have eight latent factors for both the users
and items:
81
>>> Uk = U[:, 0:8]
>>> Sk = S[0:8, 0:8]
>>> Vk = V[0:8, :]
Bear in mind that in SciPy SVD implementation, V is
already transposed.
we can easily get a prediction considering the cosine similarity
(which is proportional to the dot product) between customers
and products. The two latent factor matrices are:

In order to take into account the loss of precision, it's useful


also to consider the average rating per user (which
corresponds to the mean row value of the user-item matrix),
so that the result rating prediction for the user i and the item j
becomes: 82
• Here SU(i) and SI(j) are the user and product
vectors respectively. Continuing with our
example, let's determine the rating prediction
for user 5 and item 2:
>>> Su = Uk.dot(np.sqrt(Sk).T)
>>> Si = np.sqrt(Sk).dot(Vk).T
>>> Er = np.mean(M, axis=1)
>>> r5_2 = Er[5] + Su[5].dot(Si[2])
>>> print(r5_2) 2.38848720112

83
• Fundamental of Deep Network
• Deep learning is an aspect of artificial
intelligence (AI) that is concerned with
emulating the learning approach that
human beings use to gain certain types
of knowledge.
• At its simplest, deep learning can be
thought of as a way to automate 
predictive analytics.

84
• While traditional machine learning algorithms
are linear, deep learning algorithms are
stacked in a hierarchy of increasing complexity
and abstraction.
• To understand deep learning, imagine a
toddler whose first word is dog. The toddler
learns what a dog is (and is not) by pointing to
objects and saying the word dog. The parent
says, "Yes, that is a dog," or, "No, that is not a
dog." As the toddler continues to point to
objects, he becomes more aware of the
features that all dogs possess. 
85
• What the toddler does, without knowing it, is clarify a
complex abstraction (the concept of dog) by building
a hierarchy in which each level of abstraction is
created with knowledge that was gained from the
preceding layer of the hierarchy.
• How deep learning works
• Computer programs that use deep learning go
through much the same process. Each algorithm in
the hierarchy applies a nonlinear transformation on
its input and uses what it learns to create a statistical
model as output. Iterations continue until the output
has reached an acceptable level of accuracy. The
number of processing layers through which data
must pass is what inspired the label deep.
86
87
• In traditional machine learning, the learning
process is supervised and the programmer has to
be very, very specific when telling the computer
what types of things it should be looking for when
deciding if an image contains a dog or does not
contain a dog. This is a laborious process
called feature extraction and the computer's
success rate depends entirely upon the
programmer's ability to accurately define a feature
set for "dog." The advantage of deep learning is
that the program builds the feature set by itself
without supervision. Unsupervised learning is not
only faster, but it is usually more accurate.
88
• Initially, the computer program might be provided
with training data, a set of images for which a
human has labeled each image "dog" or "not dog"
with meta tags. The program uses the information it
receives from the training data to create a feature
set for dog and build a predictive model. In this
case, the model the computer first creates might
predict that anything in an image that has four legs
and a tail should be labeled "dog." Of course, the
program is not aware of the labels "four legs" or
"tail;" it will simply look for patterns of pixels in the
digital data. With each iteration, the predictive
model the computer creates becomes more
complex and more accurate.
89
90
• ecause this process mimics a system of human
neurons, deep learning is sometimes referred
to as deep neural learning or deep 
neural networking.
• Unlike the toddler, who will take weeks or
even months to understand the concept of
"dog," a computer program that uses deep
learning algorithms can be shown a training
set and sort through millions of images,
accurately identifying which images have dogs
in them within a few minutes.

91
• Using neural networks
• A type of advanced machine learning algorithm,
known as neural networks, underpins most deep
learning models. Neural networks come in several
different forms, including recurrent neural
networks, convolutional neural networks, artificial
neural networks and feedforward neural
networks, and each has their benefit for specific
use cases. However, they all function in
somewhat similar ways, by feeding data in and
letting the model figure out for itself whether it
has made the right interpretation or decision
about a given data element.
92
• Neural networks involve a trial-and-error process, so they
need massive amounts of data to train on. It's no coincidence
that neural networks became popular only after most
enterprises embraced big data analytics and accumulated
large stores of data. Because the model's first few iterations
involve somewhat-educated guesses on the contents of
image or parts of speech, the data used during the training
stage must be labeled so the model can see if its guess was
accurate.
• This means that, though many enterprises that use big data
have large amounts of data, unstructured data is less helpful.
Unstructured data can be analyzed by a deep learning model
once it has been trained and reaches an acceptable level of
accuracy, but deep learning models can't train on
unstructured data.

93
• Examples of deep learning applications
• Because deep learning models process
information in ways similar to the human
brain, models can be applied to many tasks
people do.
• Deep learning is currently used in most
common image recognition tools, NLP
processing and speech recognition software.
These tools are starting to appear in
applications as diverse as self-driving cars and
language translation services.
94
• Limitations of deep learning
• The biggest limitation of deep learning models is that
they learn through observations. This means they only
know what was in the data they trained on.
• If a user has a small amount of data or it comes from one
specific source that is not necessarily representative of
the broader functional area, the models will not learn in a
way that is generalizable.
• The issue of biases is also a major problem for deep
learning models. If a model trains on data that contains
biases, the model will reproduce those biases in its
predictions. This has been a vexing problem for deep
learning programmers because models learn to
differentiate based on subtle variations in data elements. 
95
• Weights
• Weights implement the product of an input i with a
weight value w, to produce an output o. Seems easy,
but addition and multiplications are at the hearth of
neural networks.

96
97
• They sum up some inputs i and use a non-
linearity (added as separate block) to output
(o) a non-liner version of the sum.
• Neurons use weights w to take inputs and
scale them, before summing the values.
• Identity layers
• Identity layers just pass the input to the
output. Seem pretty bare, but they are an
essential block of neural networks.

98
• Non-linearities
• http://pytorch.org/docs/master/nn.html#non-linear-activations
• They take the inner value of a neuron and transform
it with a non-liner function to produce a neuron
output.

• Linear layers
• http://pytorch.org/docs/master/nn.html#linear-layers
• These layers are an array of neurons. Each take
multiple inputs and produce multiple outputs.

99
• The input and output number is arbitrary and of uncorrelated
length.
• These building blocks are basically an array of linear
combinations of inputs scaled by weights. Weights multiply
inputs with an array of weight values, and they are usually
learned with a learning algorithm.

100
• Linear layers do not come with non-linearities, you
will have to add it after each layer. If you stack
multiple layers, you will need a non-linearity
between them, or they will all collapse to a single
linear layer.
• Convolutional layers
• http://pytorch.org/docs/master/nn.html#convolution-layers
• They are exactly like a linear layer, but each
output neuron is connected to a locally
constrained group of input neurons.

101
• This group is often called receptive-field, borrowing the name
from neuroscience. Convolutions can be performed in 1D, 2D,
3D… etc.
• If the inputs are correlated, then it makes more sense to look
at a group of inputs rather than a single value like in a linear
layer. Linear layer can be thought of a convolutions with 1
value per filters.

102
• Recurrent neural network
• http://pytorch.org/docs/master/nn.html#rnn
• Recurrent neural networks are a special kind of linear layer,
where the output of each neuron is fed back ad additional
input of the neuron, together with actual input.

103
• You can think of RNN as a combination of the
input at instant t and the state/output of the
same neuron at time t-1.
• RNN can remember sequences of values, since
they can recall previous outputs, which again
are linear combinations of their inputs.
• Attention modules
• Attention modules are simply a gating function
for memories. If you want to specify which
values in an array should be passed through
attention, you use a linear layer to gate each
input by some weighting function.
104
105
• Attention modules can be soft when the
weights are real-valued and the inputs are
thus multiplied by values.
• Attention is hard when weight are binary, and
inputs are either 0 or passing through.
Outputs are also called attention head
outputs.
• Memories
• A basic building block that saves multiple
values in a table.

106
107
• This is basically just a multi-dimensional array.
The table can be of any size and any
dimensions. It can also be composed of multiple
banks or heads. Generally memory have a write
function writes to all locations and reads from
all location. An attention-like module can focus
reading and writing to specific locations. See
attention module below.
• Residual modules
• Residual modules are a simple combination of
layers and pass-through layers or Identity layers
(ah-ah! told you they were important!).
108
O/P
109
• References
• https://www.analyticsvidhya.com/blog/tag/hierarchical-
clustering/
• https://www.saedsayad.com/clustering_hierarchical.htm
• http://www.improvedoutcomes.com/docs/WebSiteDocs/
Clustering/Agglomerative_Hierarchical_Clustering_Overview.htm
• https://www.saedsayad.com/data_mining_map.htm
• http://rosalind.info/glossary/euclidean-distance/
• https://blog.statsbot.co/singular-value-decomposition-tutorial-
52c695315254
• https://searchenterpriseai.techtarget.com/definition/deep-
learning-deep-neural-network
• https://github.com/mbadry1/DeepLearning.ai-Summary/tree/
master/1-%20Neural%20Networks%20and%20Deep%20Learning

110

You might also like