13: Clustering

Unsupervised learning - introduction

- Talk about clustering
  - Learning from unlabeled data
- Unsupervised learning
  - Useful to contrast with supervised learning
- Compare and contrast
  - Supervised learning
    - Given a set of labels, fit a hypothesis to it
  - Unsupervised learning
    - Try and determine structure in the data
    - A clustering algorithm groups data together based on data features
- What is clustering good for?
  - Market segmentation - group customers into different market segments
  - Social network analysis - e.g. Facebook "smartlists"
  - Organizing computer clusters and data centers for network layout and location
  - Astronomical data analysis - understanding galaxy formation

K-means algorithm

- Want an algorithm to automatically group the data into coherent clusters
- K-means is by far the most widely used clustering algorithm

Overview

- Take unlabeled data and group it into two clusters
- Algorithm overview
  - 1) Randomly allocate two points as the cluster centroids
    - Have as many cluster centroids as clusters you want (K cluster centroids, in fact)
    - In our example we just have two clusters
  - 2) Cluster assignment step
    - Go through each example and, depending on whether it is closer to the red or the blue centroid, assign it to one of the two clusters
    - To demonstrate this, we've gone through the data and "coloured" each point red or blue
  - 3) Move centroid step
    - Take each centroid and move it to the average of the correspondingly assigned data points
  - Repeat 2) and 3) until convergence
- More formal definition
  - Input:
    - K (number of clusters in the data)
    - Training set {x^(1), x^(2), x^(3), ..., x^(m)}
  - Algorithm:
    - Randomly initialize K cluster centroids μ_1, μ_2, ..., μ_K

```
Repeat {
    for i = 1 to m
        c^(i) := index (from 1 to K) of the cluster centroid closest to x^(i)
    for k = 1 to K
        μ_k := average (mean) of the points assigned to cluster k
}
```

- Loop 1
  - This inner loop repeatedly sets the c^(i) variable to the index of the centroid closest to x^(i)
    - i.e. take the i-th example, measure the squared distance to each cluster centroid, and assign c^(i) to the closest one: c^(i) = argmin_k ||x^(i) − μ_k||²
- Loop 2
  - Loops over each centroid and sets μ_k to the average (mean) of all the points associated with that centroid from loop 1
- What if there's a centroid with no data assigned to it?
  - Remove that centroid, so you end up with K−1 clusters
  - Or, randomly reinitialize it (not sure when that is preferable, though)
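The two loops translate almost line for line into code. Below is a minimal NumPy sketch of the algorithm as described above; the function name `kmeans`, the convergence check, and the choice to handle empty clusters via the "randomly reinitialize it" option are illustrative choices, not something the lecture specifies.

```python
import numpy as np

def kmeans(X, K, max_iters=100, seed=None):
    """Minimal K-means: alternate the cluster assignment and move-centroid steps."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    m = X.shape[0]
    # Randomly pick K training examples as the initial centroids
    centroids = X[rng.choice(m, size=K, replace=False)]
    for _ in range(max_iters):
        # Cluster assignment step: c[i] = index of the centroid closest to x(i)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        c = dists.argmin(axis=1)
        # Move centroid step: mu_k = mean of the points assigned to cluster k
        new_centroids = centroids.copy()
        for k in range(K):
            members = X[c == k]
            if len(members) > 0:
                new_centroids[k] = members.mean(axis=0)
            else:
                # Empty cluster: randomly reinitialize it (one option from the notes)
                new_centroids[k] = X[rng.integers(m)]
        if np.allclose(new_centroids, centroids):
            break  # centroids stopped moving: converged
        centroids = new_centroids
    return c, centroids
```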
K-means for non-separated clusters

- So far we've looked at K-means where we have well-defined clusters
- But often K-means is applied to datasets where there aren't well-defined clusters
  - e.g. T-shirt sizing

[Figure: T-shirt sizing - weight vs. height scatter plot with no obvious discrete groups]

- Say you want to have three sizes (S, M, L): how big do you make each of them?
  - One way would be to run K-means on this data

[Figure: the same weight vs. height data partitioned into three clusters]

- So K-means creates three clusters, even though they aren't really there
- Look at the first population of people
  - Try and design a small T-shirt which fits the 1st population
  - And so on for the other two
- This is an example of market segmentation
  - Build products which suit the needs of your subpopulations

K-means optimization objective

- Supervised learning algorithms have an optimization objective (cost function)
  - K-means does too
- K-means has an optimization objective like the supervised learning functions we've seen
  - Why is this good?
    - Knowing this is useful because it helps with debugging
    - Helps find better clusters
- While K-means is running we keep track of two sets of variables
  - c^(i) is the index of the cluster {1, 2, ..., K} to which x^(i) is currently assigned
    - i.e. there are m c^(i) values, as each example has a c^(i) value, and that value is one of the clusters (it can only take one of K different values)
  - μ_k is the cluster centroid associated with cluster k
    - Location of cluster centroid k
    - So there are K of these
    - These are the centroids which exist in the training data space
  - μ_c(i) is the cluster centroid of the cluster to which example x^(i) has been assigned
    - This is more for convenience than anything else
      - You could look up that example i is indexed to cluster j (using the c vector), where j is between 1 and K
      - Then look up the value associated with cluster j in the μ vector (i.e. what are the features associated with μ_j)
      - But instead, for easy description, we have this variable
    - Let's say x^(i) has been assigned to cluster 5
      - Means that c^(i) = 5 and μ_c(i) = μ_5
- Using this notation we can write the optimization objective:

    J(c^(1), ..., c^(m), μ_1, ..., μ_K) = (1/m) Σ_{i=1}^{m} ||x^(i) − μ_c(i)||²

  - i.e. the average squared distance between each training example x^(i) and the cluster centroid to which x^(i) has been assigned

[Figure: the distance between an example x^(i) and its assigned cluster centroid, shown as a red line]

  - When the example is very close to its cluster centroid, this value is small
  - When the cluster centroid is very far away from the example, the value is large
- This is sometimes called the distortion (or distortion cost function)
- So we are finding the values which minimize this function:

    min_{c^(1), ..., c^(m), μ_1, ..., μ_K} J(c^(1), ..., c^(m), μ_1, ..., μ_K)

- If we consider the K-means algorithm
  - The cluster assignment step is minimizing J(...) with respect to c^(1), ..., c^(m)
    - i.e. find the centroid closest to each example
    - Doesn't change the centroids themselves
  - The move centroid step
    - We can show this step is choosing the values of μ which minimize J(...) with respect to μ_1, ..., μ_K
  - So, we're partitioning the algorithm into two parts
    - First part minimizes the c variables
    - Second part minimizes the μ variables
- We can use this knowledge to help debug our K-means algorithm
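Since the distortion is the quantity every step of K-means should be driving down, it is handy as a standalone helper when debugging. A small sketch, assuming the `kmeans` function and data layout from the sketch above:

```python
import numpy as np

def distortion(X, c, centroids):
    """J(c, mu): average squared distance from each example to its assigned centroid."""
    X = np.asarray(X, dtype=float)
    return np.mean(np.sum((X - centroids[c]) ** 2, axis=1))
```

Because the assignment step minimizes J with respect to c and the move-centroid step minimizes it with respect to μ, J computed this way should never increase from one iteration to the next; if it does, there is a bug.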
Random initialization

- How we initialize K-means
  - And how to avoid local optima
- Considering the clustering algorithm, we never spoke about how we initialize the centroids
  - There are a few ways - one method is most recommended
- Have the number of centroids set to less than the number of examples (K < m) (if K > m we have a problem)
  - Randomly pick K training examples
  - Set μ_1 up to μ_K to these examples' values
- K-means can converge to different solutions depending on the initialization setup
  - Risk of local optima

[Figure: the same dataset clustered three ways - one global optimum and two local optima]

  - The local optima are valid convergences, but a local optimum is not the global one
- If this is a concern
  - We can do multiple random initializations
    - See if we get the same result - many identical results are likely to indicate a global optimum
- Algorithmically we can do this as follows (a code sketch follows this section):

```
For i = 1 to 100 {
    Randomly initialize K-means
    Run K-means; get c^(1), ..., c^(m), μ_1, ..., μ_K
    Compute the cost function (distortion) J(c^(1), ..., c^(m), μ_1, ..., μ_K)
}
```

- A typical number of times to initialize K-means is 50-1000
- For each of the 100 random initializations, run K-means to convergence, then compute the distortion on the resulting set of cluster assignments and centroids
- End with 100 ways of clustering the data
  - Pick the clustering which gave the lowest distortion
- If you're running K-means with 2-10 clusters, this can help find a better global optimum
  - If K is larger than 10, multiple random initializations are less likely to be necessary
    - The first solution is probably good enough (better granularity of clustering)
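A minimal sketch of that restart loop, reusing the `kmeans` and `distortion` helpers sketched above (both names are illustrative, not from the lecture):

```python
def kmeans_best_of(X, K, n_init=100, seed=None):
    """Run K-means n_init times; keep the run with the lowest distortion J."""
    best_J, best_c, best_centroids = float("inf"), None, None
    for i in range(n_init):
        # Each run starts from a different random initialization
        c, centroids = kmeans(X, K, seed=None if seed is None else seed + i)
        J = distortion(X, c, centroids)
        if J < best_J:
            best_J, best_c, best_centroids = J, c, centroids
    return best_J, best_c, best_centroids
```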
How do we choose the number of clusters?

- Choosing K?
  - There is no great way to do this automatically
  - Normally you use visualizations to do it manually
- What are the intuitions regarding the data?
- Why is this hard?
  - Sometimes it is very ambiguous
    - e.g. two clusters or four clusters
    - There is not necessarily a correct answer
  - This is why doing it automatically is hard

Elbow method

- Vary K and compute the cost function over a range of K values
- As K increases, the minimum value of J(...) should decrease (i.e. you decrease the granularity, so centroids can better optimize)
  - Plot this (K vs. J())
- Look for the "elbow" on the graph

[Figure: cost function J plotted against K (no. of clusters), with an "elbow" visible around K = 3]

- Choose the "elbow" number of clusters
- If you get a nice plot this is a reasonable way of choosing K
- Risks
  - Normally you don't get a nice curve -> no clear elbow
  - Then it is not really that helpful
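A sketch of how such a plot might be produced with the helpers above (matplotlib is assumed for plotting; `X` stands for the data matrix from the earlier sketches):

```python
import matplotlib.pyplot as plt

Ks = range(1, 9)
# Best-of-several-restarts distortion for each candidate K
Js = [kmeans_best_of(X, K, n_init=50)[0] for K in Ks]

plt.plot(Ks, Js, marker="o")
plt.xlabel("K (no. of clusters)")
plt.ylabel("Cost function J")
plt.title("Elbow method")
plt.show()
```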
Another method for choosing K

- Using K-means for market segmentation
- Run K-means for a later/downstream purpose
  - See how well different numbers of clusters serve your later needs
- e.g. the T-shirt size example
  - You could have three sizes (S, M, L)
  - Or five sizes (XS, S, M, L, XL)
  - Run K-means with K = 3 and with K = 5

[Figure: T-shirt sizing data (weight vs. height) clustered with K = 3 and with K = 5]

- This gives a way to choose the number of clusters
  - Could consider the cost of making extra sizes vs. how well distributed the products are
  - How important are those extra sizes, though? (e.g. more sizes might make the customers happier)
  - So the applied problem may help guide the number of clusters