CZ4032 Data Analytics & Mining Notes

This document discusses various techniques for classification and clustering in machine learning. It covers topics like overfitting and underfitting in decision trees, instance-based classifiers like k-nearest neighbors, Bayesian classifiers, support vector machines, artificial neural networks, and ensemble methods. It also discusses different types of clustering like partitional, hierarchical, density-based clustering and the k-means clustering algorithm.

Lecture 4 – Classification: Alternative Techniques

Underfitting – when the model is too simple, both training and test errors are large

 Insufficient Examples:
o Lack of data points in a region of the attribute space makes it difficult to
predict correctly the class labels of that region
o An insufficient number of training records in the region causes the decision tree
to predict the test examples using other training records that are irrelevant
to the classification task

Overfitting – decision trees that are more complex than necessary

 The learning algorithm has access only to the training set during model building. It
has no knowledge of the test set
 Solution 1: Pre-Pruning
o Stop the algorithm before it becomes a fully-grown tree
o Typical stop conditions:
 All instances belong to the same class
 Attribute values are the same
o Restrictive conditions:
 Number of instances is less than some user-specified threshold
 Class distribution of instances is independent of the available
features
 Expanding the current node does not improve impurity measures
 Solution 2: Post-pruning
o Grow decision tree to its entirety
o Trim the nodes of decision tree in a bottom-up fashion
o If generalization error improves after trimming, replace sub-tree by a leaf
node
o Class label of a leaf node is determined from majority class of instances in the
sub-tree
o Can use Minimum Description Length (MDL) for post pruning
Estimating Generalization Errors

Occam’s Razor: Given two models of similar generalization errors, one should prefer the
simpler model over the more complex model

 For complex models, there is a greater chance that the model was fitted accidentally
to errors in the data
 Minimum Description Length:
o Cost (Model, Data) = Cost (Data|Model) + Cost (Model)
 Cost: number of bits needed for encoding
 Search for the least costly model
o Cost (Data|Model): misclassification errors
o Cost (Model): node encoding (number of children) + splitting condition
encoding
Instance Based Classifiers:

 Store the training records (without building explicit models)


 Use training records directly to predict the class label of unseen cases
 Rote Learner: memorizes entire training data and performs classification only if
attributes of record match one of the training examples exactly
 K-Nearest Neighbours (k-NN): Uses k “closest” points (nearest neighbours) for
performing classification
Nearest-Neighbour Classifier:

 Things Required:
o Set of stored training records
o Distance metric to compute distance between records
o Value of k, the number of nearest neighbours to retrieve
 If k is too small, sensitive to noise points
 If k is too large, neighbourhood may include points from other classes
 To classify an unknown record:
o Compute distance to other training records
o Identify k nearest neighbours
o Use the class labels of the nearest neighbours to determine the class label of the
unknown record, e.g. by majority vote (a minimal sketch follows this list)
 Computation:
o Euclidean distance: d(p, q) = √( Σi (pi − qi)² )
o Determine the class from nearest neighbour list:


 Take majority vote of class labels among the k-nearest neighbours
 Weigh the vote according to distance (weight factor w = 1/d²)
 1 nearest-neighbour:
o Voronoi Diagram: boundaries denote 1-neighborhoods
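A minimal sketch of the k-NN procedure above, using Euclidean distance and either a plain majority vote or the w = 1/d² distance weighting; the function names and toy records are illustrative, not from the lecture.

```python
import math
from collections import defaultdict

def euclidean(p, q):
    # d(p, q) = sqrt( sum_i (p_i - q_i)^2 )
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

def knn_predict(train, query, k=3, weighted=False):
    """train: list of (feature_vector, label) pairs; query: feature vector."""
    # 1. Compute the distance from the query to every stored training record
    neighbours = sorted(train, key=lambda rec: euclidean(rec[0], query))[:k]
    # 2. Vote among the k nearest neighbours: majority, or weighted by w = 1/d^2
    votes = defaultdict(float)
    for features, label in neighbours:
        d = euclidean(features, query)
        votes[label] += 1.0 / (d ** 2 + 1e-9) if weighted else 1.0
    return max(votes, key=votes.get)

# toy usage
train = [((1.0, 1.0), 'A'), ((1.2, 0.8), 'A'), ((5.0, 5.0), 'B'), ((5.5, 4.8), 'B')]
print(knn_predict(train, (1.1, 0.9), k=3))                 # -> 'A'
print(knn_predict(train, (4.9, 5.1), k=3, weighted=True))  # -> 'B'
```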

Bayesian Classifier: A probabilistic framework for classification problems


Naïve Bayes Classifier

 If one of the conditional probabilities is zero, then the entire expression becomes
zero (in practice this is avoided with smoothed estimates such as the Laplace or m-estimate correction)

Estimate Probabilities from Data:

 For continuous attributes:


o Discretize the range into bins
 One ordinal attribute per bin
 Violates independence assumption
o Two-way split: (A<v) or (A>v)
 Choose only one of the two splits as new attribute
o Probability density estimation:
 Assume attribute follows a normal distribution
 Use data to estimate parameters of distribution (e.g. mean and
standard deviation)
 Once probability distribution is known, can use it to estimate the
conditional probability P(Ai|c)
 Normal Distribution: P(Ai | cj) = (1 / √(2πσij²)) · exp( −(Ai − μij)² / (2σij²) ),
with μij and σij² estimated from the training records of class cj
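A minimal sketch of estimating P(Ai | c) for a continuous attribute under the normal-distribution assumption above; the sample attribute values and function names are illustrative.

```python
import math

def gaussian_pdf(x, mean, std):
    # P(Ai | c) under the normal assumption:
    # (1 / sqrt(2*pi*sigma^2)) * exp(-(x - mu)^2 / (2*sigma^2))
    return math.exp(-((x - mean) ** 2) / (2 * std ** 2)) / math.sqrt(2 * math.pi * std ** 2)

# Illustrative training values of a continuous attribute for one class c
values_for_class_c = [125.0, 100.0, 70.0, 120.0, 95.0, 60.0, 220.0, 85.0, 75.0, 90.0]
mu = sum(values_for_class_c) / len(values_for_class_c)
var = sum((v - mu) ** 2 for v in values_for_class_c) / (len(values_for_class_c) - 1)
sigma = math.sqrt(var)

# Estimated conditional probability density of observing Ai = 120 given class c
print(gaussian_pdf(120.0, mu, sigma))
```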
Support Vector Machines

 To know which line is better, we want to find a hyperplane that maximizes the
margin:

 If the problem is not linearly separable: introduce slack variables ξi that allow
some margin violations and penalize them in the objective function

Nonlinear Support Vector Machines

 Transform data into higher dimensional space


 Using “Kernel Trick”, actual transformation need not be known
 Just compute similarity between two vectors in original space
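A minimal sketch of the kernel trick for a degree-2 polynomial kernel: the dot product in the transformed space equals (x·y + 1)², which can be computed directly from the original vectors. The explicit feature map is shown only to verify the equality; names are illustrative.

```python
import numpy as np

def explicit_map(x):
    # Explicit degree-2 polynomial feature map for 2-D input:
    # phi(x) = (x1^2, x2^2, sqrt(2)*x1*x2, sqrt(2)*x1, sqrt(2)*x2, 1)
    x1, x2 = x
    return np.array([x1**2, x2**2, np.sqrt(2)*x1*x2,
                     np.sqrt(2)*x1, np.sqrt(2)*x2, 1.0])

def poly_kernel(x, y, degree=2):
    # Same similarity computed in the ORIGINAL space: K(x, y) = (x.y + 1)^degree
    return (np.dot(x, y) + 1.0) ** degree

x, y = np.array([1.0, 2.0]), np.array([3.0, -1.0])
print(np.dot(explicit_map(x), explicit_map(y)))  # dot product in transformed space
print(poly_kernel(x, y))                         # identical value, no explicit transformation
```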
Artificial Neural Networks (ANN)

 Perceptron Model is an assembly of inter-connected nodes and weighted links


 The output node sums up its input values, weighted by the weights of its links
 The weighted sum is compared against some threshold t

 Algorithm for learning ANN:


o Initialize the weights (w0, w1, w2, …, wk)
o Adjust the weights in such a way that the output of ANN is consistent with
class labels of training examples
 Objective function: E(w) = Σi [ yi − f(w, xi) ]², the squared error between the
true labels yi and the network outputs

 Find the weights wi that minimize the above objective function


e.g. backpropagation algorithm
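A minimal sketch of perceptron weight learning as described above (initialize the weights, then adjust them so the output is consistent with the class labels); backpropagation generalizes this idea to multi-layer networks. The data and learning rate are illustrative.

```python
import numpy as np

def train_perceptron(X, y, epochs=20, lr=0.1):
    """X: (n, d) inputs; y: labels in {-1, +1}. Returns weights with bias folded in."""
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])    # prepend bias input x0 = 1
    w = np.zeros(Xb.shape[1])                        # initialize weights w0, w1, ..., wk
    for _ in range(epochs):
        for xi, yi in zip(Xb, y):
            pred = 1 if np.dot(w, xi) > 0 else -1    # compare weighted sum against threshold 0
            if pred != yi:
                # adjust the weights so the output becomes consistent with the class label
                w += lr * yi * xi
    return w

# linearly separable toy data: class is +1 roughly when x1 + x2 is large
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1], [2, 2], [0.2, 0.1]])
y = np.array([-1, -1, -1, 1, 1, -1])
w = train_perceptron(X, y)
print(w, [1 if np.dot(w, np.r_[1, x]) > 0 else -1 for x in X])
```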
Ensemble Methods:

 Construct a set of classifiers from the training data


 Predict the class label of previously unseen records by aggregating the predictions
made by multiple classifiers
 Assumption: Individual classifiers (voters) could be lousy (stupid), but the aggregate
(electorate) can usually classify (decide) correctly
 Bagging:
o Sampling with replacement
o Build classifier on each bootstrap sample
o Each record has probability 1 − (1 − 1/N)^N of being included in a bootstrap sample
 Tends to 1 − 1/e ≈ 0.632 for large N (see the sketch after this list)
 Boosting: An iterative procedure to adaptively change distribution of training data by
focusing more on previously misclassified records
o Initially, all N records are assigned equal weights
o Unlike bagging, sampling weights may change at the end of each boosting round
o Records that are wrongly classified will have their weights increased
o Records that are correctly classified will have their weights decreased
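A minimal sketch of bootstrap sampling for bagging, empirically checking that each record is included in a bootstrap sample with probability 1 − (1 − 1/N)^N ≈ 0.632 for large N; in bagging, a base classifier would then be built on each sample and their predictions aggregated by voting. Names are illustrative.

```python
import random

def bootstrap_sample(records):
    # Sampling WITH replacement: the sample has the same size as the original data
    n = len(records)
    return [records[random.randrange(n)] for _ in range(n)]

N = 1000
data = list(range(N))
fractions = []
for _ in range(200):                        # draw 200 bootstrap samples
    sample = bootstrap_sample(data)
    fractions.append(len(set(sample)) / N)  # fraction of distinct records selected

print(sum(fractions) / len(fractions))      # close to the theoretical value below
print(1 - (1 - 1 / N) ** N)                 # 1 - (1 - 1/N)^N, about 0.632 for large N
```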
Lecture 6 – Cluster Analysis
Cluster Analysis: Finding groups of objects such that the objects in a group will be similar to
one another and different from the objects in other groups.

 Types of Clustering:
o Partitional Clustering: A division of data objects into non-overlapping subsets
(clusters) such that each data object is in exactly one subset
o Hierarchical Clustering: A set of nested clusters organized as a hierarchical
tree
o Contiguous Cluster (Nearest neighbour or Transitive): A cluster is a set of
points such that a point in a cluster is closer to one or more other points in
the cluster than to any point not in the cluster
o Density-based: A cluster is a dense region of points, separated by low-density
regions from other regions of high density
 Used when clusters are irregular or intertwined, and when noise and
outliers are present
o Conceptual Cluster: Finds clusters that share some common property or
represent a particular concept
 Other Distinctions:
o Exclusive versus non-exclusive
o Fuzzy vs non-fuzzy
o Partial versus complete
o Heterogeneous versus homogeneous

Partitional Clustering vs. Hierarchical Clustering (illustrative figures omitted)

 K-means Clustering:
o Partitional clustering approach
o Each cluster is associated with a centroid
o Each point is assigned to the cluster with the closest centroid (Euclidean
distance, cosine similarity, etc.)
o Number of clusters, K, must be specified
o Complexity: O(n · K · I · d), where n = number of points, K = number of clusters,
I = number of iterations, d = number of attributes (a minimal implementation sketch
follows this list)
o The choice of initial centroids is extremely important
o Sum of Squared Error (SSE): For every point, the error is the distance to the
nearest cluster centroid: SSE = Σi Σx∈Ci dist(mi, x)², where mi is the centroid of
cluster Ci

o An easy way to reduce SSE is to increase K, the number of clusters


 A good clustering with smaller K can have a lower SSE than a poor
clustering with higher K
o Problem with Selecting Initial Points:
 If there are K ‘real’ clusters then the chance of selecting one centroid
from each cluster is small
 Chance is relatively small when K is large
 If the clusters are all the same size, n, the probability of picking exactly one
initial centroid from each cluster is P = K! · n^K / (Kn)^K = K! / K^K
o Solution:
 Multiple runs
 Sample and use hierarchical clustering to determine initial centroids
 Select more than k initial centroids and then select among these initial
centroids: Select more widely separated
 Postprocessing
 Bisecting K-means: Not as susceptible to initialization issues
o Handling Empty Clusters:
 Choose the point that contributes most to SSE
 Choose a point from the cluster with the highest SSE
 If there are several empty clusters, the above can be repeated several
times
o Updating Centres Incrementally after each assignment:
 Each assignment updates zero or two centroids
 More expensive
 Introduces an order dependency
 Never get an empty cluster
 Can use “weights” to change the impact
o Pre-Processing:
 Normalize the data
 Eliminate outliers
o Post-processing:
 Eliminate small clusters that may represent outliers
 Split ‘loose’ clusters, i.e. clusters with relatively high SSE
 Merge clusters that are ‘close’ and that have relatively low SSE
 Can use these steps during the clustering process: ISODATA (Iterative
Self-Organizing Data Analysis)
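A minimal sketch of the basic K-means loop with the SSE measure from this section; initial centroids are chosen at random, so multiple runs (as suggested above) are advisable. Function names and data are illustrative.

```python
import random
import math

def dist2(p, q):
    # squared Euclidean distance
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, iters=100):
    centroids = random.sample(points, k)                 # random initial centroids
    for _ in range(iters):
        # assignment step: each point goes to the cluster with the closest centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda c: dist2(p, centroids[c]))
            clusters[i].append(p)
        # update step: recompute each centroid as the mean of its cluster
        new_centroids = []
        for c, cl in zip(centroids, clusters):
            if cl:
                new_centroids.append(tuple(sum(x) / len(cl) for x in zip(*cl)))
            else:
                # keep the old centroid if a cluster becomes empty
                # (the notes suggest instead picking a point that contributes most to SSE)
                new_centroids.append(c)
        if new_centroids == centroids:                   # converged: assignments are stable
            break
        centroids = new_centroids
    sse = sum(dist2(p, centroids[i]) for i, cl in enumerate(clusters) for p in cl)
    return centroids, clusters, sse

points = [(1, 1), (1.5, 2), (3, 4), (5, 7), (3.5, 5), (4.5, 5), (3.5, 4.5)]
centroids, clusters, sse = kmeans(points, k=2)
print(centroids, sse)
```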
Bisecting K-means algorithm: Variant of K-means that can produce a partitional or a
hierarchical clustering

 Limitations:
o Problems when clusters are of differing sizes, densities, non-globular shapes
o Data contains outliers

Hierarchical Clustering

 Produces a set of nested clusters organized as a hierarchical tree


 Space: O(N²), where N is the number of points; Time: O(N³), since there are N steps
and at each step the N² proximity matrix must be updated and searched
 Can be visualized as a dendrogram
o A tree-like diagram that records the sequences of merges or splits

 Strengths:
o Do not have to assume any particular number of clusters: Any desired
number of clusters can be obtained by ‘cutting’ the dendrogram at the
proper level
o Correspond to meaningful taxonomies
 Types of hierarchical clustering:
o Agglomerative:
 Start with points as individual clusters
 At each step, merge the closest pair of clusters until only one cluster
(or k clusters) left

o Divisive:
 Start with one, all-inclusive cluster
 At each step, split a cluster until each cluster contains a single point
 Inter-Cluster Similarity

MIN (single link)
- Strength: can handle non-elliptical shapes
- Limitation: sensitive to noise and outliers

MAX (complete link)
- Strength: less susceptible to noise and outliers
- Limitations: tends to break large clusters; biased towards globular clusters

Group Average
- Strength: less susceptible to noise and outliers
- Limitation: biased towards globular clusters

Distance between Centroids
- an alternative inter-cluster proximity measure

(A minimal agglomerative clustering sketch comparing these linkages follows Ward's Method below.)
 Cluster Similarity: Ward’s Method
o Based on the increase in squared error when two clusters are merged:
o Less susceptible to noise and outliers
o Biased towards globular clusters
o Hierarchical analogue of K-means
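A minimal sketch of agglomerative hierarchical clustering: start with every point as its own cluster and repeatedly merge the closest pair of clusters until k clusters remain, with MIN (single link), MAX (complete link) or group-average inter-cluster similarity selectable. Names and data are illustrative.

```python
import math

def cluster_distance(c1, c2, linkage="min"):
    # all pairwise point distances between the two clusters
    ds = [math.dist(p, q) for p in c1 for q in c2]
    if linkage == "min":        # single link
        return min(ds)
    if linkage == "max":        # complete link
        return max(ds)
    return sum(ds) / len(ds)    # group average

def agglomerative(points, k, linkage="min"):
    clusters = [[p] for p in points]          # start with points as individual clusters
    while len(clusters) > k:
        # find and merge the closest pair of clusters
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = cluster_distance(clusters[i], clusters[j], linkage)
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] += clusters[j]
        del clusters[j]
    return clusters

points = [(0, 0), (0, 1), (1, 0), (8, 8), (8, 9), (9, 8)]
print(agglomerative(points, k=2, linkage="min"))
```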

Minimum Spanning Tree (MST): Divisive Hierarchical Clustering

 Build MST
o Start with a tree that consists of any point
o In successive steps, look for the closest pair of points (p, q) such that one
point (p) is in the current tree but the other (q) is not
o Add q to the tree and put an edge between p and q

 Use MST for constructing hierarchy of clusters
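A minimal sketch of the MST construction described above (grow the tree by repeatedly attaching the closest point outside it); a divisive hierarchy is then obtained by cutting the longest edges. Names and data are illustrative.

```python
import math

def build_mst(points):
    """Return MST edges as (distance, p_index, q_index) tuples."""
    n = len(points)
    in_tree = {0}                    # start with a tree that consists of any point
    edges = []
    while len(in_tree) < n:
        # closest pair (p, q) such that p is in the current tree and q is not
        d, p, q = min(
            (math.dist(points[p], points[q]), p, q)
            for p in in_tree for q in range(n) if q not in in_tree
        )
        edges.append((d, p, q))      # add q to the tree and put an edge between p and q
        in_tree.add(q)
    return edges

points = [(0, 0), (0, 1), (5, 5), (5, 6), (10, 0)]
mst = build_mst(points)
print(sorted(mst, reverse=True))     # cutting the longest edges yields a divisive hierarchy
```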


Lecture 7 – Cluster Analysis: Alternative Methods

DBSCAN: density-based algorithm


 Density = number of points within a specified radius (Eps)
 A point is a core point if it has more than a specified number of points (MinPts)
within Eps (including itself)
o These are points that are at the interior of a cluster
 A border point is not a core point but is in the neighbourhood of a core point
 A noise point is any point that is not a core point or a border point

 DBSCAN Algorithm
o Eliminate noise points
o Perform clustering on the remaining points (see the sketch after this list)
 Two core points within Eps of each other are put into the same cluster
 Border points are assigned to the cluster of one of their core points
 To determine Eps and MinPts:
o For points in a cluster, their kth nearest neighbours are at roughly the same
distance
o Noise points have their kth nearest neighbour at a farther distance
o So, plot the sorted distance of every point to its kth nearest neighbour and choose
Eps around the point where the curve rises sharply
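A minimal sketch of labelling points as core, border or noise using the Eps/MinPts definitions above (here a point is core if its Eps-neighbourhood, counting itself, holds at least MinPts points); a full DBSCAN run would then connect core points within Eps of each other into clusters. Parameters and data are illustrative.

```python
import math

def label_points(points, eps, min_pts):
    labels = {}
    # Eps-neighbourhood of each point (including the point itself)
    neigh = {i: [j for j in range(len(points)) if math.dist(points[i], points[j]) <= eps]
             for i in range(len(points))}
    core = {i for i, ns in neigh.items() if len(ns) >= min_pts}
    for i in range(len(points)):
        if i in core:
            labels[i] = "core"                 # interior of a cluster
        elif any(j in core for j in neigh[i]):
            labels[i] = "border"               # not core, but near a core point
        else:
            labels[i] = "noise"                # neither core nor border
    return labels

points = [(0, 0), (0, 0.5), (0.5, 0), (0.4, 0.4), (3, 3), (0.9, 0.9)]
print(label_points(points, eps=1.0, min_pts=3))
```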

Clustering Using REpresentatives (CURE)

 Representative points are found by selecting a constant number of points from a
cluster and then “shrinking” them toward the centre of the cluster
 Cluster similarity is the similarity of the closest pair of representative points from
different clusters
 Shrinking representative points toward the centre helps avoid problems with noise
and outliers
 CURE is better able to handle clusters of arbitrary shapes and sizes

Graph-Based Clustering
 Graph-Based clustering uses the proximity graph
o Start with the proximity matrix
o Consider each point as a node in a graph
o Each edge between two nodes has a weight which is the proximity between
the two points
 A graph, G = (V, E) representing a set of objects (V) and their relations
(E)
 Initially the proximity graph is fully connected
 MIN (single-link) and MAX (complete-link) can be viewed as starting
with this graph
o Clusters are connected components in the graph
 Clustering may work better after sparsifying the proximity graph:
o Sparsification techniques keep the connections to the most similar (nearest)
neighbour of a point while breaking the connections to less similar points
o The nearest neighbours of a point tend to belong to the same class as the
point itself
o This reduces the impact of noise and outliers and sharpens the distinction
between clusters
 Sparsification facilitates the use of graph partitioning algorithms (Chameleon &
Hypergraph-based Clustering)

o Chameleon: Clustering using Dynamic Modelling


 Adapt to the characteristics of the data set to find the natural clusters
 Use a dynamic model to measure the similarity between clusters
 Relative closeness & relative inter-connectivity of the clusters
 Two clusters are combined if the resulting cluster shares
certain properties with the constituent clusters
 Merging scheme preserves self-similarity
 Used in spatial data:
 Clusters are defined as densely populated regions of
space
 Clusters have arbitrary shapes, orientation, and non-
uniform sizes
 Difference in densities across clusters and variation in
density within clusters
 Existence of special artefacts and noise
 Steps:
 Pre-processing Step: Represent the Data by a Graph
 Given a set of points, construct the k-nearest-
neighbour graph to capture the relationship between a
point and its k nearest neighbours
 Concept of neighbourhood is captured dynamically
 Phase 1: Use a multilevel graph partitioning algorithm on the
graph to find a large number of clusters of well-connected
vertices
 Each cluster should contain mostly points from one
“true” cluster
 Phase 2: Use Hierarchical Agglomerative Clustering to merge
sub-clusters: Two clusters are combined if the resulting cluster
shares certain properties with the constituent clusters
 Relative Interconnectivity: Absolute interconnectivity
of two clusters normalized by the internal connectivity
of the clusters
 Relative Closeness: Absolute closeness of two clusters
normalized by the internal closeness of the clusters

Shared Nearest Neighbour (SNN) graph: the weight of an edge is the number of shared
neighbours between the vertices, given that the vertices are connected (a minimal sketch
follows the SNN clustering steps below)

 Jarvis Patrick Clustering:


o K-nearest neighbours of all points are found (breaking all but the k strongest
links from a point to other points in the proximity graph)
o Pair of points is put in the same cluster if:
 Two points share more than T neighbours
 Two points are in each other’s k nearest neighbour list
 SNN Density-based Clustering Algorithm:
o Compute the similarity matrix
o Sparsify the similarity matrix by keeping only the k most similar neighbours
o Construct the shared nearest neighbour graph from the sparsified similarity
matrix
o Find the core points
o Form clusters from the core points
o Discard all the noise points
o Assign all non-noise, non-core points to clusters
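A minimal sketch of building an SNN graph: keep only the k nearest neighbours of each point, connect two points only if they appear in each other's k-NN lists (the Jarvis–Patrick condition), and weight each edge by the number of shared neighbours. Names and data are illustrative.

```python
import math

def knn_lists(points, k):
    # k nearest neighbours of every point (excluding the point itself)
    lists = {}
    for i, p in enumerate(points):
        others = sorted((j for j in range(len(points)) if j != i),
                        key=lambda j: math.dist(p, points[j]))
        lists[i] = set(others[:k])
    return lists

def snn_graph(points, k):
    nn = knn_lists(points, k)
    edges = {}
    for i in nn:
        for j in nn[i]:
            # keep the edge only if i and j are in each other's k-NN lists
            if i < j and i in nn[j]:
                edges[(i, j)] = len(nn[i] & nn[j])   # weight = number of shared neighbours
    return edges

points = [(0, 0), (0, 1), (1, 0), (1, 1), (5, 5), (5, 6)]
print(snn_graph(points, k=3))
```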

Cluster Validation:

 Determining the clustering tendency of a set of data, i.e. distinguishing whether non-
random structure actually exists in the data
 Compare the results of a cluster analysis to externally known results
 Evaluate how well the results of a cluster analysis fit the data without reference to
external information
 Comparing the results of two different sets of cluster analyses to determine which is
better
 Determining the ‘correct’ number of clusters
 Measures of Cluster Validity:
o External Index: extent to which cluster labels match externally supplied class
labels (e.g. entropy)
o Internal Index: goodness of a clustering structure without respect to external
information (Sum of Squared Error)
 Cluster Cohesion: Measures how closely related the objects in a cluster are
(e.g. SSE) – sum of the weights of all links within a cluster
 Cluster Separation: Measures how distinct or well-separated a cluster is from
other clusters (e.g. squared error) – sum of the weights of links between nodes in
the cluster and nodes outside the cluster

 Silhouette Coefficient: combines ideas of both cohesion and separation; for a point i,
s = (b − a) / max(a, b), where a is the average distance from i to points in its own
cluster and b is the minimum, over the other clusters, of the average distance from i
to points in that cluster

o Relative Index: Used to compare two different clusterings or clusters


o Correlation:
 Proximity Matrix
 Incidence Matrix: entry = 1 if associated pair of points belong to the
same cluster; else entry = 0
 Compute the correlation between the two matrices
 High correlation indicates that points that belong to the same cluster
are close to each other
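A minimal sketch of the correlation check above. Because the proximity matrix here stores distances (small = close), a good clustering gives a strongly negative correlation with the incidence matrix; using a similarity matrix instead would flip the sign. Data and labels are illustrative.

```python
import numpy as np

def incidence_matrix(labels):
    labels = np.asarray(labels)
    # entry = 1 if the associated pair of points belong to the same cluster, else 0
    return (labels[:, None] == labels[None, :]).astype(float)

def proximity_matrix(points):
    points = np.asarray(points, dtype=float)
    diff = points[:, None, :] - points[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))     # pairwise Euclidean distances

points = [(0, 0), (0.2, 0.1), (0.1, 0.3), (5, 5), (5.1, 4.9), (4.8, 5.2)]
labels = [0, 0, 0, 1, 1, 1]                      # a "good" clustering of these points

inc = incidence_matrix(labels)
prox = proximity_matrix(points)
iu = np.triu_indices(len(points), k=1)           # use only the upper-triangle entries
corr = np.corrcoef(inc[iu], prox[iu])[0, 1]
print(corr)   # strongly negative: same-cluster pairs have small distances
```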

Lecture 8 – Association Rule Mining


Association Rule Mining: Given a set of transactions, find rules that will predict the occurrence
of an item based on the occurrences of other items in the transaction
Frequent Itemset: An itemset whose support is greater than or equal to a minsup threshold
 Itemset: A collection of one or more items (e.g. {Milk, Bread, Diaper})
 K-itemset: An itemset that contains k items
Support count: Frequency of occurrence of an itemset
Support: Fraction of transactions that contain an itemset
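A minimal sketch of computing support counts and support for k-itemsets over a small set of transactions, then keeping only the itemsets that meet a minsup threshold; the transactions and threshold are illustrative.

```python
from itertools import combinations
from collections import Counter

transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support_counts(transactions, k):
    # support count = number of transactions containing the k-itemset
    counts = Counter()
    for t in transactions:
        for itemset in combinations(sorted(t), k):
            counts[itemset] += 1
    return counts

minsup = 0.6                                  # minimum support threshold (fraction)
n = len(transactions)
counts = support_counts(transactions, 2)
frequent = {s: c / n for s, c in counts.items() if c / n >= minsup}
print(frequent)   # e.g. ('Bread', 'Milk') appears in 3 of 5 transactions -> support 0.6
```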
