Lecture 1 PDF

This document provides an overview of cluster analysis. It discusses what cluster analysis is, common applications, requirements and challenges. It also summarizes typical clustering methodologies, how to cluster different data types, and how users can provide insights to clustering through visualization, semi-supervised learning, multi-view clustering and validation.

Uploaded by

MUKTAR REZVI


Lecture 1. Cluster Analysis: An Introduction
 What Is Cluster Analysis?
 Applications of Cluster Analysis
 Cluster Analysis: Requirements and Challenges
 Cluster Analysis: A Multi-Dimensional Categorization
 An Overview of Typical Clustering Methodologies
 An Overview of Clustering Different Types of Data
 An Overview of User Insights and Clustering
 Summary
Session 1: What Is Cluster Analysis?
What Is Cluster Analysis?
 What is a cluster?
 A cluster is a collection of data objects which are
 Similar (or related) to one another within the same group (i.e., cluster)
 Dissimilar (or unrelated) to the objects in other groups (i.e., clusters)
 Cluster analysis (or clustering, data segmentation, …)
 Given a set of data points, partition them into a set of groups (i.e., clusters) such that the points within each group are as similar as possible
 Cluster analysis is unsupervised learning (i.e., no predefined classes)
 This contrasts with classification (i.e., supervised learning)
 Typical ways to use/apply cluster analysis
 As a stand-alone tool to get insight into data distribution, or
 As a preprocessing (or intermediate) step for other algorithms
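As a rough illustration of partitioning points into groups of similar objects, here is a minimal sketch of k-means, one of the partitioning algorithms named later in this lecture. The function name `kmeans`, the sample points, and the hand-picked starting centroids are assumptions made for this example, not part of the original slides.

```python
import math

def kmeans(points, centroids, iterations=10):
    """Basic k-means loop on tuples of floats; returns (labels, centroids)."""
    centroids = list(centroids)
    k = len(centroids)
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [min(range(k), key=lambda c: math.dist(p, centroids[c]))
                  for p in points]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return labels, centroids

# Hypothetical data: two visibly separated groups of 2-D points.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.8, 8.1)]
# Starting centroids are chosen by hand here (an assumption) to keep the run deterministic.
labels, centroids = kmeans(points, centroids=[(0.0, 0.0), (10.0, 10.0)])
print(labels)
```

No class labels are supplied anywhere in this sketch, which is what makes the task unsupervised: the grouping comes only from the distances between points.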

Session 2: Applications of Cluster Analysis
Cluster Analysis: Applications
 A key intermediate step for other data mining tasks
 Generating a compact summary of data for classification, pattern discovery,
hypothesis generation and testing, etc.
 Outlier detection: Outliers—those “far away” from any cluster
 Data summarization, compression, and reduction
 Ex. Image processing: Vector quantization
 Collaborative filtering, recommendation systems, or customer segmentation
 Find like-minded users or similar products
 Dynamic trend detection
 Clustering stream data and detecting trends and patterns
 Multimedia data analysis, biological data analysis and social network analysis
 Ex. Clustering images or video/audio clips, gene/protein sequences, etc.
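The vector quantization example above can be sketched as follows: once cluster centers are known, each value is stored as the index of its nearest center, which compresses the data at the cost of some precision. The pixel intensities and the three-entry `codebook` are hypothetical values chosen for illustration; in practice the codebook would come from a clustering run.

```python
def quantize(values, codebook):
    """Replace each value by the index of its nearest codebook entry."""
    return [min(range(len(codebook)), key=lambda i: abs(v - codebook[i]))
            for v in values]

def reconstruct(codes, codebook):
    """Map indices back to codebook entries (a lossy reconstruction)."""
    return [codebook[i] for i in codes]

# Hypothetical grayscale pixel intensities and cluster centers.
pixels = [12, 15, 200, 198, 90, 95, 14, 201]
codebook = [14, 92, 200]

codes = quantize(pixels, codebook)
approx = reconstruct(codes, codebook)
print(codes)
print(approx)
```

Each pixel now needs only enough bits to index the codebook (here, three entries) rather than a full intensity value.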
Session 3: Cluster Analysis: Requirements and Challenges
Considerations for Cluster Analysis
 Partitioning criteria
 Single level vs. hierarchical partitioning (often, multi-level hierarchical partitioning is desirable)
 Separation of clusters
 Exclusive (e.g., one customer belongs to only one region) vs. non-exclusive (e.g., one document may belong to more than one class)
 Similarity measure
 Distance-based (e.g., Euclidean, road network, vector) vs. connectivity-based (e.g., density or contiguity)
 Clustering space
 Full space (often when low dimensional) vs. subspaces (often in high-dimensional clustering)
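To make the distance-based vs. connectivity-based distinction concrete, the sketch below compares plain Euclidean distance with a simple contiguity test: two points count as connected if one can be reached from the other through hops no longer than a threshold. The helper `connected`, the threshold `eps`, and the sample points are assumptions introduced for this example.

```python
import math
from collections import deque

def connected(points, start, goal, eps):
    """Return True if goal is reachable from start through hops of length <= eps."""
    seen = {start}
    queue = deque([start])
    while queue:
        i = queue.popleft()
        if i == goal:
            return True
        for j in range(len(points)):
            if j not in seen and math.dist(points[i], points[j]) <= eps:
                seen.add(j)
                queue.append(j)
    return False

# Hypothetical data: a chain of six points along the x-axis plus one isolated point.
chain = [(float(x), 0.0) for x in range(6)]   # indices 0..5
chain.append((0.0, 3.0))                      # index 6, off the chain

far_but_linked = connected(chain, 0, 5, eps=1.5)
near_but_isolated = connected(chain, 0, 6, eps=1.5)
print(math.dist(chain[0], chain[5]), far_but_linked)
print(math.dist(chain[0], chain[6]), near_but_isolated)
```

Point 6 is closer to point 0 in Euclidean terms than point 5 is, yet only point 5 is linked to point 0 by a chain of near neighbors. A connectivity-based method would therefore group 0 with 5, which is how such methods can recover elongated or arbitrarily shaped clusters.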
Requirements and Challenges
 Quality
 Ability to deal with different types of attributes: Numerical, categorical,
text, multimedia, networks, and mixture of multiple types
 Discovery of clusters with arbitrary shape
 Ability to deal with noisy data
 Scalability
 Clustering all the data instead of only samples
 High dimensionality
 Incremental or stream clustering and insensitivity to input order
 Constraint-based clustering
 User-given preferences or constraints; domain knowledge; user queries
 Interpretability and usability

Session 4: Cluster Analysis: A Multi-Dimensional Categorization
Cluster Analysis: A Multi-Dimensional Categorization
 Technique-Centered
 Distance-based methods
 Density-based and grid-based methods
 Probabilistic and generative models
 Leveraging dimensionality reduction methods
 High-dimensional clustering
 Scalable techniques for cluster analysis
 Data Type-Centered
 Clustering numerical data, categorical data, text data, multimedia data, time-series data, sequences, stream data, networked data, uncertain data
 Additional Insight-Centered
 Visual insights, semi-supervised, ensemble-based, validation-based
Session 5: An Overview of Typical Clustering Methodologies
Typical Clustering Methodologies (I)
 Distance-based methods
 Partitioning algorithms: K-Means, K-Medians, K-Medoids
 Hierarchical algorithms: Agglomerative vs. divisive methods
 Density-based and grid-based methods
 Density-based: The data space is explored at a high level of granularity, and a post-processing step then joins the dense regions into clusters of arbitrary shape
 Grid-based: Individual regions of the data space are formed into a grid-like structure
 Probabilistic and generative models: Modeling data from a generative process
 Assume a specific form of the generative model (e.g., mixture of Gaussians)
 Model parameters are estimated with the Expectation-Maximization (EM)
algorithm (using the available dataset, for a maximum likelihood fit)
 Then estimate the generative probability of the underlying data points
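The EM procedure for a mixture of Gaussians can be sketched as below for the simplest case: one-dimensional data and two components. This is an illustrative sketch under stated assumptions, not a production implementation: the function names, the fixed number of iterations, the variance floor, the starting means, and the sample data are all choices made for this example.

```python
import math

def gaussian_pdf(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def em_two_gaussians(data, means, iterations=50):
    """Fit a two-component 1-D Gaussian mixture by maximum likelihood with EM."""
    means = list(means)
    variances = [1.0, 1.0]
    weights = [0.5, 0.5]
    resp = []
    for _ in range(iterations):
        # E-step: probability that each point was generated by each component.
        resp = []
        for x in data:
            joint = [weights[k] * gaussian_pdf(x, means[k], variances[k])
                     for k in range(2)]
            total = sum(joint)
            resp.append([j / total for j in joint])
        # M-step: re-estimate the parameters from the responsibility-weighted data.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            weights[k] = nk / len(data)
            means[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
            var = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, data)) / nk
            variances[k] = max(var, 1e-6)  # floor keeps a component from collapsing
    return weights, means, variances, resp

# Hypothetical data drawn around 1.0 and 5.0; starting means are guesses.
data = [0.8, 1.0, 1.2, 0.9, 1.1, 4.9, 5.0, 5.2, 5.1, 4.8]
weights, means, variances, resp = em_two_gaussians(data, means=[0.0, 6.0])
print(means)
```

The rows of `resp` give, for each data point, the estimated probability of having been generated by each component. Unlike k-means, this yields a soft (non-exclusive) assignment.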
Typical Clustering Methodologies (II)
 High-dimensional clustering
 Subspace clustering: Find clusters on various subspaces
 Bottom-up, top-down, correlation-based methods vs. δ-cluster methods
 Dimensionality reduction: A vertical form (i.e., columns) of clustering
 Columns are clustered; may cluster rows and columns together (co-clustering)
 Probabilistic latent semantic indexing (PLSI), followed by latent Dirichlet allocation (LDA): Topic modeling of text data
 A cluster (i.e., topic) is associated with a set of words (i.e., dimensions) and a
set of documents (i.e., rows) simultaneously
 Nonnegative matrix factorization (NMF) (as one kind of co-clustering)
 A nonnegative matrix A (e.g., word frequencies in documents) can be approximately factorized into two nonnegative low-rank matrices U and V
 Spectral clustering: Use the spectrum of the similarity matrix of the data to
perform dimensionality reduction for clustering in fewer dimensions
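The NMF idea above, approximating a nonnegative matrix A by a product of two nonnegative low-rank matrices U and V, can be sketched with the well-known multiplicative update rules. The slides do not say which algorithm is used, so the choice of multiplicative updates is an assumption, as are the helper names, the random initialization, and the small document-by-word matrix.

```python
import random

def matmul(X, Y):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Y)] for row in X]

def transpose(X):
    return [list(col) for col in zip(*X)]

def frobenius_error(A, U, V):
    """Sum of squared differences between A and the product U V."""
    UV = matmul(U, V)
    return sum((a - b) ** 2 for ra, rb in zip(A, UV) for a, b in zip(ra, rb))

def nmf(A, rank, iterations=200, seed=0, eps=1e-9):
    """Approximate A (m x n, nonnegative) as U (m x rank) times V (rank x n)."""
    rng = random.Random(seed)
    m, n = len(A), len(A[0])
    U = [[rng.random() + 0.1 for _ in range(rank)] for _ in range(m)]
    V = [[rng.random() + 0.1 for _ in range(n)] for _ in range(rank)]
    for _ in range(iterations):
        # V <- V * (U^T A) / (U^T U V), element-wise
        Ut = transpose(U)
        num, den = matmul(Ut, A), matmul(matmul(Ut, U), V)
        V = [[V[i][j] * num[i][j] / (den[i][j] + eps) for j in range(n)]
             for i in range(rank)]
        # U <- U * (A V^T) / (U V V^T), element-wise
        Vt = transpose(V)
        num, den = matmul(A, Vt), matmul(U, matmul(V, Vt))
        U = [[U[i][j] * num[i][j] / (den[i][j] + eps) for j in range(rank)]
             for i in range(m)]
    return U, V

# Hypothetical word counts: rows are documents, columns are words.
# Documents 0-1 use only words 0-1; documents 2-3 use only words 2-3.
A = [[2, 2, 0, 0],
     [4, 4, 0, 0],
     [0, 0, 3, 3],
     [0, 0, 1, 1]]
U, V = nmf(A, rank=2)
print(frobenius_error(A, U, V))
```

Reading the result as a co-clustering: each row of `U` says how strongly a document belongs to each of the two factors, and each row of `V` says which words make up that factor, so documents and words are grouped at the same time.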
Session 6: An Overview of Clustering Different Types of Data
Clustering Different Types of Data (I)
 Numerical data
 Most of the earliest clustering algorithms were designed for numerical data
 Categorical data (including binary data)
 Discrete data, no natural order (e.g., sex, race, zip-code, and market-basket)
 Text data: Popular in social media, Web, and social networks
 Features: High-dimensional, sparse, value corresponding to word frequencies
 Methods: Combination of k-means and agglomerative; topic modeling; co-clustering
 Multimedia data: Image, audio, video (e.g., on Flickr, YouTube)
 Multi-modal (often combined with text data)
 Contextual: Containing both behavioral and contextual attributes
 Images: Position of a pixel represents its context, value represents its behavior
 Video and music data: Temporal ordering of records represents its meaning
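The text-data features listed above (high-dimensional, sparse, values corresponding to word frequencies) can be illustrated with a small sketch that stores each document as a sparse word-count map and compares documents by cosine similarity. Cosine similarity is a common choice for text but is not named in the slides; it, the helper names, and the sample sentences are assumptions for this example.

```python
import math
from collections import Counter

def term_frequencies(text):
    """Sparse representation: only words that occur are stored."""
    return Counter(text.lower().split())

def cosine_similarity(a, b):
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Hypothetical documents.
docs = [
    "clustering groups similar data objects",
    "clustering finds groups in data",
    "stock prices rise and fall",
]
vectors = [term_frequencies(d) for d in docs]
sim_01 = cosine_similarity(vectors[0], vectors[1])
sim_02 = cosine_similarity(vectors[0], vectors[2])
print(sim_01, sim_02)
```

The first two documents share three of their five words and score well above the third, which shares none. A text clustering method would typically build on such pairwise similarities.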
Clustering Different Types of Data (II)
 Time-series data: Sensor data, stock markets, temporal tracking, forecasting, etc.
 Data are temporally dependent
 Time: contextual attribute; data value: behavioral attribute
 Correlation-based online analysis (e.g., online clustering of stocks to find stock tickers)
 Shape-based offline analysis (e.g., cluster ECG based on overall shapes)
 Sequence data: Weblogs, biological sequences, system command sequences
 Contextual attribute: Placement (rather than time)
 Similarity functions: Hamming distance, edit distance, longest common subsequence
 Sequence clustering: Suffix tree; generative model (e.g., Hidden Markov Model)
 Stream data:
 Real-time, evolution and concept drift, single pass algorithm
 Create efficient intermediate representation, e.g., micro-clustering
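The three sequence similarity functions named above (Hamming distance, edit distance, longest common subsequence) can be sketched with standard dynamic programming. The function names and test strings are chosen for this example only.

```python
def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length sequences")
    return sum(x != y for x, y in zip(a, b))

def edit_distance(a, b):
    """Minimum number of insertions, deletions, and substitutions (Levenshtein)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,              # delete x
                            curr[j - 1] + 1,          # insert y
                            prev[j - 1] + (x != y)))  # substitute or match
        prev = curr
    return prev[-1]

def lcs_length(a, b):
    """Length of the longest common subsequence."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

print(hamming("ACGT", "ACCT"))
print(edit_distance("kitten", "sitting"))
print(lcs_length("AGGTAB", "GXTXAYB"))
```

Hamming distance only applies to sequences of equal length, while edit distance and LCS also handle insertions and deletions, which matters for biological sequences and weblogs of varying length.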
Clustering Different Types of Data (III)
 Graphs and homogeneous networks
 Almost any kind of data can be represented as a graph with similarity values as edge weights
 Methods: Generative models; combinatorial algorithms (graph cuts); spectral
methods; non-negative matrix factorization methods
 Heterogeneous networks
 A network consists of multiple typed nodes and edges (e.g., bibliographical data)
 Clustering different typed nodes/links together (e.g., NetClus)
 Uncertain data: Noise, approximate values, multiple possible values
 Incorporating probabilistic information can improve the quality of clustering
 Big data: Model systems may store and process very big data (e.g., weblogs)
 Ex. Google’s MapReduce framework
 Use Map function to distribute the computation across different machines
 Use Reduce function to aggregate results obtained from the Map step
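The Map and Reduce roles described above can be sketched in plain Python for a single k-means iteration. This is a single-process simulation and does not use any real MapReduce API: the function names, the list-of-partitions stand-in for machines, and the sample data are all assumptions for this example.

```python
import math
from collections import defaultdict

def map_assign(point, centroids):
    """Map: emit (index of nearest centroid, point)."""
    nearest = min(range(len(centroids)),
                  key=lambda c: math.dist(point, centroids[c]))
    return nearest, point

def reduce_mean(points):
    """Reduce: average all points that share a centroid index."""
    return tuple(sum(coord) / len(points) for coord in zip(*points))

def kmeans_step(partitions, centroids):
    grouped = defaultdict(list)
    for partition in partitions:          # each partition stands in for one machine
        for point in partition:
            key, value = map_assign(point, centroids)
            grouped[key].append(value)    # shuffle: group emitted values by key
    return {key: reduce_mean(values) for key, values in grouped.items()}

# Hypothetical data split across two "machines".
partitions = [[(1.0, 1.0), (9.0, 9.0)], [(2.0, 2.0), (10.0, 10.0)]]
new_centroids = kmeans_step(partitions, centroids=[(0.0, 0.0), (8.0, 8.0)])
print(new_centroids)
```

The map step needs only the current centroids and one point at a time, so it can run independently on each partition. Only the grouped results have to be brought together for the reduce step.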
Session 7: An Overview of User Insights and Clustering
User Insights and Interactions in Clustering
 Visual insights: One picture is worth a thousand words
 Human eyes: High-speed processor linking with a rich knowledge-base
 A human can provide intuitive insights; HD-Eye: visualizing high-dimensional (HD) clusters
 Semi-supervised insights: Passing user’s insights or intention to system
 User-seeding: A user provides a number of labeled examples, approximately
representing categories of interest
 Multi-view and ensemble-based insights
 Multi-view clustering: Multiple clusterings represent different perspectives
 Multiple clustering results can be ensembled to provide a more robust solution
 Validation-based insights: Evaluation of the quality of clusters generated
 May use case studies, specific measures, or pre-existing labels
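Validation against pre-existing labels can be sketched with purity, one simple external measure: each cluster is credited with its most frequent true label, and the credited counts are divided by the total number of objects. The slides do not name a specific measure, so purity is an assumption here, as are the function name and the sample labels.

```python
from collections import Counter

def purity(cluster_ids, true_labels):
    """Fraction of objects that carry the majority label of their cluster."""
    clusters = {}
    for cid, label in zip(cluster_ids, true_labels):
        clusters.setdefault(cid, []).append(label)
    majority_total = sum(Counter(labels).most_common(1)[0][1]
                         for labels in clusters.values())
    return majority_total / len(true_labels)

# Hypothetical clustering result and ground-truth labels.
cluster_ids = [0, 0, 0, 1, 1, 1, 1]
true_labels = ["a", "a", "b", "b", "b", "b", "a"]
score = purity(cluster_ids, true_labels)
print(score)
```

Purity reaches 1.0 trivially when every object is its own cluster, so it is usually read alongside the number of clusters or combined with other measures.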
Session 8: Summary
Summary: Cluster Analysis—An Introduction
 What Is Cluster Analysis?
 Applications of Cluster Analysis
 Cluster Analysis: Requirements and Challenges
 Cluster Analysis: A Multi-Dimensional Categorization
 An Overview of Typical Clustering Methodologies
 An Overview of Clustering Different Types of Data
 An Overview of User Insights and Clustering
 Summary
Recommended Readings
 Major Reference Books on Cluster Analysis
 Jiawei Han, Micheline Kamber, and Jian Pei. Data Mining: Concepts and Techniques. Morgan Kaufmann, 3rd ed., 2011 (Chapters 10 & 11)
 Charu Aggarwal and Chandan K. Reddy (eds.). Data Clustering: Algorithms and Applications. CRC Press, 2014
 Mohammed J. Zaki and Wagner Meira, Jr. Data Mining and Analysis: Fundamental Concepts and Algorithms. Cambridge University Press, 2014
 Reference paper for this lecture
 Charu Aggarwal. An Introduction to Cluster Analysis. In Aggarwal and Reddy (eds.). Data Clustering: Algorithms and Applications (Chapter 1). CRC Press, 2014
