IRS Unit-4

www.jntuworld.
com
www.jwjobs.net
Unit- 4
4.1 Introduction to Clustering
4.2 Thesaurus Generation 4.3 Item Clustering 4.4 Hierarchy of Clustering Introduction to Clustering : Clustering: provide a grouping of similar objects into a class under a more general title Clustering also allows linkage between clusters to be specified An information database can be viewed as being composed of a number of independent items indexed by a series of index terms Term clustering : Used to create a statistical thesaurus Increase recall by expanding searches with related terms (query expansion) Document clustering : Used to create document clusters The search can retrieve items similar to an item of interest, even if the query would not have retrieved the item (resultant set expansion) Result-set clustering
Process of Clustering : Define the domain for clustering

Thesaurus: medical terms Documents: set of items to be clustered Identify those objects to be used in the clustering process and reduce the potential for erroneous data that could induce errors in the clustering process
Determine the attributes of the objects to be clustered

Thesaurus: determine the specific words in the objects to be used Documents: may focus on specific zone within the items that are to be used to determine similarity Reduce erroneous association
www.jntuworld.com
www.jntuworld.com
www.jwjobs.net
Determine the relationships between the attributes whose co-occurrence in objects suggest those objects should be in the same class
Thesaurus: determine which words are synonyms and the strength of their relationships Documents: define a similarity function based on word co-occurrences that determine the similarity between two items
Apply some algorithm to determine the class(es) to which each object will be assigned Guidelines on the Characteristics of the Classes :
A well-defined semantic definition should exist for each class There is a risk that the name assigned to the semantic definition of the class could also be misleading
The size of the classes should be within the same order of magnitude Within a class, one object should not dominate the class Whether an object can be assigned to multiple classes or just one must be decided at creation time
Additional Decisions for Thesaurus:

Word coordination approach: specify if phrases as well as individual terms are to be clustered Word relationships Equivalence, hierarchical, non-hierarchical Parts-Wholes, Collocation, Paradigmatic, Taxonomy and Synonym, and Antonym Contrasted words, child-of, parent-of, is-part-of, has-part
Homograph resolution Vocabulary constraints Normalization: constrain the thesaurus to stems vs. complete words Specificity: eliminate specific words or use general terms for class identifiers
Thesaurus Generation:
Automatically generated thesauri contain classes that reflect the use of words in the corpus
www.jntuworld.com
www.jntuworld.com
www.jwjobs.net
The classes do not naturally have a name, but are just a groups of statistically similar terms Basic idea for term clustering: the more frequently two terms co-occur in the same items, the more likely they are about the same concept Term-clustering algorithms differ by the completeness with which terms are correlated The more complete the correlation, the higher the time and computational overhead to create the clusters
Complete Term Relation Method:

The similarity between every term pair is calculated as a basis for determining the clusters Using the vector model for clustering
A similarity measure is required to calculate the similarity between to terms
Term-Term Matrix
Threshold = 10
Term Relationship Matrix
www.jntuworld.com
www.jntuworld.com
www.jwjobs.net
The final step in creating clusters is to determine when two objects (words) are in the same cluster Cliques, single link, stars, and connected components
Cliques : Cliques require all terms in a cluster to be within the threshold of all other terms
Class 1 (Term 1, Term 3, Term 4, Term 6) Class 2 (Term 1, Term 5) Class 3 (Term2, Term 4, Term 6) Class 4 (Term 2, Term 6, Term 8) Class 5 (Term 7)
Clique Algorithm:
www.jntuworld.com
www.jntuworld.com
www.jwjobs.net
Single Link :
Any term that is similar to any term in the cluster an be added to the cluster It is impossible for a term to be in two different clusters
Overhead in assignment of terms to classes: O(n2)
Star :
Select a term and then places in the class all terms that are related to that term Terms not yet in classes are selected as new seeds until all terms are assigned to a class There are many different classes that can be created using the Star technique If we always choose as the starting point for a class the lowest numbered term not already in a class Class 1 (Term 1, Term 3, Term 4, Term 5, Term 6) Class 2 (Term 2, Term 4, Term 8, Term 6) Class 3 (Term 7)
String :
Starts with a term and includes in the class one additional term that is similar to the term selected and not already in a class The new term is then used as the new node and the process is repeated until no new terms can be added because the term being analyzed does not have another term related to it or the terms related to it are already in the class A new class is started with any terms not currently in any existing class Using the guidelines to select the lowest number term similar to the current term and not to select any term already in an existing class produces the following classes Class 1 (Term 1, Term 3, Term 4, Term 2, Term 6, Term 8)
www.jntuworld.com
www.jntuworld.com
www.jwjobs.net
Class 2 (Term 5) Class 3 (Term 7)
Network Diagram of Term Similarities :
Comparison : Clique
Produce classes that have the strongest relationships between all of the words in the class The class is more likely to be describing a particular concept Produce more classes than the other techniques
Single link
Partition the term into classes Produce the fewest number of classes and the weakest relationship between terms It is possible that two terms that have a similarity value of zero will be in the same class Classes will not be associated with a concept but cover diverse concepts
The selection of the technique is also governed by the density of the term relationship matrix and objects of the thesaurus
www.jntuworld.com
www.jntuworld.com
www.jwjobs.net
Term relationship matrix Sparse: toward single link Dense: toward clique
Objects of the thesaurus Cliques provide the highest precision when the statistical thesaurus is used for query term expansion The single link algorithm maximize recall but can cause selection of many non-relevant items
The single link algorithm has the least overhead in assignment of terms to classes: O(n2) comparisons
Clustering Using Existing Clusters:

Start with a set of existing clusters The initial assignment of terms to the clusters is arbitrary and revised by revalidating every term assignment to a cluster To minimum calculations, centroids are calculated for each cluster Centroid: the average of all of the vectors in a cluster
The similarity between all existing terms and the centroids of the clusters can be calculated The term is reallocated to the cluster that has the highest similarity
The process stops when minimal movement between clusters is detected
Illustration of Centroid Movement:
Initial centroids for clusters
www.jntuworld.com
www.jntuworld.com
www.jwjobs.net
Centroids after reassigning terms

Example : Initial assignment Class 1 = (Term 1, Term 2) Class 2 = (Term 3, Term 4) Class 3 = (Term 5, Term 6)
Initial centroid Class 1 = (0+4)/2, (3+1)/2, (3+0)/2, (0+1)/2, (2+2)/2 = 4/2, 4/2, 3/2, 1/2, 4/2 Class 2 = 0/2, 7/2, 0/2, 3/2, 5/2 Class 3 = 2/2, 3/2, 3/2, 0/2, 5/2
Apply the simple similarity measure between each of the 8 terms and 3 centroids One technique for breaking ties is to look at the similarity weights of the other terms in the class and assign it to the class that has the most similar weights
www.jntuworld.com
www.jntuworld.com
www.jwjobs.net
New Centroids and Cluster assignments
Clustering Using Existing Clusters :

Computation overhead: O(n) The number of classes is defined at the start of the process and cannot grow It is possible to be fewer classes at the end of the process
Since all terms must be assigned to a class, it forces terms to be allocated to classes, even if their similarity to the class is very weak compared to other terms assigned
One Pass Assignment :

Minimum overhead: only one pass of all of the terms is used to assign terms to classes Algorithm The first term is assigned to the first class Each additional term is compared to the centroids of the existing classes A threshold is chosen. If the item is greater than the threshold, it is assigned to the class with the highest similarity A new centroid has to be calculated for the modified class
If the similarity to all of the existing centroids is less than the threshold, the term is the first item in a new class This process continues until all items are assigned to classes
www.jntuworld.com
www.jntuworld.com
www.jwjobs.net
Example with threshold of 10

Class generated Class 1 = Term 1, Term 3, Term 4 Class2 = Term 2, Term 6, Term 8 Class 3 = Term 5 Class 4 = Term 7
Centroids values used Class 1 (Term 1, Term 3) = 0, 7/2, 3/2, 0, 4/2 Class 1 (Term 1, Term 3, Term 4) = 0, 10/3, 3/3, 3/3, 7/3 Class 2 (Term 2, Term 6) = 6/2, 3/2, 0/2, 1/2, 6/2
Minimum computation on the order of O(n) Does not produce optimum clustered classes Different classes can be produced if the order in which the items are analyzed changes
Item Clustering :
Think about manual item clustering, which is inherent in any library or filing system one item one category Automatic clustering one primary category and several secondary categories Similarity between documents is based on two items that have terms in common The similarity function is performed between rows of the item matrix
Item/Item and Item Relationship Matrix :
www.jntuworld.com
www.jntuworld.com
www.jwjobs.net
Hierarchy of Clusters :
Hierarchical clustering Hierarchical agglomerative clustering (HAC) start with un-clustered items and perform pair-wise similarity measures to determine the clusters Hierarchical divisive clustering start with a cluster and breaking it down into smaller clusters
Objectives of Creating a Hierarchy of Clusters :

Reduce the overhead of search Perform top-down searches of the centroids of the clusters in the hierarchy and trim those branches that are not relevant
Provide for visual representation of the information space Visual cues on the size of clusters (size of ellipse) and strengths of the linkage between clusters (dashed line, sold line)
Expand the retrieval of relevant items A user, once having identified an item of interest, can request to see other items in the cluster The user can increase the specificity of items by going to children clusters or by increasing the generality of items being reviewed by going to a parent clusters
Dendogram for Visualizing Hierarchical Clusters:
www.jntuworld.com
www.jntuworld.com
www.jwjobs.net
Hierarchical Agglomerative Clustering Method (HACM) :

First the N * N item relationship matrix is formed Each item is placed into its own clusters The following two steps are repeated until only one cluster exists The two clusters that have the highest similarity are found These two clusters are combined, and the similarity between the newly formed cluster and the remaining clusters recomputed
As the larger cluster is found, the clusters that merged together are tracked and form a hierarchy
HACM Example : Assume document A, B, C, D, and E exist and a document-document similarity matrix exists {{A} {B} {C} {D} {E}} {{A, B} {C} {D} {E}} {A, B, C, D, E}
Similarity Measure between Clusters : Single link clustering

The similarity between two clusters (inter-cluster similarity) is computed as the maximum similarity between any two documents in the two clusters, each initially from a separate cluster
www.jntuworld.com
www.jntuworld.com
www.jwjobs.net
Complete linkage
Inter-cluster similarity is computed as the minimum of the similarity between any documents in the two clusters such that one document is from each cluster
Group average
As a node is considered for a cluster, its average similarity to all nodes in the cluster is computed. It is placed in the cluster as long as its average similarity is higher than its average similarity for any other cluster
Wards method Clusters are joined so that their merger minimizes the increase in the sum of the distances from each individual document to the centroid of the cluster containing it If cluster A merged with either cluster B or cluster C, the centroids for the potential cluster AB and AC are computed as well as the maximum distance of any document to the centroid. The cluster with the lowest maximum is used
Analysis of HACM:
Wards method typically took the longest to implement Single link and complete linkage are somewhat similar in run time Clusters found in the single link clustering tend to be fair broad in nature and provide lower effectiveness Choosing the best cluster as the source of relevant documents results in very close effectiveness results for complete link, Wards and group average clustering A consistent drop in effectiveness for single link clustering is noted
www.jntuworld.com

IRS Unit-4

Uploaded by

Copyright:

Available Formats

IRS Unit-4

Uploaded by

Document Information

Original Description:

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

IRS Unit-4

Uploaded by

Copyright:

Available Formats

www.jntuworld.

Process of Clustering : Define the domain for clustering

Determine the attributes of the objects to be clustered

Additional Decisions for Thesaurus:

Complete Term Relation Method:

A similarity measure is required to calculate the similarity between to terms

Term Relationship Matrix

Overhead in assignment of terms to classes: O(n2)

Class 2 (Term 5) Class 3 (Term 7)

Network Diagram of Term Similarities :

Clustering Using Existing Clusters:

The process stops when minimal movement between clusters is detected

Illustration of Centroid Movement:

Initial centroids for clusters

Centroids after reassigning terms

New Centroids and Cluster assignments

Clustering Using Existing Clusters :

One Pass Assignment :

Example with threshold of 10

Item/Item and Item Relationship Matrix :

Objectives of Creating a Hierarchy of Clusters :

Dendogram for Visualizing Hierarchical Clusters:

Hierarchical Agglomerative Clustering Method (HACM) :

Similarity Measure between Clusters : Single link clustering

You might also like