Clustering System
Information Technology
Text Mining and Clustering
2013
Prabin Lama
KEYWORDS: information extraction, document pre-processing, document clustering, K-means, news article headlines
CONTENTS
LIST OF ABBREVIATIONS (OR) SYMBOLS
1 INTRODUCTION
2.2.1 Tokenization
2.2.3 Lemmatization
2.6 Motivation
2.6.3 Objectives
3 REQUIREMENTS ANALYSIS
5 RESEARCH METHODOLOGY
6 EXPERIMENTAL SETUP
6.1 Corpus
6.2 Dataset
6.3 Approach
7 IMPLEMENTATION
8.1.1 Document Input
8.1.2 Analysis
9 CONCLUSION
10.1 Limitations
REFERENCES
APPENDICES
Appendix 1. Sample code for overall clustering and text mining
Appendix 1. Sample code for overall clustering and text mining
FIGURES
Figure 1. Use case diagram of the system.
Figure 2. Activity diagram for the system.
Figure 3. Research approach and design strategy of the study.
Figure 4. The overall research process.
Figure 5. Implementation model of the project.
Figure 6. Flowchart for document preprocessing.
Figure 7. Flowchart for clustering component.
Figure 8. Matrix representation of custom data sample.
Figure 9. Graphical representation of custom data.
Figure 10. Graphical representation of Iteration 0 of the K-means algorithm.
Figure 11. Matrix representation of distance between documents and centroids.
Figure 12. Matrix representation of the documents in the cluster.
Figure 13. Graphical representation of Iteration 1 of the K-means algorithm.
Figure 14. Representation of distance between documents and new centroids.
Figure 15. Matrix representation of the documents in the cluster.
Figure 16. Graphical representation of Iteration 2 of the K-means algorithm.
Figure 17. Representation of distance between documents and new centroids.
Figure 18. Matrix representation of the documents in the cluster.
TABLES
Table 1. Custom data sample with the term frequency.
LIST OF ABBREVIATIONS (OR) SYMBOLS
FCM     Fuzzy C-Means
HTML    HyperText Markup Language
IE      Information Extraction
LSI     Latent Semantic Indexing
NLP     Natural Language Processing
VSM     Vector Space Model
XML     Extensible Markup Language
1 INTRODUCTION
Along with the development of new smart technologies, the world is going
digital. As a result, the web grows day by day and has become a huge
source of information. Looking up precise, relevant information and
extracting it from the web has become a time-consuming task. Many
techniques are used for information extraction from the web, and text
mining is one of them.
This thesis, entitled Clustering System based on Text Mining using the
K-means Algorithm, focuses on the use of text mining techniques and
the K-means algorithm to create clusters of similar news article headlines.
The project study is based on text mining, with a primary focus on data mining
and information extraction. The news headlines and the links to the different
news portals are fetched via an XML file into the clustering system. The news
headlines within the XML file are then preprocessed using document
preprocessing techniques and finally grouped into clusters based on their
similarities. These clusters are displayed on a sample webpage with the
corresponding links to the news portal sites.
2.1 Text Mining
Text mining, also known as text data mining or knowledge discovery
from textual databases, is generally the process of extracting interesting
and non-trivial patterns or knowledge from unstructured text documents. The
extracted information is linked together to form new facts or new hypotheses to
be explored further by more conventional means of experimentation.
2.2 Document Pre-processing
2.2.1 Tokenization
In tokenization, all alphabetic characters in close proximity within a string
are part of one token; likewise with numbers. The resulting list of tokens
may or may not contain punctuation and whitespace. For example, tokenizing
the phrase "Friends, Romans, Countrymen" produces the tokens:
Friends
Romans
Countrymen
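A minimal tokenizer along these lines can be sketched as follows (Python is used here for illustration only; the thesis implementation itself is written in C#):

```python
import re

def tokenize(text):
    """Split text into tokens: runs of adjacent letters (or digits)
    form one token; punctuation and whitespace are dropped."""
    return re.findall(r"[A-Za-z]+|[0-9]+", text)

print(tokenize("Friends, Romans, Countrymen, lend me your ears;"))
# ['Friends', 'Romans', 'Countrymen', 'lend', 'me', 'your', 'ears']
```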
The general strategy for determining a stop list is to sort the terms by
collection frequency and then make the most frequently used terms the
members of the stop list; these are discarded during indexing.
Some examples of stop words are: a, an, the, and, are, as, at, be, for,
from, has, he, in, is, it, its, of, on, that, to, was, were, will, with.
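Filtering a token list against such a stop list can be sketched as follows (a Python illustration; the set shown is a small subset of a real stop list, and the thesis implementation itself is in C#):

```python
# A small stop list; a production system would use a fuller one.
STOP_WORDS = {"a", "an", "the", "and", "are", "as", "at", "be", "for",
              "from", "has", "he", "in", "is", "it", "its", "of", "on",
              "that", "to", "was", "were", "will", "with"}

def remove_stop_words(tokens):
    """Discard tokens that appear in the stop list (case-insensitive)."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]

print(remove_stop_words(["The", "government", "is", "at", "a", "crossroads"]))
# ['government', 'crossroads']
```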
2.2.3 Lemmatization
For instance, if operated on the token saw, stemming may return just s, while
lemmatization will go through morphological analysis and return see or saw
depending on whether saw is used as a verb or a noun in the sentence.
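The saw example can be illustrated with a toy lookup-based lemmatizer (a Python sketch; the tiny lemma table here is an assumption for illustration, where a real system would rely on a full morphological analyser):

```python
# Miniature lemma table keyed by (surface form, part of speech);
# illustrative only, not the thesis's actual lexical resource.
LEMMAS = {("saw", "VERB"): "see", ("saw", "NOUN"): "saw",
          ("running", "VERB"): "run"}

def lemmatize(token, pos):
    """Return the base form of a token given its part of speech."""
    return LEMMAS.get((token.lower(), pos), token.lower())

print(lemmatize("saw", "VERB"))  # see
print(lemmatize("saw", "NOUN"))  # saw
```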
2.2.4 Synonym Expansion
According to McCarthy et al. (2007), synonym expansion, also known as lexical
substitution, is the task of replacing a certain word in a given context with
another suitable word similar in meaning.
When a word has multiple meanings, synonym expansion tries to find the
correct meaning of the word used in a sentence by identifying its synonyms (or
substitutes) in the given context. Given a sentence, for example "she was a
bright girl", the task is to find a synonym that could replace the word bright
without changing the meaning of the sentence. If we take bright as the target
word, then the word brilliant could be a suitable substitution, which would both
maintain the meaning of the target word and fit the context of the sentence.
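This substitution step can be sketched with a toy synonym dictionary (Python, illustrative only; the dictionary entries and function names are assumptions, not the thesis's actual resource):

```python
# Each word maps to a canonical representative of its synonym set,
# so that synonymous headlines share the same terms after expansion.
SYNONYMS = {"brilliant": "bright", "smart": "bright", "clever": "bright"}

def expand_synonyms(tokens):
    """Replace each token with the canonical word for its synonym set."""
    return [SYNONYMS.get(t.lower(), t.lower()) for t in tokens]

print(expand_synonyms(["She", "was", "a", "brilliant", "girl"]))
# ['she', 'was', 'a', 'bright', 'girl']
```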
2.3 Document Representation
2.4 Information Extraction
2.6 Motivation
Web pages contain a huge amount of information in unstructured or
semi-structured format, which can be extracted and transformed into valuable
information as required. Different techniques are available for this
purpose. Text mining is one of them and is an important step in the knowledge
discovery process, as explained above. News articles are one of the
important sources of information that keep people up to date with current
happenings in the world. Sometimes, for similar news headlines, the content
may differ across different news portals. The availability of similar headlines
and their corresponding links to the news portals under a single roof would be
an efficient option for exploring information on a similar topic.
To answer this question, different text mining and clustering techniques are
used to find the similarities between the news headlines, and a GUI is created
to display the clusters of similar news headlines with their corresponding links
to the original sites.
2.6.3 Objectives
The objectives of this research work are:
To reduce the complexity of visiting each site manually to read similar news.
3 REQUIREMENTS ANALYSIS
3.1 Data Collection
In order to carry out the study, the headlines of news articles are collected
from random online news portals (online), or some random headlines can be
fetched (offline). The collected data are stored in an XML file for further
document preprocessing. The XML file is then fed to the system by referencing
its location on the local machine.
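The thesis does not give the exact schema of the XML file, but assuming a simple layout with one element per headline and its link, reading it could look like this (a Python sketch; the element names news, item, title and link are assumptions):

```python
import xml.etree.ElementTree as ET

# Hypothetical corpus in the assumed schema.
SAMPLE = """<news>
  <item><title>Election results delayed</title><link>http://example.com/a</link></item>
  <item><title>Results of the election announced</title><link>http://example.com/b</link></item>
</news>"""

def read_headlines(xml_text):
    """Return (headline, link) pairs parsed from the XML corpus."""
    root = ET.fromstring(xml_text)
    return [(item.findtext("title"), item.findtext("link"))
            for item in root.findall("item")]

for title, link in read_headlines(SAMPLE):
    print(title, "->", link)
```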
3.2 Functional Requirements
3.3 Non-functional Requirements
3.3.2 Performance
The developed system must be able to group the given headlines into clusters
based on the K-means algorithm.
3.3.3 Scalability
The system must support changes such as editing the headlines in the XML
file, with the changes then reflected in the clusters as well.
3.4 Resource Requirements
Windows XP or greater.
Feasibility Analysis
4.2 System Planning
5 RESEARCH METHODOLOGY
According to Kothari (2004), research comprises defining and redefining
problems, formulating hypothesis or suggested solutions, collecting, organizing
and evaluating data, making deductions and reaching conclusions, and further
testing the conclusions to determine whether they fit the formulated hypothesis. Research
methodology may be broadly classified as Exploratory and Conclusive. This
chapter discusses the research methodology and the design strategy
implemented to answer the research question of this project.
5.1
Since all the document pre-processing techniques and the algorithm planned
for use during this project are existing ones, the research approach is more
exploratory than conclusive.
The use of the existing K-means algorithm for creating clusters of news
headlines from the corpus of the dataset, achieved through various document
preprocessing and text mining techniques, and the generation of an idea which
can be further explored to develop a new system related to web data mining
explain why the research methodology used during this project is exploratory.
Data mining is a research field that uses either data originated by the
researcher for addressing a specific problem at hand or secondary data which
have already been collected for other purposes, for example to generate a new
hypothesis or idea based on existing studies and research. Secondary data can
be classified as internal and external. The research strategy for this project is
the analysis of secondary data, which consists of news article headlines
gathered from different news portals and stored in an XML file. The research
approach and design strategy of this study are illustrated in Figure 3.
Figure 3. Research approach and design strategy of the study.
5.2
In this section, the overall process used for the project study is described. The
process includes step-by-step actions performed for the clustering of the news
headlines from different news portals. The overall research process is illustrated
in Figure 4.
[Figure 4: the overall research process, in which extracted news content passes through tokenization, stop word removal, lemmatization and synonym expansion, then document representation (tf vector space model), producing the preprocessed document that is fed to document clustering.]
Stop words are common words such as a, an, the, and prepositions. Hence the
tokens contained in the stop word list are discarded.
5.2.1.3 Lemmatization
The detailed concept and explanation of lemmatization have been given
in Section 2.2.3. After the text is tokenized, each inflected term is reduced to its
base form so that it can be analysed as a single term. Lemmatization goes
through the morphological analysis of each word in the text, so it is preferred
to stemming.
5.2.1.4 Synonym Expansion
The detailed concept and explanation of synonym expansion have already
been given in Section 2.2.4. Synonym expansion is carried out by searching
for each token in the dictionary and transforming each word to its base word.
The dictionary consists of a list of words and all of their synonyms.
5.2.2 Document Representation
The vector space model is one of the efficient methods of representing
documents as vectors using the term frequency weighting scheme, as
mentioned in Section 2.3. The entire collection of the dataset from the XML file
is represented as vectors using the vector space model.
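The term-frequency representation can be sketched as follows (Python, for illustration only; the documents shown are made-up examples, not the thesis dataset):

```python
from collections import Counter

def tf_vectors(documents):
    """Represent each tokenized document as a term-frequency vector
    over the shared vocabulary of the whole collection."""
    vocab = sorted({term for doc in documents for term in doc})
    return vocab, [[Counter(doc)[term] for term in vocab] for doc in documents]

docs = [["election", "results", "election"], ["results", "delayed"]]
vocab, vectors = tf_vectors(docs)
print(vocab)    # ['delayed', 'election', 'results']
print(vectors)  # [[0, 2, 1], [1, 0, 1]]
```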
5.2.3 K-means Algorithm
Use: for partitioning, where each cluster's center is represented by the mean
value of the objects in the cluster.
Input:
k: the number of clusters.
Output:
A set of k clusters.
Method:
Step 1: Choose the number k of clusters to be determined.
Step 2: Choose k centroids Ck randomly as the initial centers of the clusters.
Step 3: Repeat
3.1: Assign each object to its closest cluster center using the Euclidean
distance.
3.2: Compute the new cluster centers by calculating the mean points.
Step 4: Until
4.1: there is no change in the cluster centers, OR
4.2: no object changes its cluster.
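The steps above can be sketched in Python (an illustration only, not the thesis's C# implementation; the sample points correspond to the two-term document vectors used in the worked example later in the thesis, with D1 = (1, 1) assumed):

```python
import math
import random

def kmeans(points, k, max_iter=100, seed=0):
    """Plain K-means: random initial centroids, Euclidean assignment,
    mean update; stops when no object changes its cluster."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)          # Step 2: random initial centers
    assignment = None
    for _ in range(max_iter):
        new_assignment = [min(range(k), key=lambda c: math.dist(p, centroids[c]))
                          for p in points]     # Step 3.1: nearest centroid
        if new_assignment == assignment:       # Step 4: no object moved
            break
        assignment = new_assignment
        for c in range(k):                     # Step 3.2: recompute mean points
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return centroids, assignment

points = [(1, 1), (2, 1), (4, 3), (5, 4)]      # D1..D4
centroids, labels = kmeans(points, k=2)
print(centroids, labels)
```

With these points the algorithm converges to the clusters {D1, D2} and {D3, D4}, with centroids (1.5, 1.0) and (4.5, 3.5).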
6 EXPERIMENTAL SETUP
The confidence in the overall system performance has yet to be justified by
experimental results. This chapter discusses the experimentation carried out to
evaluate the clustering system. The requirements for the experimentation are
as follows:
6.1 Corpus
The corpus is a large collection of documents. The corpus for this project is the
headlines of the news articles extracted in an XML file along with the
corresponding site names. The headlines are stored as structured text.
6.2 Dataset
The dataset is constructed by retrieving the structured text from the XML file
which consists of the structured news headlines. The dataset is then fed to
the document pre-processing model as described in Section 5.2.1.
6.3 Approach
For the experiment, the K-means algorithm is applied to the dataset. The
clusters are created based on the minimum distance between the documents
and the cluster centroids. For each dataset of n clustering examples, n
experiments are run, where each clustering is taken in turn as the single
example test set, with the n-1 remaining clusterings as the training set.
The experiment uses the following logical steps:
1) Read n datasets from the XML file.
2) Apply document preprocessing techniques as discussed in Section 5.2.1 to
every dataset.
3) Create the n-dimensional document vector for the dataset.
4) Pass the document vector to the K-means clustering algorithm.
5) Observe the clusters obtained and measure the performance of the model
based on the cohesion measures of the clusters.
6) Carry out the experiment for a different dataset containing news headlines
from another day.
7 IMPLEMENTATION
Implementation is the process of executing a plan or design to achieve some
output. In this thesis, the implementation encompasses extracting the news
headlines into an XML document, fetching them into the system to go through
the document pre-processing techniques, forwarding the pre-processed
documents to the clustering system, and obtaining clusters of similar news
headlines as the final output.
7.1 Implementation model
7.2 Implementation phases
[Figure 6: flowchart for document preprocessing, from inputting the corpus of news from the XML document to producing the preprocessed document.]
In the above matrix representation (Figure 12), the value 1 in a row indicates
that the document belongs to the corresponding cluster, and the value 0
indicates that the document is not a member of the cluster.
Thus, Cluster 1 has D1 as its member and Cluster 2 has D2, D3 and D4 as its
members.
Step 3: In this step, Iteration 1 of the algorithm runs, where the new centroid of
each cluster is calculated based on the new membership.
Cluster 1 has only one member which is D1. So, the centroid remains the same
(C1). Cluster 2 has three members, thus the new centroid is the average
coordinates of the three members (D2, D3 and D4).
C2 = ((2+4+5)/3, (1+3+4)/3) = (11/3, 8/3)
After the new centroids are computed, the distances of all the documents to
them are computed as in Step 2.
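Using the coordinates from the worked example (with D1 = (1, 1) assumed), this distance computation can be checked numerically:

```python
import math

# Document vectors and the centroids after Iteration 1.
docs = {"D1": (1, 1), "D2": (2, 1), "D3": (4, 3), "D4": (5, 4)}
c1, c2 = (1, 1), (11 / 3, 8 / 3)

for name, point in docs.items():
    d1, d2 = math.dist(point, c1), math.dist(point, c2)
    nearest = 1 if d1 < d2 else 2
    print(f"{name}: to C1 {d1:.2f}, to C2 {d2:.2f} -> Cluster {nearest}")
```

D2 is now closer to C1 (distance 1.00 versus roughly 2.36), so it changes cluster in the next iteration, while D3 and D4 stay with C2.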
[Table: clustering results with columns Documents (D1-D4), Term1, Term2 and Result (Cluster); the cell values are not recoverable.]
9 CONCLUSION
The first goal of the current study was to use text mining techniques on web
news article headlines. The project study involved a great deal of work in
various areas of information retrieval and text mining and focused on various
methods for document pre-processing and document clustering.
Text mining and clustering techniques are powerful, and this project study was
based entirely on them. The system was created to find the similarities among
news article headlines. Various techniques were applied to prepare the corpus
of pre-processed documents. Lastly, the K-means clustering algorithm was
used to create the clusters of similar news article headlines.
Similar news headlines were grouped into a single cluster. A real-world
application of the project study would help people find similar news headlines
from different news portals on a single platform. This would not have been
possible without text mining and clustering techniques. In general, it is not
feasible to manually look for similar news in each of the portals and then
compare them to find the similarities.
REFERENCES
Ali Shah, N. & ElBahesh, E. M. Topic-Based Clustering of News Articles, University of
Alabama at Birmingham. Retrieved September 23, 2013 from:
http://pdf.aminer.org/000/006/766/topic_based_clustering_of_news_articles.pdf
Bouras, C. & Tsogkas, V. 2010. "W-kmeans: Clustering News Articles Using WordNet".
Retrieved September 5, 2013 from:
http://ru6.cti.gr/ru6/publications/116262780379.pdf
Buscaldi, D., Rosso, P. & Arnal Sanchis, E. 2005. A WordNet-based Query Expansion method
for Geographical Information Retrieval. Retrieved September 2, 2013 from:
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.99.8031&rep=rep1&type=pdf
Manning, C. D., Raghavan, P. & Schütze, H. 2008. Introduction to Information Retrieval,
Cambridge, England: Cambridge University Press. Retrieved September 4, 2013 from:
http://nlp.stanford.edu/IR-book/pdf/irbookonlinereading.pdf
Falinouss, P. 2007. Stock Trend Prediction Using News Articles. Masters Thesis, Lulea
University of Technology, Sweden. Retrieved September 20, 2013 from:
http://epubl.ltu.se/1653-0187/2007/071/LTU-PB-EX-07071-SE.pdf
Fung, G. 2001. A Comprehensive Overview of Basic Clustering Algorithms. Retrieved
September 6, 2013 from:
http://pages.cs.wisc.edu/~gfung/clustering.pdf
Huang, C., Simon, P., Hsieh, S. & Prevot, L. 2007. Rethinking Chinese Word Segmentation:
Tokenization, Character Classification, or Word break Identification. Retrieved September 2,
2013 from:
http://delivery.acm.org/10.1145/1560000/1557791/p69huang.pdf?ip=86.50.66.20&id=1557791&acc=OPEN&key=BF13D071DEA4D3F3B0AA4BA89B4BCA5B&CFID=389226789&CFTOKEN=39573449&__acm__=1387138322_f651d65355dcc035ef1e98e656194624
Jaiswal, P. "Clustering Blog Information" 2007. Master's Thesis Projects, Paper 36, San Jose
State University. Retrieved September 18, 2013 from:
http://scholarworks.sjsu.edu/cgi/viewcontent.cgi?article=1035&context=etd_projects
Kobayashi, M. & Takeda, K. 2000. "Information retrieval on the web". Retrieved September 20,
2013 from:
https://www.ischool.utexas.edu/~i385d/readings/Kobayashi.pdf
Kothari C.R. 2004. Research Methodology: Methods and Techniques. Retrieved October 8,
2013 from:
http://books.google.com/books?id=8c6gkbKiF4C&printsec=frontcover&source=gbs_atb&redir_esc=y#v=onepage&q&f=false
Plisson J., Lavrac N. & Mladenic D. 2004. A Rule based Approach to Word Lemmatization.
Retrieved September 25, 2013 from:
http://eprints.pascal-network.org/archive/00000715/01/Pillson-Lematization.pdf
Maheshwari, P. & Agrawal, J. 2010. "Centroid Based Text Clustering". Retrieved October 3,
2013 from:
http://www.ijest.info/docs/IJEST10-02-09-36.pdf
Qiujun, L. 2010. "Extraction of News Content for Text Mining Based on Edit Distance".
Retrieved October 3, 2013 from:
http://www.jofcis.com/publishedpapers/2010_6_11_3761_3777.pdf
Murali Krishna, S. & Durga Bhavani, S. 2010. "An Efficient Approach for Text Clustering Based
on Frequent Itemsets". Retrieved October 2, 2013 from:
http://www.textmining.xpg.com.br/ejsr_42_3_04.pdf
Vectomova, O. & Wang, Y. 2006. "A study of the effect of term proximity on query
expansion" (Abstract). Retrieved October 2, 2013 from:
http://jis.sagepub.com/content/32/4/324.abstract
McCarthy, D. & Navigli, R. 2009. English Lexical Substitution. Retrieved October 4, 2013 from:
http://www.dianamccarthy.co.uk/task10index.html
APPENDICES
Appendix 1. Sample code for overall clustering and text mining (NewsClustering.cs).
class NewsClustering
{
    // Create the lists of NewsInfo objects holding the headline details and the resulting clusters
    public static List<NewsLibrary.NewsInfo> NewsList = new List<NewsLibrary.NewsInfo>();
    public static List<List<NewsLibrary.NewsInfo>> clusterList = new List<List<NewsLibrary.NewsInfo>>();

    static void Main()
    {
        // Pass the XML file of news and obtain the list of news objects
        NewsLibrary.PopulateNews news = new NewsLibrary.PopulateNews();
        NewsList = news.populate(@"..\..\..\title.xml");

        // Pass the list of news objects to tokenization and obtain the tokens of each headline
        NewsLibrary.Tokenization tokenization = new NewsLibrary.Tokenization();
        NewsList = tokenization.Tokenize(NewsList);

        // Pass the list of news objects to stop-word removal and obtain stop-word-free headlines
        NewsLibrary.StopWord stopword = new NewsLibrary.StopWord();
        NewsList = stopword.remove(NewsList);

        // Pass the list of news objects to lemmatization and lemmatize the tokens in each headline
        NewsLibrary.Lemmatization lematization = new NewsLibrary.Lemmatization();
        NewsList = lematization.Lematize(NewsList);

        // Pass the list of news objects to synonym expansion