Automatic Text Summarization by Extracting Significant Sentences
Aditi Sharan
Jawaharlal Nehru University, New Delhi, India
Namita Jain
Madurai Kamraj University, India
Hazra Imran
Jamia Hamdard, New Delhi, India
Abstract
The advent of the WWW and of the Internet as a telecommunications network has
changed the concept of what it means to be well informed: the best-informed person is
not the one with the most information but the one with the best means of obtaining and
assimilating exactly the information required. This situation is proving to be a great
stimulus for research into, and the development of applications in, the field of
information retrieval and extraction technology. In this context, automatic document
summarization systems are a new step towards optimizing the treatment of documents in
digital formats and tailoring it to the needs of users. The objective of this paper is to
review the work done in the field of automatic text summarization and to develop an
algorithm for generating automatic text summaries. Experiments are then performed
based on the proposed algorithm.
1. Introduction
With the coming of the information revolution, electronic documents are becoming the
principal medium of business and academic information. Thousands of electronic
documents are produced and made available on the Internet each day. To make effective
use of these online documents, it is crucial to be able to extract their gist. In most cases
summaries are written by humans, but the enormous amount of information now available
online makes it impossible for people to summarize all of the available textual material.
The need to access the essential content of documents accurately, in order to satisfy
users' demands, calls for the development of computer programs that are able to produce
text summaries.
Research and development in the area of automatic text summarization has been growing
in importance with the rapid growth of the Web and online information services.
Summarization is the art of abstracting key content from one or more information
sources. People keep abreast of world affairs by listening to news bites; they go to
movies largely on the basis of reviews. With summaries, they can make effective
decisions in less time.
Automatic summarization is the creation of a shortened version of a text by a computer
program. As access to data has increased, so has interest in automatic summarization.
In a typical extractive system, a set of heuristics is used to determine the most important
sentences within the document. Each heuristic assigns a (positive or negative) score to a
sentence. After all heuristics have been applied, the x highest-scoring sentences are
included in the summary. The individual heuristics are weighted according to their
importance.
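As an illustration of this scheme, the following sketch combines a set of heuristic scoring
functions with weights and keeps the x best sentences; the function and parameter names
are illustrative and not taken from any particular system:

def summarize(sentences, heuristics, weights, x):
    """Score each sentence with every heuristic, combine the scores with the
    given weights, and return the x highest-scoring sentences in document order."""
    scored = []
    for idx, sent in enumerate(sentences):
        total = sum(w * h(sent) for h, w in zip(heuristics, weights))
        scored.append((total, idx, sent))
    top = sorted(scored, reverse=True)[:x]
    return [sent for _, _, sent in sorted(top, key=lambda t: t[1])]

Each heuristic here is any function mapping a sentence to a number, for example a
term-frequency score or a position score.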
In this article, Section 2 outlines some techniques in automatic text summarization;
Section 3 presents the algorithm developed for automatic summarization by extracting
significant sentences using Luhn's Keyword Cluster Method; Section 4 discusses the
experiment and results of the proposed method; and Section 5 concludes.
2. Techniques in Automatic Text Summarization
Techniques for automatic text summarization are usually classified into three families:
(i) those based on the surface (no linguistic analysis is performed); (ii) those based on
entities named in the text (some kind of lexical recognition and classification is carried
out); and (iii) those based on discourse structure (some kind of structural, usually
linguistic, processing of the document is required).
Commercial products usually make use of surface techniques. One classical method is
the selection of statistically frequent terms in the document: sentences containing more
of the most frequent terms (strings) are selected as a summary of the document. Another
group of methods is based on position: position in the text, in the paragraph, in the depth
or embedding of the section, etc. Other methods exploit salient parts of the text such as
titles, subtitles and leads; it is assumed that sentences containing words of the title are
better candidates for summarizing the whole document.
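A minimal sketch of the classical frequency-based surface method follows; the names and
the choice of the top-k cut-off are illustrative assumptions, not details from the paper:

from collections import Counter

def frequent_term_summary(sentences, top_k_terms=10, n_sentences=3):
    """Score each sentence by how many of the document's most frequent terms
    it contains, and return the n highest-scoring sentences in document order."""
    words = [w.lower() for s in sentences for w in s.split()]
    frequent = {w for w, _ in Counter(words).most_common(top_k_terms)}
    scored = [(sum(w.lower() in frequent for w in s.split()), i, s)
              for i, s in enumerate(sentences)]
    top = sorted(scored, reverse=True)[:n_sentences]
    return [s for _, _, s in sorted(top, key=lambda t: t[1])]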
Methods based on entities are grounded in techniques that allow linguistic units to be
recognized among the mass of alphanumeric symbol strings. To do this, lemmatizers,
morphological parsers and part-of-speech taggers are needed since, on the one hand, the
same string might belong to different parts of speech (e.g. 'bomb' as noun or verb) and,
on the other, different strings might be instances of the same part of speech (e.g. 'bomb'
and 'bombed').
Finally, simple methods based on structure can, for instance, take advantage of the
hypertextual scaffolding of an HTML page. More complex methods, using linguistic
resources and techniques such as those mentioned above, might build a rhetorical
structure of the document, allowing its most relevant fragments to be detected. Clearly,
when a text is created from fragments of a previous original, reference chains and, in
general, text cohesion are easily lost.
Based on these techniques, we now discuss some important work done in this area.
One approach described in the literature combines sentence extraction, sentence
reduction and sentence combination. The sentence extraction module identifies the most
important sentences in the input document: its input is the document itself, and its output
is a list of key sentences selected by the module.
The sentence reduction module removes nonessential phrases from an extracted key
sentence, resulting in a shortened version of that sentence. The reduction program uses
multiple sources of knowledge to decide which phrases in an extracted sentence should
be removed, including lexical information, syntactic information from linguistic
databases, and statistical information from a corpus.
The sentence combination module merges the reduced sentences with other phrases or
reduced sentences into coherent sentences. The rule-based combination module can
apply different combination operations depending on the phrases and sentences to be
combined.
The training corpora for sentence reduction and sentence combination are constructed by
an automatic summary-sentence decomposition program.
Lexical chains for text summarization, suggested by Barzilay (1997), Lei (2007) and
Chan (2005), identify lexical chains in a given text and extract sentences based on them
to form summaries. A lexical chain is a set of semantically related words in a text; the
relations between words are taken from the WordNet database. A procedure for
constructing lexical chains follows three steps:
• Select a set of candidate words.
• For each candidate word, find an appropriate chain.
• If one is found, insert the word into the chain and update it.
In the preprocessing step, all the words that appear as noun entries in WordNet are
chosen as candidates. Three kinds of relations are defined: (i) extra-strong (between a
word and its repetition), (ii) strong (between two words connected by a WordNet
relation) and (iii) medium-strong (when the path between the synsets of the words is
longer than one).
To obtain the summary of any given text it is necessary to identify the strongest chains
among all those produced by the algorithm. Once the strongest chains are selected, the next
step of the summarization algorithm is to extract full sentences from the original text. A
chain is considered to have high concentration if its concentration is the maximum over
all chains.
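The following is a rough sketch of the chain-building step, assuming NLTK's WordNet
interface is available (it is not the toolkit used in the cited papers); only the repetition and
shared-synset relations are modelled here, and the medium-strong relation is omitted for
brevity:

from nltk.corpus import wordnet as wn  # requires nltk.download('wordnet')

def related(word_a, word_b):
    # Extra-strong relation: repetition. Strong relation (simplified): the two
    # nouns share at least one WordNet synset.
    if word_a == word_b:
        return True
    synsets_a = set(wn.synsets(word_a, pos=wn.NOUN))
    synsets_b = set(wn.synsets(word_b, pos=wn.NOUN))
    return bool(synsets_a & synsets_b)

def build_chains(candidate_nouns):
    chains = []  # each chain is a list of semantically related words
    for word in candidate_nouns:
        for chain in chains:
            if any(related(word, member) for member in chain):
                chain.append(word)  # insert the word and update the chain
                break
        else:
            chains.append([word])   # no suitable chain found: start a new one
    return chains

The strongest chains could then be ranked, for example, by their length and the
homogeneity of their members before sentences are extracted.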
The Trainable Summarizer of Kupiec (1995) and Jen-Yuan (2005) extracts sentences
based on the following discrete feature set for the given text: the sentence-length cut-off
feature, the fixed-phrase feature, the paragraph feature, the thematic word feature and the
uppercase word feature.
For each sentence s, the probability of s being included in the summary S is calculated
from the k given features Fj, j = 1, ..., k. Assuming statistical independence of the
features, this can be expressed using Bayes' rule as follows:
P(s ∈ S | F1, F2, ..., Fk) = P(s ∈ S) · ∏(j=1..k) P(Fj | s ∈ S) / ∏(j=1..k) P(Fj)
P(s ∈ S) is a constant, and P(Fj | s ∈ S) and P(Fj) can be estimated directly from the
training set by counting occurrences. This yields a simple Bayesian classification
function that assigns for each sentence s a score which can be used to select sentences for
inclusion in the summary.
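A minimal sketch of this Bayesian scoring, assuming discrete feature values and
probability tables already estimated from a training corpus by counting (the names are
illustrative):

def bayes_score(feature_values, p_value_given_summary, p_value, p_summary):
    # Computes P(s in S) * product_j P(Fj = v | s in S) / P(Fj = v),
    # which is proportional to P(s in S | F1 .. Fk) under independence.
    score = p_summary
    for j, v in enumerate(feature_values):
        score *= p_value_given_summary[j][v] / p_value[j][v]
    return score

Here feature_values holds the observed values of the k features for one sentence, and
p_value_given_summary[j] and p_value[j] map a value of feature j to its estimated
probability; sentences are ranked by this score.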
3. Proposed Method
In this section we present a method for generating an extractive summary. The method is
based on Luhn's Keyword Cluster technique.
In our proposed method the sentences are selected by identifying clusters of significantly
related words. The proposed work is divided into two parts:
a. To develop an algorithm for generating summary by selecting significant
sentences.
b. To implement Luhn’s Keyword Cluster method.
The algorithm for automatic text summarization by extracting significant sentences is as
follows:
Step 1: Preprocess the text
Step 2: Remove Stopwords
Step 3: Perform Stemming
Step 4: Represent text in vector-space model.
Step 5: Identify the significant words.
a. Sort all the words in descending order of frequency in VSM.
b. Calculate threshold frequency ‘ms’ for selecting significant words.
c. Select significant words whose frequency > threshold ms.
Step 6: Calculate the Scores SS1, SS2, SS3 by Luhn’s Keyword Cluster Method.
Step 7: Add all the scores to give the Sentence Significance Score SSS.
Step 8: Extract sentences to generate a summary.
3.2 Stopword Removal
Words that are too frequent among the documents in a collection are not good
discriminators. A word which occurs in 80% of the documents in the collection is useless
for retrieval purposes; such words are referred to as stopwords and are normally filtered
out as potential index terms. Articles, prepositions and conjunctions are natural
candidates for a list of stopwords.
Eliminating them reduces the size of the text by 40% or more. A fixed stopword list is
constructed, and all words in the document matching these stopwords are
eliminated.
3.3 Stemming
A stem is the portion of a word which is left after the removal of its affixes; e.g.
"connect" is the stem of the variants connected, connecting, connection and connections.
Stemming improves retrieval performance because it reduces variants of the same root
word to a common concept. It also reduces the size of the indexing structure, because the
number of distinct index terms is reduced. Porter's stemming algorithm is the most
popular stemming method.
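A sketch of Steps 1-3 of the proposed algorithm (tokenization, stopword removal and
stemming) is given below. It assumes NLTK's stopword list and Porter stemmer, which
are one possible choice; the paper does not prescribe a particular toolkit:

import re
from nltk.corpus import stopwords      # requires nltk.download('stopwords')
from nltk.stem import PorterStemmer

STOPWORDS = set(stopwords.words('english'))
STEMMER = PorterStemmer()

def preprocess(sentence):
    tokens = re.findall(r"[a-zA-Z]+", sentence.lower())    # Step 1: tokenize
    tokens = [t for t in tokens if t not in STOPWORDS]     # Step 2: remove stopwords
    return [STEMMER.stem(t) for t in tokens]               # Step 3: stem

Applying preprocess to every sentence yields the token lists from which the vector space
model is built.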
3.4 Vector Space Model
The vector space model suggested by Salton (1975, 1983) is a very widely used
representation for the classification and clustering of text documents. The term weights
of the words determine the weight of each word in the document vector; we have used
term frequency to assign weights.
After creating the VSM of term frequencies, the terms are sorted in decreasing order of
frequency, so that the most significant terms appear at the top.
The lower limit for significance needs to be defined. Following the work of Trombos
(1997), the minimum occurrence count required for significant terms in a medium-sized
document was taken to be 7, where a medium-sized document is defined as containing no
more than 40 sentences and no fewer than 25 sentences. For documents outside this
range, the limit for significance is computed as
ms = 7 + [0.1 (L − NS)]
where ms = the measure of significance, i.e. the threshold for selecting significant words,
L = the limit (25 for NS < 25 and 40 for NS > 40), and
NS = the number of sentences in the document.
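A small sketch of Step 5, computing the threshold ms and selecting significant words
from the preprocessed sentences (keeping ms at 7 for documents between 25 and 40
sentences is an assumption consistent with the definition above):

from collections import Counter

def significance_threshold(num_sentences):
    if num_sentences < 25:
        return 7 + 0.1 * (25 - num_sentences)
    if num_sentences > 40:
        return 7 + 0.1 * (40 - num_sentences)
    return 7.0

def significant_words(preprocessed_sentences):
    # preprocessed_sentences: list of token lists, one per sentence
    freq = Counter(w for sent in preprocessed_sentences for w in sent)
    ms = significance_threshold(len(preprocessed_sentences))
    return {w for w, f in freq.items() if f > ms}

For the worked example in Section 4 (NS = 21), significance_threshold returns
7 + 0.1 × (25 − 21) = 7.4.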
Luhn (1958) and Adenike (2001) reasoned that the closer certain significant words are to
each other, the more specifically an aspect of the subject is being treated. Two significant
words are
considered significantly related if they are separated by not more than five non-
significant words e.g.
‘The sentence [scoring process utilizes information both from the structural]
organization.’
The cluster significance score factor for a sentence is given by the following formula:
SS1 = SW² / TW    (1)
where SS1 = the sentence score,
SW = the number of bracketed significant words, and
TW = the total number of bracketed words.
If two or more clusters of significant words appear in a given sentence, the one with the
highest score is chosen as the sentence score.
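A sketch of the cluster scoring for a single sentence follows. It brackets runs of
significant words separated by at most five non-significant words and returns the best
cluster's SW²/TW; the handling of sentences with no significant words is an assumption:

def cluster_score(tokens, significant, gap=5):
    positions = [i for i, t in enumerate(tokens) if t in significant]
    if not positions:
        return 0.0
    best = 0.0
    start = end = positions[0]   # current cluster spans tokens[start..end]
    count = 1                    # number of significant words in the cluster
    for pos in positions[1:]:
        if pos - end - 1 <= gap:             # still within the same cluster
            end, count = pos, count + 1
        else:                                 # close this cluster, start a new one
            best = max(best, count ** 2 / (end - start + 1))
            start, end, count = pos, pos, 1
    best = max(best, count ** 2 / (end - start + 1))
    return best

Here (end - start + 1) is TW, the number of bracketed words, and count is SW; the
highest-scoring cluster in the sentence gives SS1.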
The title of an article often reveals the major subject of that document. This hypothesis
was examined on TREC documents, where the title of each article was found to convey
the general idea of its contents. We follow the approach of Adesina and Jones (2001)
exactly: each constituent term in the title is looked up in the body of the text, and for
each sentence a title score is computed as the ratio of title terms present in the sentence
to the total number of terms in the title:
SS2 = TTS / TT    (2)
where SS2 = the title score for a sentence,
TTS = the number of title terms found in the sentence, and
TT = the total number of terms in the title.
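A corresponding sketch of the title score, using the preprocessed title terms (counting
only unique title terms is an assumption):

def title_score(sentence_tokens, title_tokens):
    # SS2: fraction of the (unique) title terms that occur in the sentence.
    title_terms = set(title_tokens)
    if not title_terms:
        return 0.0
    present = sum(1 for t in title_terms if t in sentence_tokens)
    return present / len(title_terms)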
Edmundson (1969) noted that the position of a sentence within a document is useful in
determining its importance to the document. The first sentences of a document often
provide important information about the content of the document. Thus the first two
sentences of an article are assigned a location score computed as:
SS3 = 1 / NS    (3)
where SS3 = the location score for a sentence, and
NS = the number of sentences found in the document.
The final score for each sentence is calculated by summing the individual score factors
obtained for each method used:
SSS = SS1 + SS2 + SS3    (4)
The objective of the summary generation system is to generate a stand-alone summary
that can be used in place of the entire document. The lower limit of the summary length
was set at 25% of the original length. Accordingly, the threshold for extracting the
significant sentences is calculated as
TSS = 0.25 × NS
where TSS = the number of significant sentences. The TSS value may vary depending
upon the percentage chosen. The TSS sentences having the maximum SSS values are
extracted.
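Putting Steps 6-8 together, the sketch below reuses the cluster_score and title_score
helpers sketched earlier; rounding TSS up to a whole number of sentences is an
assumption:

import math

def extract_summary(sentences, tokenized, significant, title_tokens):
    ns = len(sentences)
    scored = []
    for i, toks in enumerate(tokenized):
        ss1 = cluster_score(toks, significant)          # keyword-cluster score
        ss2 = title_score(toks, title_tokens)           # title score
        ss3 = 1.0 / ns if i < 2 else 0.0                # location score (first two sentences)
        scored.append((ss1 + ss2 + ss3, i))             # SSS = SS1 + SS2 + SS3
    tss = max(1, math.ceil(0.25 * ns))                  # extract 25% of the sentences
    top = sorted(scored, reverse=True)[:tss]
    return [sentences[i] for _, i in sorted(top, key=lambda t: t[1])]

Here sentences holds the original sentences, tokenized their preprocessed token lists,
significant the set returned by significant_words, and title_tokens the preprocessed title.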
Based on the above algorithm, we now present our experiment and results.
4. Experiment and Results
Table I shows the contents of the input file for which the summary is generated.
Table I
The contents of the input file
Table II
Vector Space Model
According to Section 3.6, the lower limit for significance is computed with L = 25 and
NS = 21:
ms = 7 + [0.1(L − NS)] = 7 + [0.1(25 − 21)] = 7.4
The significant words, i.e. those with frequency greater than the threshold of 7.4, are
'text' and 'summary'.
Table III shows the scores SS1, SS2, SS3 and SSS for each sentence, as explained in
Section 3.6.
Table III
The scores SS1, SS2 , SS3 and SSS for each sentence
[Table III lists each sentence of the input text, which describes the SweSum summarizer,
together with its SS1, SS2, SS3 and SSS scores; significant-word clusters are bracketed,
e.g. "It considers if the [text is tagged with bold text] tag, first paragraph tag or numerical
values." and "All this information is compiled and used to [summarize the original text]".
The individual numeric scores are not legible in this copy.]
Table IV
Summary of the input text generated by extracting significant sentences
Table V
Summary generated by the Microsoft Word AutoSummarize tool
The summary generated by AutoSummarize contains almost the same sentences as the
summary generated by Luhn's Keyword Cluster Method of extracting significant
sentences. This indicates that automatic extracts can be defined, programmed and
produced in an operational system to supplement, and perhaps compete with, traditionally
produced summaries.
5. Conclusion
We have presented an algorithm for automatic text summarization that extracts significant
sentences using Luhn's Keyword Cluster Method together with title and location scores,
and have shown that its summaries are comparable to those produced by the Word
AutoSummarize tool. Further, the authors plan to extend this work by applying machine
learning techniques to generate automatic summaries.
References