
Automatic Text Summarization by Extracting Significant Sentences

Aditi Sharan
Jawaharlal Nehru University, New Delhi, India
Namita Jain
Madurai Kamaraj University, India
Hazra Imran
Jamia Hamdard, New Delhi, India

Abstract

The emergence of the WWW and of the Internet as a telecommunications network has
changed the concept of what it means to be informed: the best-informed person is not
the one with the most information, but the one with the best means for obtaining and
assimilating exactly the information required. This situation has proved a great
stimulus for research into, and the development of applications in, the field of
information retrieval and extraction technology. In this context, automatic document
summarization systems are a step forward towards optimizing the treatment of
documentation in digital formats and tailoring it to the needs of users. The objective
of this paper is to survey the work done in the field of automatic text summarization
and to develop an algorithm for generating automatic text summaries. Experiments are
then performed based on the proposed algorithm.

Keywords: Automatic Text Summarization, Significant Sentences

1. Introduction

With the coming of the information revolution, electronic documents are becoming the
principal medium of business and academic information. Thousands of electronic
documents are produced and made available on the Internet each day. To utilize these
online documents effectively, it is crucial to be able to extract their gist. In most
cases, summaries are written by humans; but the enormous amount of information now
available online makes it impossible for any human to summarize all the available
text. The need to access the essential content of documents accurately, to satisfy
user demand, calls for the development of computer programs able to produce text
summaries.


Research and development in the area of automatic text summarization has been growing
in importance with the rapid growth of the Web and online information services.
Summarization is the art of abstracting key content from one or more information
sources. People keep abreast of world affairs by listening to news bites, and they
even choose movies largely on the basis of reviews; with summaries, they can make
effective decisions in less time.
Automatic summarization is the creation of a shortened version of a text by a computer
program. As access to data has increased, so has interest in automatic summarization.

1.1 Types of Summaries

Extracted Summary: At the least complex end is summarization through text extraction:
the creation of summaries from terms, phrases and sentences pulled directly from the
source text using statistical analysis at a surface level. Sparck Jones (1971) views
text extraction as "what you see is what you get," because parts of the source text
are extracted directly.

Abstracted Summary: This process is sometimes known as machine understanding, a
multidisciplinary endeavor involving information retrieval, linguistics and artificial
intelligence. In its simplest terms, automatic abstracting is fact extraction.

1.2 Various Approaches to Generate Summaries by Extraction

1. Domain Dependent Approaches: Several domain-dependent approaches to summarization
use Information Extraction techniques to identify the most important information
within the document. Work in this area also includes techniques for report generation
and event summarization from specialized databases.

2. Domain Independent Approaches: Domain-independent approaches often use statistical
techniques in combination with shallow language technologies to extract salient
document fragments. The statistical techniques used are similar to those employed in
Information Retrieval and include vector space models, term frequency and inverse
document frequency.

1.3 Significant Sentence Extraction

Most automatic summarization techniques are based on extracting significant sentences
by some means. Sentence extraction is a technique in which statistical heuristics are
used to identify the most salient sentences of a text. It is a low-cost approach
compared with more knowledge-intensive, deeper approaches, which require additional
knowledge bases such as ontologies or linguistic knowledge. In short, sentence
extraction works as a filter which allows only important sentences to pass.
Sentence extraction summaries can give valuable clues to the main points of a document
and are frequently sufficiently intelligible to human readers. Usually, a combination of
heuristics is used to determine the most important sentences within the document. Each
heuristic assigns a (positive or negative) score to the sentence. After all heuristics have
been applied, the x highest-scoring sentences are included in the summary. The
individual heuristics are weighted according to their importance.
In this paper, Section 2 outlines some techniques in automatic text summarization;
Section 3 presents the algorithm developed for automatic summarization by extracting
significant sentences through Luhn's keyword cluster method; Section 4 discusses the
experiment and results of the proposed method; and Section 5 concludes.

2. Techniques for Automatic Text Summarization

Techniques for automatic text summarization are usually classified into three
families: (i) surface-based (no linguistic analysis is performed); (ii) based on
entities named in the text (some kind of lexical recognition and classification); and
(iii) based on discourse structure (some kind of structural, usually linguistic,
processing of the document is required).
Commercial products usually make use of surface techniques. One classical method is
the selection of statistically frequent terms in the document: sentences containing
more of the most frequent terms (strings) are selected as a summary of the document.
Another group of methods is based on position: position in the text, in the paragraph,
in the depth or embedding of the section, and so on. Other methods profit from
prominent parts of the text, such as titles, subtitles and leads: it is assumed that
sentences containing words of the title are better candidates for summarizing the
whole document.
Methods based on entities are grounded in techniques that allow linguistic units to be
recognized among the mass of alphanumeric symbol strings. To do this, lemmatizers,
morphological parsers and part-of-speech taggers are needed since, on the one hand,
one string might belong to different parts of speech (e.g. 'bomb', noun or verb) and,
on the other, different strings might be forms of the same word (e.g. 'bomb' and
'bombed').
Finally, simple methods based on structure can, for instance, take advantage of the
hypertextual scaffolding of an HTML page. More complex methods, using linguistic
resources and techniques such as those mentioned above, might build a rhetorical
structure of the document, allowing its most relevant fragments to be detected.
Clearly, when creating a text from fragments of a previous original, reference chains
and text cohesion in general are easily lost.
Based on these techniques, now we discuss some important work done in this area.

2.1. Cut and Paste Method

The Cut and Paste method of Jing (2000) performs domain-independent, single-document
summarization. The first stage, sentence extraction, identifies the most important
sentences in the input document: its input is the document, and its output is a list
of key sentences selected by the extraction module. The cut-and-paste generation
component then edits the extracted sentences, particularly by simulating two revision
operations: sentence reduction and sentence combination.
The sentence reduction module removes nonessential phrases from an extracted key
sentence, producing a shortened version of it. The reduction program uses multiple
sources of knowledge to decide which phrases in an extracted sentence should be
removed, including lexical information, syntactic information from linguistic
databases, and statistical information from corpora.
The sentence combination module merges the sentences resulting from reduction with
other phrases or reduced sentences into coherent sentences. The rule-based combination
module can apply different combination operations depending on the phrases and
sentences to be combined.
The training corpora for sentence reduction and sentence combination are constructed by
an automatic summary sentence decomposition program.

2.2. Extracting Sentence Segments for Text Summarization

The system proposed by Chuang (2000) extracts sentence segments to generate a summary.
Briefly, the system works as follows:
• First, sentences are broken into segments at special cue markers (cue phrases).
• An algorithm is then used to train the summarizer to extract the important segments.
The sentence segments are represented by a structured set of features, including
rhetorical relations. In a complex sentence with two clauses, the main segment is
called the nucleus; its subordinate segment is called the satellite and is connected
to the main segment by some kind of rhetorical relation. A rhetorical relation
r(name, satellite, nucleus) indicates that a relation of type name holds between the
satellite and the nucleus segments.

2.3 Lexical Chains for Text Summarization

The use of lexical chains for text summarization, suggested by Barzilay (1997),
Lei (2007) and Chan (2005), identifies lexical chains in a given text and extracts
sentences to form summaries. A lexical chain is a set of semantically related words in
a text; the relations between words are taken from the WordNet database. A procedure
for constructing lexical chains follows three steps:
• Select a set of candidate words.
• For each candidate word, find an appropriate chain.
• If one is found, insert the word in the chain and update the chain.
In the preprocessing step, all words that appear as a noun entry in WordNet are
chosen. Three kinds of relations are defined: (i) extra-strong (between a word and its
repetition), (ii) strong (between two words connected by a WordNet relation) and
(iii) medium-strong (when the link between the synsets of the words is longer than
one).
To obtain the summary of a given text, it is necessary to identify the strongest
chains among all those produced by the algorithm. Once the strongest chains are
selected, the next step of the summarization algorithm is to extract full sentences
from the original text. A chain has high concentration if its concentration is the
maximum of all chains.

2.4. The Pyramid Method

This approach, developed by Nenkova (2004, 2007), is based on Summarization Content
Units (SCUs):
• Since different people include different information when writing a summary, SCU
annotation highlights what people agree on.
• SCUs are sub-sentential content units, no bigger than a clause, taken from a corpus
of manually made summaries.
• An SCU consists of a label (a concise sentence that states the meaning of the
content unit).
• Given a certain number of human summaries, semantically equivalent snippets are
identified and grouped under a common label, defining an SCU.
• Each SCU has a weight corresponding to the number of summaries in which it appears.
• Each tier of the pyramid contains all and only the SCUs of the same weight.
• A pyramid has as many tiers as there are model summaries; with 4 model summaries,
the pyramid has 4 tiers.
• SCUs of weight 4 are at the top, as fewer SCUs are expressed in all 4 summaries.
Scoring a summary: a pyramid of order n is a pyramid with n tiers. The score of a
summary is the ratio of the sum of the weights of its SCUs to the sum of the weights
of an optimal summary with the same number of SCUs.
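
The scoring rule can be made concrete with a few lines of Python. This is a toy sketch
under the definition above; the function name and example weights are hypothetical,
not taken from Nenkova's implementation.

# A toy transcription of the pyramid scoring rule above (a sketch).
def pyramid_score(peer_scu_weights, pyramid_weights):
    """peer_scu_weights: weights of the SCUs found in the evaluated summary.
    pyramid_weights: weights of all SCUs in the pyramid."""
    n = len(peer_scu_weights)
    # An optimal summary with n SCUs contains the n highest-weight SCUs.
    optimal = sorted(pyramid_weights, reverse=True)[:n]
    return sum(peer_scu_weights) / sum(optimal)

# A pyramid built from 4 model summaries has tiers of weight 1..4.
pyramid = [4, 3, 3, 2, 2, 2, 1, 1, 1, 1]
print(pyramid_score([4, 2, 1], pyramid))      # 7 / (4+3+3) = 0.7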

2.5. Trainable Summarizer

The Trainable Summarizer of Kupiec (1995) and Jen-Yuan (2005) extracts sentences based
on the following discrete feature set for the given text: sentence-length cut-off
feature, fixed-phrase feature, paragraph feature, thematic-word feature and
uppercase-word feature.
For each sentence s, the probability of s being included in the summary S is
calculated from the k given features Fj, j = 1…k, which can be expressed using Bayes'
rule as follows:

P(s ∈ S | F1, F2, …, Fk) = P(F1, F2, …, Fk | s ∈ S) P(s ∈ S) / P(F1, F2, …, Fk)

Assuming statistical independence of the features:

P(s ∈ S | F1, F2, …, Fk) = [ ∏_{j=1..k} P(Fj | s ∈ S) ] P(s ∈ S) / ∏_{j=1..k} P(Fj)

P(s ∈ S) is a constant, and P(Fj | s ∈ S) and P(Fj) can be estimated directly from the
training set by counting occurrences. This yields a simple Bayesian classification
function that assigns each sentence s a score, which can be used to select sentences
for inclusion in the summary.
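
As an illustration, the following sketch scores one sentence by the logarithm of the
expression above, given binary features. The feature names and probability estimates
are hypothetical; in practice they would be counted from a training corpus as
described.

# A minimal sketch of the Bayesian sentence scorer with binary features.
import math

def bayes_score(features, p_f_given_s, p_f, p_s):
    """Log-probability score for one sentence (up to the constant normalizer).
    features: dict feature_name -> observed boolean value.
    p_f_given_s[j] = P(Fj = True | s in S); p_f[j] = P(Fj = True)."""
    score = math.log(p_s)
    for j, value in features.items():
        num = p_f_given_s[j] if value else 1.0 - p_f_given_s[j]
        den = p_f[j] if value else 1.0 - p_f[j]
        score += math.log(num) - math.log(den)
    return score

# Hypothetical estimates for three of the features named above.
p_f_given_s = {"length_cutoff": 0.9, "fixed_phrase": 0.3, "thematic_word": 0.6}
p_f         = {"length_cutoff": 0.7, "fixed_phrase": 0.1, "thematic_word": 0.4}
sentence_features = {"length_cutoff": True, "fixed_phrase": False,
                     "thematic_word": True}
print(bayes_score(sentence_features, p_f_given_s, p_f, p_s=0.2))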

3. Proposed Method

In this section we present a method for generating an extractive summary. The method
is based on Luhn's keyword cluster technique: sentences are selected by identifying
clusters of significantly related words. The proposed work is divided into two parts:
a. developing an algorithm for generating a summary by selecting significant
sentences;
b. implementing Luhn's keyword cluster method.
The algorithm for automatic text summarization by extracting significant sentences is
as follows:
Step 1: Preprocess the text.
Step 2: Remove stopwords.
Step 3: Perform stemming.
Step 4: Represent the text in the vector-space model (VSM).
Step 5: Identify the significant words:
a. sort all the words in descending order of frequency in the VSM;
b. calculate the threshold frequency ms for selecting significant words;
c. select the significant words whose frequency exceeds the threshold ms.
Step 6: Calculate the scores SS1, SS2 and SS3 by Luhn's keyword cluster method.
Step 7: Add the scores to give the sentence significance score SSS.
Step 8: Extract the top-scoring sentences to generate a summary.

3.1 Text Preprocessing

Text preprocessing consists of the following operations.

Lexical analysis of the text: its major objective is the identification of the words
in the text. Spaces are recognized as word separators, and multiple spaces are reduced
to single spaces. Numbers are usually not good index terms because, without a
surrounding context, they are inherently vague; hence they are discarded. Hyphens pose
a difficult decision for the lexical analyzer: breaking up hyphenated words might be
useful due to inconsistency of usage. Punctuation marks are removed entirely in the
process of lexical analysis; removing them does not seem to have any impact on
retrieval performance.
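
A minimal sketch of such a lexical analyzer (our illustration, not the paper's code):
punctuation is stripped, multiple spaces collapse, hyphenated words are kept whole,
and bare numbers are discarded.

# A minimal lexical analyzer along the lines described above (a sketch).
import re

def lexical_analysis(text):
    text = re.sub(r"[^\w\s-]", " ", text)          # remove punctuation marks
    tokens = text.split()                          # split on (multiple) spaces
    return [t for t in tokens if not t.isdigit()]  # drop stand-alone numbers

print(lexical_analysis("Summarization, since 1958,  is  well-studied."))
# -> ['Summarization', 'since', 'is', 'well-studied']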

3.2 Removal of Stopwords

Words which are too frequent among the documents in a collection are not good
discriminators: a word which occurs in 80% of the documents in a collection is useless
for retrieval purposes. Such words are referred to as stopwords and are normally
filtered out as potential index terms. Articles, prepositions and conjunctions are
natural candidates for a list of stopwords.
The size of the text is reduced by 40% or more by their elimination. A fixed stopword
list is constructed, and all words in the document matching these stopwords are
eliminated.
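
A sketch of the filtering step follows; the short stopword list below is purely
illustrative, and a production list would contain a few hundred entries.

# Stopword filtering against a fixed list (a sketch with a toy list).
STOPWORDS = {"a", "an", "the", "is", "are", "of", "in", "on", "and", "or",
             "to", "for", "which", "it", "this", "by", "with", "be", "as"}

def remove_stopwords(tokens):
    return [t for t in tokens if t.lower() not in STOPWORDS]

print(remove_stopwords(["The", "summary", "of", "a", "text"]))
# -> ['summary', 'text']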

3.3 Stemming

A stem is the portion of a word left after the removal of its affixes; e.g. 'connect'
is the stem of the variants connected, connecting, connection and connections.
Stemming improves retrieval performance because it reduces variants of the same root
word to a common concept. It also reduces the size of the indexing structure because
the number of distinct index terms is reduced. Porter's stemming algorithm is the most
popular stemming method.
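
For illustration, the Porter algorithm is available off the shelf, for example in
NLTK; a minimal sketch:

# Porter stemming via NLTK (requires `pip install nltk`).
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
print([stemmer.stem(w) for w in
       ["connected", "connecting", "connection", "connections"]])
# all four variants reduce to the common stem 'connect'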

3.4 Representation of text in vector-space model

The vector space model suggested by Salton (1975, 1983) is a very widely used data
representation model for the classification and clustering of text documents. The term
weights of the words determine the weight of each word in the vector; we have used
term frequency to assign weights.
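
For a single document, the term-frequency vector can be built in a couple of lines; a
sketch using Python's Counter:

# A term-frequency vector for one document (a sketch).
from collections import Counter

def term_frequency_vector(tokens):
    return Counter(tokens)                 # word -> raw term-frequency weight

vsm = term_frequency_vector(["text", "summary", "text", "text"])
print(vsm.most_common())                   # [('text', 3), ('summary', 1)]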

3.5 Identify the significant words

After creating a VSM of term frequencies, the terms are sorted in decreasing order, so
that the most significant terms are at the top.
The lower limit for significance then needs to be defined. Following the work of
Tombros (1997), the required minimum occurrence count for significant terms in a
medium-sized document was taken to be 7, where a medium-sized document is defined as
containing no more than 40 sentences and no fewer than 25. For documents outside this
range, the limit for significance is computed as

ms = 7 + [0.1(L - NS)] for documents with NS < 25

ms = 7 + [0.1(NS - L)] for documents with NS > 40

where ms = the measure of significance, i.e. the threshold for selecting significant
words;
L = limit (25 for NS < 25 and 40 for NS > 40);
NS = number of sentences in the document.

A word is classified as significant if it is not a stopword and its within-document
term frequency is larger than the threshold ms.
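
A direct transcription of this rule into Python (a sketch; the function names are
ours). With the frequencies of Section 4 it reproduces the threshold 7.4 computed
there.

# The significance threshold ms from Tombros (1997), as used above.
def significance_threshold(ns):
    """ns: number of sentences in the document."""
    if ns < 25:
        return 7 + 0.1 * (25 - ns)         # short document
    if ns > 40:
        return 7 + 0.1 * (ns - 40)         # long document
    return 7                               # medium-sized document

def significant_words(frequencies, ns):
    ms = significance_threshold(ns)
    return {word for word, freq in frequencies.items() if freq > ms}

print(significance_threshold(21))          # 7.4, as in the worked example below
print(significant_words({"text": 26, "summary": 20, "words": 7}, 21))
# -> {'text', 'summary'}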

3.6 Sentence Significance Score by Luhn’s Keyword Cluster Method

Luhn (1958) and Lam-Adesina and Jones (2001) reasoned that the closer certain words
are to each other, the more specifically an aspect of the subject is being treated.
Two significant words are considered significantly related if they are separated by
not more than five non-significant words, e.g.
'The sentence [scoring process utilizes information both from the structural]
organization.'

3.6.1 Cluster significance method

The cluster significance score factor for a sentence is given by the following
formula:

SS1 = SW^2 / TW (1)

where SS1 = the sentence score;
SW = the number of bracketed significant words;
TW = the total number of bracketed words.
If two or more clusters of significant words appear in a given sentence, the one with
the highest score is chosen as the sentence score.
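
A sketch of cluster detection and scoring under the reading above: a cluster is a
maximal span that starts and ends with significant words, with at most five
insignificant words between consecutive significant ones. Applied to the bracketed
example of Section 3.6 (assuming 'scoring', 'information' and 'structural' are the
significant words), it gives SS1 = 3^2 / 8 = 1.125.

# Luhn keyword cluster scoring (a sketch; names are ours).
def cluster_score(tokens, significant, max_gap=5):
    best, i, n = 0.0, 0, len(tokens)
    while i < n:
        if tokens[i] in significant:
            start = end = i                      # open a cluster here
            gap, j = 0, i + 1
            while j < n and gap <= max_gap:
                if tokens[j] in significant:
                    end, gap = j, 0              # extend cluster, reset gap
                else:
                    gap += 1
                j += 1
            span = tokens[start:end + 1]
            sw = sum(1 for t in span if t in significant)
            best = max(best, sw * sw / len(span))   # SS1 = SW^2 / TW
            i = end + 1                          # continue after this cluster
        else:
            i += 1
    return best

sentence = ("the sentence scoring process utilizes information both from "
            "the structural organization").split()
print(cluster_score(sentence, {"scoring", "information", "structural"}))  # 1.125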

3.6.2 Title term frequency method

The title of an article often reveals the major subject of that document. This
hypothesis was examined on TREC documents, where the title of each article was found
to convey the general idea of its contents. We follow the approach of Lam-Adesina and
Jones (2001): to utilize this attribute in scoring sentences, each constituent term in
the title is looked up in the body of the text. For each sentence a title score is
computed as follows:

SS2 = TTS / TTT (2)

where SS2 = the title score for a sentence;
TTS = the total number of title terms found in the sentence;
TTT = the total number of terms in the title.
TTT is used as a normalization factor to ensure that this method does not make an
excessive contribution to the overall sentence score.
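
A transcription of equation (2) into Python (a sketch; the names are ours):

# Title-term scoring: SS2 = TTS / TTT.
def title_score(sentence_tokens, title_tokens):
    title = set(title_tokens)
    tts = sum(1 for t in sentence_tokens if t in title)   # title terms found
    return tts / len(title_tokens)                        # normalized by TTT

print(title_score("a summary of the original text".split(),
                  "automatic text summarization".split()))
# -> 0.333..., since only 'text' of the three title terms occurs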

3.6.3 Location / Header Method

Edmundson (1969) noted that the position of a sentence within a document is useful in
determining its importance to the document. The first sentences of a document often
provide important information about the content of the document. Thus the first two
sentences of an article are assigned a location score computed as:

SS3 = 1 / NS (3)

where SS3 = the location score for a sentence;
NS = the number of sentences in the document.
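
A transcription of equation (3) (a sketch): only the first two sentences receive a
non-zero score.

# Location scoring: the first two sentences get SS3 = 1/NS, the rest 0.
def location_score(index, ns):
    """index: 0-based position of the sentence; ns: sentence count."""
    return 1.0 / ns if index < 2 else 0.0

print(round(location_score(0, 21), 2))     # 0.05, as in Table III below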

3.6.4 Combining the Scores

The final score for each sentence is calculated by summing the individual score
factors obtained from each method used:

SSS = SS1 + SS2 + SS3 (4)

where SSS = the sentence significance score.

The summarization system was implemented such that each method could be invoked
independently. Thus it was possible to experiment with various combinations of the
methods to determine the best summarization approach.

3.7 Extraction of sentences to generate a summary

The objective of the summary generation system is to generate a stand-alone summary
that can be used in place of the entire document. The lower limit of the summary
length was set at 25% of the original length. Accordingly, the threshold number of
significant sentences to extract is calculated as

TSS = NS * 0.25 (5)

where TSS = the number of significant sentences, rounded up to the nearest whole
number; its value may vary depending upon the chosen percentage. The TSS sentences
having the maximum SSS values are extracted.
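
A sketch of this final selection step: TSS is rounded up, as in the worked example of
Section 4, and the chosen sentences are emitted in their original document order, as
in Table IV.

# Selecting the TSS highest-scoring sentences (a sketch; names are ours).
import math

def extract_summary(sentences, sss, ratio=0.25):
    tss = math.ceil(len(sentences) * ratio)            # number to extract
    top = sorted(range(len(sentences)), key=lambda i: sss[i],
                 reverse=True)[:tss]                   # indices of best SSS
    return [sentences[i] for i in sorted(top)]         # restore text order

sentences = [f"sentence {i}" for i in range(8)]
sss = [2.05, 2.05, 0.0, 2.05, 0.0, 2.0, 0.67, 1.0]
print(extract_summary(sentences, sss))                 # ceil(8*0.25) = 2 best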
Based on the above algorithm, we now present our experiment and results.

4. Experiment and Results

Based on the algorithm proposed in Section 3, we now present the results of our
experiments. Table I shows the input file for which the summary is generated.

Table I
The contents of the input file

Automatic text summarization is the technique, where a computer summarizes a text.
A text is entered into the computer and a summarized text is returned, which is a non
redundant extract from the original text. The technique has its roots in the 60s and
has been developed during 30 years, but today with the Internet and the WWW the
technique has become more important.
Automatic text summarization can be used:
To summarize news to SMS or WAP-format for mobile phones/PDA.
To let a computer synthetically read the summarized text. Written text can be too
long and boring to listen to.
In search engines to present compressed descriptions of the search results (see the
Internet search engine Google).
In keyword directed subscription of news which are summarized and pushed to the
user (see) Nyhetsguiden (In Swedish)
To search in foreign languages and obtain an automatically translated summary of
the automatically summarized text.
Microsoft Word has since 1997 a summarizer for documents. (See under Tools
where you can find Summary.) (SweSum is the first automatic text summarizer for
Swedish.) It summarizes Swedish news text in HTML/text format on the WWW.
During the summarization 5-10 key words - a mini summary is produced. Accuracy
84% at 40% summary of news with an average original length of 181 words.
Automatic text summarization is based on statistical, linguistical and heuristic
methods where the summarization system calculates how often certain key words
appear (the Swedish system has 700 000 possible Swedish entries pointing at 40 000
Swedish base key words). The key words belong to the so called open class words.
The summarization system calculates the frequency of the key words in the text,
which sentences they are present in, and where these sentences are in the text. It
considers if the text is tagged with bold text tag, first paragraph tag or numerical
values. All this information is compiled and used to summarize the original text.
SweSum is also available for Danish, Norwegian, English, Spanish, French, Italian,
Greek, Farsi (Persian) and German texts.

Table II is generated after performing Steps 1 to 4 of the proposed algorithm. It
shows the frequency of the words in the document in descending order.

Table II
Vector Space Model (term frequencies, in descending order)

text 26, summary 20, words 7, Swedish 6, German 6, Automatic 6, key 5, original 5,
news 4, search 4, SweSum 4, English 4, Spanish 4, French 4, Italian 4, Greek 4,
Farsi 4, technique 3, computer 3, system 3, Danish 3, Norwegian 3, Persian 3,
Internet 2, roots 2, tag 2, WWW 2, present 2, calculates 2, sentences 2.

The remaining terms each occur once: bold, format, mobile, phones, PDA, synthetically,
read, Written, boring, listen, engines, compressed, descriptions, results, engine,
Google, keyword, directed, subscription, tagged, foreign, languages, translated,
Microsoft, Word, document, Tools, find, HTML, format, mini, produced, Accuracy,
average, length, based, heuristic, methods, entries, pointing, base, belong, called,
open, considers, entered, returned, non, redundant, extract, redundant, compiled,
information, paragraph, numerical, values, developed, today, important, SMS, WAP,
pushed, user, Nyhetsguiden, available, directed, class, frequency.

According to Section 3.5, the lower limit for significance is computed with L = 25 and
NS = 21:
ms = 7 + [0.1(L - NS)] = 7 + [0.1(25 - 21)] = 7.4
The significant words with frequency greater than this threshold (7.4) are 'text' and
'summary'.
Table III shows the scores SS1, SS2, SS3 and SSS for each sentence, as explained in
Section 3.6.

Table III
The scores SS1, SS2, SS3 and SSS for each sentence (clusters of significant words are
shown in square brackets; scores are listed in the order SS1, SS2, SS3, SSS)

1. Automatic [text summarization] is the technique, where a computer summarizes a
text. 2.00, 0.00, 0.05, 2.05
2. A text is entered into the computer and a [summarized text] is returned, which is a
non redundant extract from the original text. 2.00, 0.00, 0.05, 2.05
3. The technique has its roots in the 60s and has been developed during 30 years, but
today with the Internet and the WWW the technique has become more important.
0.00, 0.00, 0.00, 0.00
4. Automatic [text summarization] can be used: 2.00, 0.00, 0.05, 2.05
5. To summarize news to SMS or WAP-format for mobile phones/PDA.
0.00, 0.00, 0.00, 0.00
6. To let a computer synthetically read the [summarized text]. 2.00, 0.00, 0.00, 2.00
7. Written text can be too long and boring to listen to. 0.00, 0.00, 0.00, 0.00
8. In search engines to present compressed descriptions of the search results (see the
Internet search engine Google). 0.00, 0.00, 0.00, 0.00
9. In keyword directed subscription of news which are summarized and pushed to the
user (see) Nyhetsguiden (In Swedish). 0.00, 0.00, 0.00, 0.00
10. To search in foreign languages and obtain an automatically translated [summary of
the automatically summarized text]. 1.50, 0.00, 0.00, 1.50
11. Microsoft Word has since 1997 a summarizer for documents. (See under Tools where
you can find Summary.) 0.00, 0.00, 0.00, 0.00
12. (SweSum is the first automatic [text summarizer] for Swedish.)
2.00, 0.00, 0.00, 2.00
13. It [summarizes Swedish news text] in HTML/text format on the WWW.
1.00, 0.00, 0.00, 1.00
14. During the [summarization 5-10 key words - a mini summary] is produced.
0.67, 0.00, 0.00, 0.67
15. Accuracy 84% at 40% summary of news with an average original length of 181 words.
0.00, 0.00, 0.00, 0.00
16. Automatic [text summarization] is based on statistical, linguistical and heuristic
methods where the summarization system calculates how often certain key words appear.
2.00, 0.00, 0.00, 2.00
17. The key words belong to the so called open class words. 0.00, 0.00, 0.00, 0.00
18. The summarization system calculates the frequency of the key words in the text,
which sentences they are present in, and where these sentences are in the text.
0.00, 0.00, 0.00, 0.00
19. It considers if the [text is tagged with bold text] tag, first paragraph tag or
numerical values. 0.67, 0.00, 0.00, 0.67
20. All this information is compiled and used to [summarize the original text].
1.00, 0.00, 0.00, 1.00
21. SweSum is also available for Danish, Norwegian, English, Spanish, French, Italian,
Greek, Farsi (Persian) and German texts. 0.00, 0.00, 0.00, 0.00

TSS = 21 * 0.25 = 5.25, as explained in Section 3.7. Rounding up, the six sentences
having the maximum SSS values are extracted as the summary, as shown in Table IV.

Table IV
Summary of input text generated by extracting significant sentences

Automatic text summarization is the technique, where a computer summarizes a
text. A text is entered into the computer and a summarized text is returned, which is
a non redundant extract from the original text. Automatic text summarization can be
used: To let a computer synthetically read the summarized text. SweSum is the first
automatic text summarizer for Swedish. Automatic text summarization is based on
statistical, linguistical and heuristic methods where the summarization system
calculates how often certain key words appear.

It is quite apparent that the summary obtained in Table IV by extracting significant
sentences is sensible, meaningful and consistent. The generated summary is related to
the main topic of the document, has a good ordering of sentences, and is cohesive.
Microsoft Word's AutoSummarize tool can be taken as a standard summarizer for text
documents. The AutoSummarize feature automatically summarizes a document to identify
its key points: it analyzes words and sentences and assigns them a score, and more
frequently used words receive a higher score, identifying them as key points. The
AutoSummarize tool has been used to generate a summary from the input text for
comparison (Table V).

Table V
Summary generated by the Word tool AutoSummarize

Automatic text summarization is the technique, where a computer summarizes a text.
Automatic text summarization can be used: To let a computer synthetically read the
summarized text. SweSum is the first automatic text summarizer for Swedish. It
summarizes Swedish news text in HTML/text format on the WWW.
During the summarization 5-10 key words - a mini summary is produced. It considers
if the text is tagged with bold text tag, first paragraph tag or numerical values.
The summary generated by AutoSummarize contains almost the same sentences as the
summary generated by Luhn's keyword cluster method of extracting significant
sentences. This shows that automatic extracts can be defined, programmed and produced
in an operational system to supplement, and perhaps compete with, traditional ones.

5. Conclusion

In this paper, we developed an algorithm to extract significant sentences from a
single document. The main advantages of our method are its simplicity, requiring no
corpus, and its high performance, comparable to tf-idf-based methods. As more
electronic documents become available, we believe this method will be useful in many
applications. The method can be extended by considering additional scores.

Further, the authors plan to extend this work by applying machine learning techniques
to generate automatic summaries.

References

• Sparck Jones K (1971) Automatic keyword classification for retrieval, Butterworth,
London.
• Kupiec J, Pedersen J and Chen F (1995) A trainable document summarizer, Proceedings
of the 18th International Conference on Research in Information Retrieval (SIGIR '95),
pp. 55-60.
• Barzilay R and Elhadad M (1997) Using lexical chains for text summarization,
Intelligent Scalable Text Summarization Workshop, ACL.
• Chuang W T and Yang J (2000) Extracting sentence segments for text summarization: a
machine learning approach, Proceedings of the 23rd International Conference on
Research in Information Retrieval (SIGIR '00), pp. 152-159.
• Salton G and McGill M J (1983) Introduction to Modern Information Retrieval,
McGraw-Hill, New York.
• Salton G, Wong A and Yang C S (1975) A vector space model for automatic indexing,
Communications of the ACM.
• Chen K, Huang S J, Lin W C and Chen H H (1998) An NTU approach to automatic sentence
extraction for summary generation, Language & Information Processing System Lab.
• Tombros A (1997) Reflecting user information needs through query biased summaries,
MSc thesis in Advanced Information Systems, University of Glasgow.
• Edmundson H P (1969) New methods in automatic extracting, Journal of the ACM, 16(2),
pp. 264-285.
• Luhn H P (1958) The automatic creation of literature abstracts, IBM Journal of
Research and Development, 2, pp. 159-165.
• Lam-Adesina A M and Jones G J F (2001) Applying summarization techniques for term
selection in relevance feedback, Proceedings of the 24th Annual International ACM
SIGIR Conference on Research and Development in Information Retrieval, pp. 1-9,
New Orleans, Louisiana, United States.
• Jing H and McKeown K (2000) Cut and paste based text summarization, Proceedings of
the 1st Meeting of the North American Chapter of the Association for Computational
Linguistics, pp. 178-185.
• Nenkova A and Passonneau R (2004) Evaluating content selection in summarization: the
pyramid method, HLT/NAACL.
• Nenkova A, Passonneau R and McKeown K (2007) The pyramid method: incorporating human
content selection variation in summarization evaluation, ACM Transactions on Speech
and Language Processing (TSLP).
• Lei Y, Jia M, Fuji R and Kuroiwa S (2007) Automatic text summarization based on
lexical chains and structural features.
• Jen-Yuan Y, Hao-Ren K, Wei-Pang Y and Heng-Meng I (2005) Text summarization using a
trainable summarizer and latent semantic analysis, Information Processing and
Management, 41(1), pp. 75-95.
• Chan Y, Wang X and Guan Y (2005) Automatic text summarization based on lexical
chains, LNCS 3610, pp. 947-951.
