
Survey on Text Summarization

Amit Vhatkar, Pushpak Bhattacharyya, Kavi Arya


Indian Institute of Technology, Bombay
{asvhatkar, pb, kavi}@cse.iitb.ac.in

Abstract

Automatic text summarization is considered one of the hard problems because there is no exact computational way of evaluating a summary, whereas a human can distinguish between a good summary and a bad one. Summaries can also be of various types, for example abstractive, where new words and phrases are used, unlike extractive summarization, where the top-scoring sentences from the input text are extracted as summary sentences. Traditionally, the focus of researchers was on building natural language generation, which requires proper planning and realization of language. Various machine learning approaches based on sequence labelling and SVR have been applied to extract summary sentences from the input text. Nowadays, deep neural network models like sequence-to-sequence, LSTM and pointer-generator models are being implemented to generate summaries. This report gives a brief idea of the types of summary, summary evaluation measures and various ways to obtain a summary.

Main terms - Automatic text summarization, machine learning, abstractive, extractive, deep neural networks

Type of Summary | Factors
Single and Multi-document | Number of documents
Extractive and Abstractive | Output (whether extracts or an abstract is required)
Generic and Query-focused | Purpose (whether general or query-related data is required)
Supervised and Unsupervised | Availability of training data
Mono-, Multi- and Cross-lingual | Language
Personalized | Information specific to the user's need
Sentiment-based | Opinions are detected
Update | Current updates regarding a topic
E-mail based | For summarizing e-mails
Web-based | For summarizing web pages

Table 1: Various Types of Summarization Techniques (Gambhir and Gupta, 2017)

1 Introduction

Document summarization has been studied in the NLP research community for more than three decades. As the number of documents and the amount of available textual information increase day by day due to the advancement of the Internet, obtaining precise information becomes a challenging task. While acquiring information from a large set of documents, a user may choose to skip some documents or topics, which may lead to the loss of some important points. To avoid such loss, representing document(s) in condensed form without loss of content and with negligible repetition is an important task.

1.1 Types of Summarization

Broadly, summarization approaches are categorized as abstractive and extractive. In the extractive type of summarization, sentences from the input texts are presented as-is as part of the summary, whereas in abstractive summarization new sentences depicting the gist of a topic are formed. Based on the number of documents, summarization approaches are classified as single-document and multi-document: when only one document is used to generate a condensed form of the text it is termed single-document summarization, and when more than one document is searched for the desired information it is termed multi-document summarization.

The purpose of the summary leads to generic and query-focused summarization. In the generic type of summarization, the entire document(s) is searched for various information contents, unlike query-focused summarization, where the document(s) are searched only for the topic mentioned in the query. The task of summarization can also be applied to detect sentiment in a document; this type of summarization is called sentiment-based summarization. In the update type of summary, it is assumed that the user is aware of the basic information related to the topic and only needs to know recent updates regarding it.

If the language of the generated summary is the same as that of the input document(s), it is called mono-lingual summarization; when the language of the summary differs from that of the input document(s), it is called multi-lingual summarization. Sometimes the nature of the summarization varies based on the profile of the user; this is termed personalized summarization. Apart from these, there are web-based and e-mail based types of summarization, as shown in table 1.

1.2 Summary Evaluation Techniques

Automatic generation of summaries is a hard task since we do not know which part of the information should contribute to the summary. The varying perspectives on what a summary should contain make it hard to evaluate an automatically generated summary, even for a trained human: someone may see a certain point as important while others may consider the same point less important. The purpose of the summary can help in evaluating an automatically generated summary. As described in the survey paper (Gambhir and Gupta, 2017), evaluation of summaries can be broadly categorized as follows.

Figure 1: Summary Evaluation Techniques (Gambhir and Gupta, 2017)

1.2.1 Extrinsic Evaluation

There are various tasks that a summary of a text can support. In the extrinsic type of evaluation, the summary gets tested for its usefulness to these supporting tasks; this type of evaluation is sometimes termed task-based evaluation. Extrinsic evaluation is further categorised as follows:

1. Relevance assessment: The generated summary is tested for relevance to the topic. This method is mostly used for topic/query-focused types of summarization.

2. Reading comprehension: This tests whether the generated summary can be used to answer multiple-choice tests.

1.2.2 Intrinsic Evaluation

Generally, reference summaries are used to evaluate the generated summary, mostly on the basis of informativeness and coverage. Measuring the relevance of a summary to the input document(s) runs into the issue of finding a relevant topic in the document(s), as relevance has no rigid definition. As shown in figure 1, intrinsic evaluation is categorised as follows:

1. Quality evaluation: The quality of the text in the summary is checked on the basis of linguistic parameters like grammaticality, structure and coherence, vocabulary, non-redundancy etc.

2. Informativeness evaluation: This is the most used type of summary evaluation technique. There are two ways in which the informativeness of a summary is evaluated:

Automatic: does not need human annotation
Semi-automatic: needs human annotation

The following section explains some of the intrinsic informativeness evaluation techniques.

• ROUGE
ROUGE (Recall-Oriented Understudy for Gisting Evaluation) makes use of reference summaries for evaluation. It looks for co-occurrences of n-grams of various lengths in the generated summary and the reference summary. Five different ROUGE metrics are available:

ROUGE-N: checks for overlap of n-grams
ROUGE-L: checks for the longest common sub-sequence (LCS)
ROUGE-W: weighted LCS, favouring the longest LCS
ROUGE-S: skip-bigram based co-occurrence check
ROUGE-SU: co-occurrence check based on skip-bigrams together with unigrams

• BLEU (Bilingual Evaluation Understudy)
It is a modified form of precision. The modification concerns the overlap between the candidate summary and the reference summaries: the count of each word in the candidate summary is clipped to the maximum count of that word across the reference summaries. It can be written as an equation as follows:

P = m_max / w_t   (1)

where m_max is the clipped (maximum) number of occurrences of a word across all reference summaries and w_t is the total number of words present in the generated summary.

• Basic Element (BE)
Sentences are expressed in the form of triplets of three elements, namely head, modifier/argument, and the relation between head and modifier. These triplets are then mapped against various equivalence expressions.

• DEPEVAL
This evaluation method is similar to the BE method, except that general dependency parsers are used instead of Minipar as in BE. Dependency triplets (head | modifier | relation) from the automatically generated text are checked against the ones from the reference summaries.

• Pyramid Method
It is a semi-automatic intrinsic informativeness evaluation method which makes use of the notion of a Summary Content Unit (SCU), which is a set of sentences with a similar quotient of informativeness. SCUs generated as part of the summary which are similar to various human-written SCUs get higher weight.

1.3 Outline

The rest of the document is organised according to the chronology of the approaches applied and suggested by the community to solve text summarization. Section 2 presents machine learning based approaches, which are further categorised into sequence labelling tasks and statistical approaches. Section 3 covers recent summarization approaches, from sequence-to-sequence RNNs (Nallapati et al., 2016) to attention models (Rush et al., 2015). This section is followed by the conclusion and future work.

2 Machine Learning Based Summarization Approaches

Machine learning is broadly used to perform two types of tasks, namely classification and regression. In a classification task, the class of a given input is decided based on the similarity of its features to the features represented by the classes. The nature of a regression task is to predict a value which is a function of the features of the given input. Molding machine learning approaches to generate text summaries is an interesting area of research, where summarization is posed either as a sequence labelling task (Shen et al., 2007), in which the labels of other sentences are also considered, or as a general labelling task, in which approaches like SVM-based ensembles (Chali et al., 2009) or SVR-based ranking algorithms (Li et al., 2007) are used to decide the rank of a sentence depending on its features. Also, various statistical approaches like the graph-based ranking approach (Mihalcea, 2004), the manifold-ranking based approach (Wan et al., 2007) and the discourse structure based approach (Marcu, 1997) can be applied to obtain summary sentences from the document(s).

Systems | R-1 | R-L | R-W | R-SU
Baseline | 0.3347 | 0.3107 | 0.1138 | 0.1127
Single | 0.3708 | 0.3035 | 0.1113 | 0.1329
Ensemble | 0.3883 | 0.3197 | 0.1177 | 0.1463

Table 2: Result of SVM-Based Ensemble Approach to Multi-Document Summarization (Chali et al., 2009)
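The n-gram overlap behind ROUGE-N and the clipped-count precision of equation 1 can be sketched as follows. This is a minimal illustration on toy token lists, not the official scoring toolkits:

```python
from collections import Counter

def ngrams(tokens, n):
    """Multiset of n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n(candidate, reference, n=1):
    """ROUGE-N: recall of reference n-grams found in the candidate,
    with co-occurrence counts clipped via multiset intersection."""
    cand, ref = ngrams(candidate, n), ngrams(reference, n)
    return sum((cand & ref).values()) / max(sum(ref.values()), 1)

def bleu_precision(candidate, references):
    """Modified precision P = m_max / w_t of equation (1): each candidate
    word's count is clipped by its maximum count over the references."""
    cand = Counter(candidate)
    m_max = sum(min(c, max(Counter(r)[w] for r in references))
                for w, c in cand.items())
    return m_max / max(len(candidate), 1)

ref = "the cat sat on the mat".split()
cand = "the cat lay on the mat".split()
score = rouge_n(cand, ref, 1)  # 5 of the 6 reference unigrams are matched
```

Both measures clip repeated words, so a candidate cannot inflate its score by repeating a frequent reference word.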
2.1 Summarization as a Labelling Task

2.1.1 Labelling Using SVM and SVR

• SVM-Based Ensemble Approach to Multi-Document Summarization (Chali et al., 2009)
This is a topic-focused extractive multi-document text summarization approach, targeted at the Document Understanding Conference (DUC) 2007.

Problem definition (Chali et al., 2009): Given a complex question and a collection of relevant documents, the task is to synthesize a fluent, well-organized 250-word summary of the documents that answers the question(s) in the topic.

Features: Query-related and other important features like n-gram overlap, LCS, WLCS, skip-bigram, gloss overlap, BE overlap, sentence length, sentence position, named entities (NE), cue word match, title match etc. are extracted for each sentence.

SVM ensemble: Using cross-validation with 25% of the data held out and 75% used for training, 4 different SVM models are trained with the above-mentioned features. An ensemble of these 4 classifiers is used to decide the rank of each sentence; the top N sentences are labelled as summary sentences and the others as non-summary sentences.

Result: The authors compare their approach with a baseline which selects lead sentences and with a single-SVM approach on various ROUGE measures; the results are given in table 2.

• Multi-document Summarization Using Support Vector Regression (Li et al., 2007)
This is a multi-document extractive summarization approach which makes use of the documents made available by DUC-2006 for training. The DUC-2006 data set contains 50 topics, each having 25 news documents and 4 reference summaries per topic. The proposed system (Li et al., 2007) has three steps: text preprocessing, sentence scoring and post-processing. Preprocessing of the text involves segmentation into sentences and removal of news headers from the documents (DUC-2006's data comprises news articles). Sentence scoring makes use of various features of sentences, like word-based features, phrase-based NE features, a semantic WordNet-based feature, a centroid feature, an NE-number feature, sentence position etc. SVR uses a combination of these features to generate the sentence ranking.

Hypothesis (Li et al., 2007): The more similar a sentence is to the four reference summaries, the larger its score must be. The authors came up with two strategies to score a sentence based on its similarity to the reference summaries. Let s be the sentence under consideration, sim be the similarity function and S_i be the i-th summary document.

Average: the final score of the sentence is obtained by averaging its similarity scores with the reference summaries:

Score(s) = Σ_i sim(s, S_i)

Maximum: the final score of the sentence is the maximum score over all reference summaries:

Score(s) = max_i sim(s, S_i)

After obtaining scores for each sentence, all sentences are represented as feature vectors along with their similarity scores, i.e. D = {V_s, score(s)}. Finally, the regression function is learned by the SVR model. The accuracy of this method is given in table 3, where the baseline system trains SVR with the same set of features but with a manual assignment of weights, and the best submitted system is the one which performed best in DUC-2006.

System | Rouge-2
Best submitted system | 0.09558
SVR-based system | 0.09057
Baseline system | 0.08012

Table 3: Performance of SVR-based Summarization Technique (Li et al., 2007)

2.2 Other Statistical Approaches

There are methods which make use of statistics for ranking sentences. Some of them use a graph-based approach (Mihalcea, 2004) while others use manifold ranking (Wan et al., 2007). Historically, some approaches have also considered the use of discourse structure (Marcu, 1997) for summary generation. This subsection explains some of these statistics-based summary generation algorithms in detail.

2.2.1 Graph-based Ranking Algorithm for Sentence Extraction

Traditionally, graph-based approaches are used for analysing the link structure of the World Wide Web, as well as citation and social networks. The approach suggested in the paper (Mihalcea, 2004) goes one step further and makes use of graph structure for extracting important sentences from documents. The approach can be categorised as unsupervised extractive multi-document summarization; the authors have named it TextRank.

To have common notation throughout the discussion of this approach, consider a graph G = (V, E), where V is the set of vertices and E is the set of edges connecting those vertices. Let In(V_i) be the set of nodes pointing to node V_i and Out(V_i) be the set of nodes to which node V_i points. In the case of an undirected graph, In(V_i) is the same as Out(V_i).

• Introduction
The summarization approach described in the paper (Mihalcea, 2004) makes use of modified versions of the following traditional graph-based algorithms.

HITS (Hyperlink-Induced Topic Search)
This algorithm was proposed for ranking web pages. It generates two values for each node in the graph, namely an authority score HITS_A(V_i) and a hub score HITS_H(V_i). The authority value reflects the number of incoming links, whereas the hub value reflects the number of outgoing links. These values are formulated as follows:

HITS_A(V_i) = Σ_{V_j ∈ In(V_i)} HITS_H(V_j)   (2)

HITS_H(V_i) = Σ_{V_j ∈ Out(V_i)} HITS_A(V_j)   (3)

Positional Power Function
The positional power function (POS_P) takes the number of successors of a node and their scores into consideration when calculating the score of a specific node. Conversely, the positional weakness function (POS_W) determines the score of a node based on its ancestors and their scores.

POS_P(V_i) = (1/|V|) Σ_{V_j ∈ Out(V_i)} (1 + POS_P(V_j))   (4)

POS_W(V_i) = (1/|V|) Σ_{V_j ∈ In(V_i)} (1 + POS_W(V_j))   (5)
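The mutually recursive updates of equations (2) and (3) are usually computed by power iteration with a normalization step. Below is a toy sketch; the graph, node names and sum-to-one normalization are our own illustrative choices, not taken from (Mihalcea, 2004):

```python
def hits(out_links, iters=50):
    """Iterate equations (2) and (3): authority sums the hub scores of
    in-neighbours, hub sums the authority scores of out-neighbours."""
    nodes = list(out_links)
    in_links = {v: [u for u in nodes if v in out_links[u]] for v in nodes}
    auth = {v: 1.0 for v in nodes}
    hub = {v: 1.0 for v in nodes}
    for _ in range(iters):
        auth = {v: sum(hub[u] for u in in_links[v]) for v in nodes}
        hub = {v: sum(auth[w] for w in out_links[v]) for v in nodes}
        na, nh = sum(auth.values()) or 1.0, sum(hub.values()) or 1.0
        auth = {v: s / na for v, s in auth.items()}   # keep scores bounded
        hub = {v: s / nh for v, s in hub.items()}
    return auth, hub

# toy directed graph: every other node links to "c"
graph = {"a": ["b", "c"], "b": ["c"], "c": [], "d": ["c"]}
auth, hub = hits(graph)
# "c" gets the highest authority; "a", which reaches two nodes, the highest hub
```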
PageRank Algorithm
This graph-based algorithm was designed to analyse web links. It gives a single score value to each node, considering outgoing and incoming links in a single equation:

PR(V_i) = (1 − d) + d · Σ_{V_j ∈ In(V_i)} PR(V_j) / |Out(V_j)|   (6)

where d ∈ [0, 1] and is usually set to 0.85.

These algorithms execute iteratively, starting from a random node with random initialization of the values, until convergence. A threshold value has to be chosen for convergence: if the difference between the values of the current iteration and the previous iteration is less than the threshold, the algorithm is considered to have converged. It has also been observed that the starting node does not affect the values obtained after convergence, only the number of iterations.

• Weighted Graphs
In the TextRank approach, nodes are sentences, and there can be multiple partial links between nodes. To capture the importance of these partial links, each link gets a weight assigned to it. This changes the above-mentioned formulas so as to incorporate the notion of link weights. The final score values after taking weights into account differ from those of the unweighted formulation, but the shape of the convergence curve and the number of required iterations remain almost the same.

• Sentence Extraction
The first step of TextRank is building a graph, with nodes representing sentences and links between nodes depicting similarity, i.e. content overlap, between nodes. The author hypothesizes that sentences that address a certain concept in the text give the reader a 'recommendation' to refer to other sentences that address the same concept. Given two sentences S_i and S_j, each formed from a set of N_i words, similarity is defined as in equation 7; the denominator serves as normalization, which avoids giving more weight to long sentences.

Similarity(S_i, S_j) = |{W_k : W_k ∈ S_i and W_k ∈ S_j}| / (log|S_i| + log|S_j|)   (7)

In the second step, the graph can be represented in one of the following ways:

Un-directed
Directed forward: the orientation of the edges follows the order of the text.
Directed backward: the orientation of the edges is exactly opposite to the flow of the sentence text.

Finally, after running the ranking algorithm, sentences are sorted by score and the top N scoring sentences are taken as summary sentences.

• Evaluation and Discussion
The authors measured the performance of TextRank on the DUC-2002 dataset, with ROUGE as the evaluation measure. The results are shown in table 4. TextRank gives its best results when directed graphs with HITS are used for sentence ranking.

 | Un-directed | Dir. forward | Dir. backward
HITS_A^W | 0.4912 | 0.4584 | 0.5023
HITS_H^W | 0.4912 | 0.5023 | 0.4584
POS_P^W | 0.4878 | 0.4538 | 0.3910
POS_W^W | 0.4878 | 0.3910 | 0.4538
PageRank | 0.4904 | 0.4202 | 0.5008

Table 4: Result of Graph-Based Ranking Algorithm for Text Summarization Approach (Mihalcea, 2004)

The author mentions that, since a score is calculated for each sentence, this method can be extended to generate longer summaries, and that it makes use of information drawn from the text only, which makes it fall into the category of unsupervised algorithms.
2.2.2 Manifold-Ranking Based Topic-Focused Multi-Document Summarization

This is a topic-based multi-document extractive text summarization approach which makes use of the dependencies between the sentences of the documents and between the sentences and the topic. The summary generated by this approach tends to have a high bias towards the topic and little redundancy. The paper (Wan et al., 2007) states that topic-focused summaries should keep the information mentioned in the documents, should make the information as novel as possible and, importantly, should be biased towards the topic. The rest of this section describes how the proposed approach tries to achieve these properties.

• Overview
The manifold-ranking based approach mainly comprises two steps: in the first step, a manifold-ranking score is computed for each sentence, and in the second step a diversity penalty is imposed on the scores calculated in the previous step. In order to score the sentences, the paper (Wan et al., 2007) suggests assigning weights to inter-document and intra-document connections between sentences.

As described above, to maintain the properties of a good summary, the proposed approach makes use of biased information richness and information novelty; these terminologies are defined in detail in the paper (Wan et al., 2007). In the final step, after imposing the penalty for obtaining diversity, the sentences with the top scores are taken as summary sentences.

• Manifold-Ranking Process
The manifold-ranking process assumes that nearby points are likely to have the same ranking, and that points on the same structure/cluster/manifold are likely to have the same score. The ranking process can be described as follows:

Form a network of data points, in the current case the sentences and the topic description.
Initially, a positive rank is assigned to the known points and all other points get rank 0.
The points then spread their ranking scores to their nearby nodes.
The previous step is repeated till convergence.

Let χ be the set of all sentences (x_1, ..., x_n) in the documents, including the topic description x_0, i.e. χ = {x_0, x_1, ..., x_n} ⊂ R^m. Let f : χ → R be the ranking function which assigns a rank f_i to each x_i. Let y = [y_0, ..., y_n]^T with y_0 = 1, since x_0 is the topic sentence.

Algorithm 1 Manifold-Ranking Algorithm (Wan et al., 2007)
1: Compute the pair-wise similarity values between sentences (points) using the standard cosine measure. The weight associated with term t is calculated with the tf_t · isf_t formula, where tf_t is the frequency of term t in the sentence and isf_t is the inverse sentence frequency of term t, i.e. 1 + log(N/n_t), where N is the total number of sentences and n_t is the number of sentences containing term t. Given two sentences (data points) x_i and x_j, the cosine similarity is denoted sim(x_i, x_j), computed as the normalized inner product of the corresponding term vectors.
2: Connect any two points with an edge if their similarity value exceeds 0. We define the affinity matrix W by W_ij = sim(x_i, x_j) if there is an edge linking x_i and x_j. Note that we let W_ii = 0 to avoid loops in the graph built in the next step.
3: Symmetrically normalize W by S = D^(−1/2) W D^(−1/2), in which D is the diagonal matrix with the (i,i)-element equal to the sum of the i-th row of W.
4: Iterate f(t+1) = αSf(t) + (1 − α)y until convergence, where α is a parameter in (0,1).
5: Let f_i* denote the limit of the sequence {f_i(t)}. Each sentence x_i (1 ≤ i ≤ n) gets its ranking score f_i*.

The normalization in the third step of the algorithm is required to guarantee convergence, and the fourth step is required to spread the ranking to the neighbouring nodes. The parameter α in step four describes the relative contributions of the neighbours and of the initial score to the ranking score. The affinity matrix W is used to capture the importance of the links between the sentences of the documents:

W = λ_1 W_intra + λ_2 W_inter   (8)

where λ_1, λ_2 ∈ [0, 1]; if λ_1 = λ_2 = 1, inter-document and intra-document links get equal importance. The paper (Wan et al., 2007) also suggests a greedy approach to impose the diversity penalty. As shown in equation 9, the rank of a sentence is decreased by the factor ω (the penalty degree factor) after considering its similarity with other sentences. If ω = 0, no diversity penalty is imposed on the rank of the sentence.

RankScore(x_j) = RankScore(x_j) − ω · S_ji · f_i*   (9)

• Evaluation
The authors report using the DUC-2003 and DUC-2005 data sets, with ROUGE as the evaluation metric. The parameter setting for the algorithm was ω = 8, λ_1 = 0.3, λ_2 = 1 and α = 0.6. The lead baseline takes the first sentences one-by-one from the last document (the documents being chronologically ordered), whereas the coverage baseline takes the first sentence one-by-one from the first document to the last. Similarity-Ranking1 does not apply weights to inter- and intra-document sentence links, whereas Similarity-Ranking2 does not impose the diversity penalty. S13, S16 and S17 are top-performing systems in DUC-2003. Table 5 shows the comparison of manifold-ranking based summarization with the above-mentioned summarization approaches.

Systems | R-1 | R-2 | R-W
Manifold-Ranking | 0.37332 | 0.07677 | 0.11869
Similarity-Ranking1 | 0.36088 | 0.07229 | 0.11540
S16 | 0.35001 | 0.07305 | 0.10969
Similarity-Ranking2 | 0.34542 | 0.07283 | 0.1115
S13 | 0.31986 | 0.05831 | 0.10016
S17 | 0.31809 | 0.04981 | 0.09887
Coverage Baseline | 0.30290 | 0.05968 | 0.09678
Lead Baseline | 0.28200 | 0.0468 | 0.09077

Table 5: Result of Manifold-Ranking Based Approach to Multi-Document Summarization (Wan et al., 2007)

• Parameter Tuning
The graph in figure 2 shows the results of various systems for different values of the penalty degree factor. Using this information, the authors state that both no diversity penalty and too much diversity penalty degrade the performance of the summarization systems. The graph in figure 3 plots the ratio of the weights assigned to inter-document and intra-document sentence links against the ROUGE-1 measure. As the score of an inter-document link is always greater than that of an intra-document link, the paper (Wan et al., 2007) suggests giving more importance to the links between sentences of the same document.
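Steps 3-5 of Algorithm 1 can be sketched in a few lines of pure Python. The 4x4 affinity matrix below is invented toy data (index 0 plays the role of the topic sentence x_0); it is only meant to show the normalization and the iteration f(t+1) = αSf(t) + (1 − α)y:

```python
import math

def manifold_rank(W, alpha=0.6, iters=200):
    """Steps 3-5 of Algorithm 1: symmetric normalization of the affinity
    matrix followed by the score-spreading iteration."""
    n = len(W)
    W = [[0.0 if i == j else W[i][j] for j in range(n)] for i in range(n)]  # W_ii = 0
    d = [sum(row) for row in W]
    inv = [1.0 / math.sqrt(x) if x > 0 else 0.0 for x in d]
    S = [[inv[i] * W[i][j] * inv[j] for j in range(n)] for i in range(n)]
    y = [1.0] + [0.0] * (n - 1)          # only the topic point is labelled
    f = [0.0] * n
    for _ in range(iters):
        f = [alpha * sum(S[i][j] * f[j] for j in range(n)) + (1 - alpha) * y[i]
             for i in range(n)]
    return f

W = [[0.0, 0.8, 0.1, 0.0],   # row/column 0: topic description
     [0.8, 0.0, 0.5, 0.0],
     [0.1, 0.5, 0.0, 0.3],
     [0.0, 0.0, 0.3, 0.0]]
scores = manifold_rank(W)
# sentence 1, strongly tied to the topic, outranks sentences 2 and 3
```

Because α < 1 and S is symmetrically normalized, the iteration converges; the diversity penalty of equation 9 would then be applied greedily on top of these scores.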

3 Recent Summarization Approaches


So far, we have discussed traditional and ma-
chine learning based ways of summary gen-
eration. Machine learning based approaches
are mostly of an extractive type of summa-
are made to the Sequence-to-Sequence(S2S)
models by adding the notion of attention. At-
tention allows the model to search for a spe-
cific location to learn from. Neural Atten-
tion Model for Sentence Summarization as
explained in section 4 is one of such mod-
els. Finally, we describe pointer-generator
network model in section 5 for text summa-
rization, which enhances attention model by
probabilistically choosing between generation
Figure 2: Performance of Manifold-Ranking Based and extraction.
Summarization Approach Based on Penalty Degree
Factor (Wan et al., 2007) 3.1 Sequence-to-Sequence RNNs for Text
Summarization(Nallapati et al., 2016)
In summarization length of output i.e. sum-
mary in not much related to the length of the
input text unlike machine translation where
length of out put i.e. translation is a function
of the length of the input text. In terms of the
loss of information, MT has to be loss-less
while summary can skip over unimportant
topics. The paper (Nallapati et al., 2016) pro-
pose an approach to perform abstractive sum-
marization based on basic encoder-decoder
Figure 3: Performance of Manifold-Ranking Based
Summarization Approach Based on Importance Given RNN. The authors also have introduced new
to Inter Document Link and Intra-Document Link dataset for abstractive summarization but we
(Wan et al., 2007) are not considering that part in our discussion.
Rest of this session describes the model.
rization. Recently due to the success of neu- 3.1.1 System Overview
ral networks, various newer approaches using Capturing keywords and handling out of vo-
deep neural networks are getting proposed. cabulary (OOV) words is a challenging task
Along with summarization, Machine Trans- for deep neural network based approaches. To
lation(MT) is also getting discussed vastly. identify the topic discussed in the text feature-
In machine translation, input text presented rich encoder is proposed in the paper (Nallap-
in one language is converted to another lan- ati et al., 2016).
guage. Like summarization, machine transla-
tion also makes use of language interpretation
and generation modules.
On top of the basic idea presented in the ap-
proach of Neural Machine Translation(NMT),
various other approaches were proposed,
summarization using sequence-to-sequence
RNN(section 3.1) is one of them. Recur-
ring Neural Network(RNN) is an advanced
form of feedforward neural network, where Figure 4: Feature-Rich Encoder Proposed in the Pa-
the neuron is used to train recursively in a sin- per (Nallapati et al., 2016)
gle pass of training and such multiple passes
can be used to train the model. Enhancements As shown in figure 4 apart from the em-
bedding of the word, named-entity tag(NER), sequence. Here W’s are learn-able parame-
part-of-speech(POS) tag, term frequency(TF), ters. As stated, in case of pointer generation,
inverse document frequency(IDF) are consid- the proposed model probabilistically gener-
ered as additional features of the input word. ates appropriate pointer to the location in the
For the case of continuous features like TF input document. The equation 11 and 12 rep-
and IDF, a fixed number of bins is used to convert them to categorical features, and one-hot encoding is then applied. These newly added features are concatenated to the original word embedding.

3.1.2 Switching Generator/Pointer Model

The vocabulary of the decoder is fixed at training time, so it cannot emit unseen (OOV) words. Previously, OOVs were handled by replacing them with UNK (unknown). This approach instead makes use of a switching generator/pointer model for handling OOVs.

Figure 5: Switching Generator-Pointer Model Proposed in the Paper (Nallapati et al., 2016)

The authors term their mechanism for handling OOVs the switch. The switch is located at the decoder; based on the value it produces, the decoder decides between generating a new word and copying a word from the input text. Whenever the switch is on, the decoder generates a word from its vocabulary, and whenever it is off, the decoder emits a pointer to a word in the input text. The switch produces a probabilistic value as shown in equation 10.

P(s_i = 1) = σ(v^s · (W_h^s h_i + W_e^s E[o_{i-1}] + W_c^s c_i + b^s))   (10)

The equation above gives the probability of the switch being on at the i-th decoder time-step, where h_i is the decoder hidden state, E[o_{i-1}] is the embedding vector of the previous emission, and c_i is a context vector which handles attention over the input text. Equation 11 represents the way to find the appropriate position in the input text to copy a word from: p_i denotes the pointer into the input text used to produce the i-th summary word, and P_i^a is the attention distribution over every word in the input document. If two locations get the same probability value, a hard assignment is made by giving preference to the first occurring word.

P_i^a ∝ exp(v^a · (W_h^a h_{i-1} + W_e^a E[o_{i-1}] + W_c^a h_j^d + b^a))   (11)

p_i = argmax_j P_i^a(j)   for j ∈ {1, . . . , N_d}   (12)

As shown in figure 5, if the pointer is used to produce the summary word, the decoder uses that word's embedding as input for the next time-step; otherwise the embedding of the previous time-step's generated word is used.

3.1.3 Capturing Hierarchical Document Structure

In the approach so far, each word is considered individually when producing the summary; the model additionally suggests using sentence-level information along with word-level information. Sentence-level attention is mixed with word-level attention, and the updated word-level attention is then used by the decoder to produce the summary words.

Figure 6: Hierarchical Encoder for Capturing Sentence Level Information (Nallapati et al., 2016)
Figure 6 shows the incorporation of sentence-level attention for summarization. The updated word-level attentions are used as input to the decoder. Sentence-level attentions are concatenated with the positional embedding of the sentence.

P^a(j) = P_w^a(j) P_s^a(s(j)) / Σ_{k=1}^{N_d} P_w^a(k) P_s^a(s(k))   (13)

Equation 13 shows the update performed on the word-level attention, where P_w^a(j) is the original word attention at the j-th position in the input text, s(j) denotes the ID of the sentence at the j-th position, P_s^a(l) is the weight given to the l-th sentence's attention, and N_d is the total number of words. A positional embedding is concatenated to the sentence embedding so as to give importance to sentences present at specific locations in the input text.

3.1.4 Result and Analysis

The authors compare their approach with the model presented in Rush et al. (2015). The original model proposed by Rush et al. (2015) is termed ABS, and its variant, which adds a log-linear model with manually picked features, is termed ABS+.

Model Rouge-1 Rouge-2 Rouge-L
TOPIARY 25.12 6.46 20.12
ABS 26.55 7.06 22.05
ABS+ 28.18 8.49 23.81
Sequence-to-Sequence RNN 28.35 9.46 24.59

Table 6: Performance of Sequence-to-Sequence Abstractive Summarization Model (Nallapati et al., 2016)

All these models are trained on the Gigaword corpus; the proposed model is trained only on the first sentence of each Gigaword article. For testing, the DUC-2003 corpus was selected, with ROUGE as the measure of performance. TOPIARY, the top-performing model on DUC-2003, is also considered for comparison. Table 6 shows the results of the evaluation.

4 Neural Attention Model for Sentence Summarization (Rush et al., 2015)

This model is also called Attention-Based Summarization (ABS). It tries to make use of linguistic structure for the generation of summaries; to that end it computes attention over the input sequence to produce the correct output. The successor of this model was presented in section 3.1. The approach presented in the paper (Rush et al., 2015) is abstractive sentence summarization, which takes a sentence as input and converts it into a condensed form. The approach can be further extended to produce summaries of documents.

4.1 Proposed Model

Consider x as the input sequence of words in the sentence, let y_i be the i-th generated word, and let y_c denote the context of the output word y. Then the conditional log probability of the summary sentence can be written as in equation 14.

log p(y|x; θ) ∝ Σ_{i=0}^{N-1} log p(y_{i+1} | x, y_c; θ)   (14)

where N is the length of the output sentence, which is fixed in the proposed model, and θ represents the learnable parameters. This is the same as a Markov model, with C being the variable defining the degree of the Markov assumption.

The objective of the model can be formalized as shown in the equation below:

argmax_y log p(y|x)   (15)

4.2 Natural Language Model

The approach proposed in the paper (Rush et al., 2015) makes use of a basic feedforward neural network and generates a probability distribution over the output sequence. The probability of generating the next word is given in equation 16, where θ = (E, U, V, W), C is the window size of the context, E is an embedding matrix, and V and W are the weights associated with the hidden
state and the encoder.

p(y_{i+1} | y_c, x; θ) ∝ exp(V h + W enc(x, y_c))   (16)

y_c = [E y_{i-C+1}, . . . , E y_i]   (17)

h = tanh(U y_c)   (18)

Figure 7: System Architecture of Attention-Based Sentence Summarization: (a) decoder with additional encoder element, (b) attention-based encoder (Rush et al., 2015)

Figure 7 represents the same formulation pictorially. Part (a) explains the calculation required for getting the probability of a specific word as the next summary word, and part (b) shows the formulation of the attention-based encoder (enc3), which is discussed in section 4.3.

4.3 Encoder and Decoder

The encoder takes the input words x and the already generated words y_c as input and transforms them using a feature matrix F. Three encoders are suggested in the paper; they are as follows.

• Bag-of-Words Encoder (enc1): It assigns equal probability to each word in the input sentence and multiplies this with the transformed input x. This is the same as a uniform distribution over the input words. The model loses the connections between words and also suffers from stop words.

enc1(x, y_c) = p^T x̄
p = [1/M, . . . , 1/M]
x̄ = [F x_1, . . . , F x_M]

• Convolutional Encoder (enc2): This type of encoder makes use of a standard time-delay neural network with convolution layers and max pooling. In order to avoid the issues faced by bag-of-words, it considers interactions between words in the input sentence.

• Attention-Based Encoder (enc3): Convolutional encoders are hard to learn, so the authors suggest a simpler attention-based encoder with the following formulation:

enc3(x, y_c) = p^T x̄'   (19)
p ∝ exp(x̄ P ȳ_c)   (20)
x̄ = [F x_1, . . . , F x_M]   (21)
ȳ_c = [G y_{i-C+1}, . . . , G y_i]   (22)
∀i: x̄'_i = Σ_{q=i-Q}^{i+Q} x̄_q / Q   (23)

where G is an embedding matrix for the context and Q is a smoothing window, which allows the model to focus on certain parts of the sentence when the current context aligns with the current input word x. This approach is very similar to the bag-of-words approach, with the uniform distribution replaced by a probabilistic distribution over alignments of the input words.

NLL(θ) = − Σ_{j=1}^{J} Σ_{i=1}^{N-1} log p(y_{i+1}^{(j)} | x^{(j)}, y_c; θ)   (24)
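The attention-based encoder of equations 19 through 23 can be sketched as follows. This is an illustrative numpy sketch under assumed names and sizes, not the authors' code:

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def attention_encoder(x_ids, yc_ids, F, G, P, Q=2):
    """Attention-based encoder enc3 (sketch of equations 19-23).

    x_ids  : indices of the M input words
    yc_ids : indices of the C context (already generated) words
    F, G   : input and context embedding matrices (rows = word vectors)
    P      : alignment matrix between input and context embeddings
    Q      : smoothing window over input positions (equation 23)
    """
    x_bar = F[x_ids]                      # equation 21: [F x_1, ..., F x_M]
    yc_bar = G[yc_ids].mean(axis=0)       # equation 22 context, averaged
    p = softmax(x_bar @ P @ yc_bar)       # equation 20: alignment distribution
    # Equation 23: local smoothing of the input embeddings over window Q.
    M = len(x_ids)
    x_smooth = np.stack([x_bar[max(0, i - Q):i + Q + 1].mean(axis=0)
                         for i in range(M)])
    return p @ x_smooth                   # equation 19: enc3 = p^T x_bar'

rng = np.random.default_rng(1)
V, d = 20, 5
F, G = rng.normal(size=(V, d)), rng.normal(size=(V, d))
P = rng.normal(size=(d, d))
vec = attention_encoder([3, 7, 1, 9], [2, 5], F, G, P)
```

With P set to zeros, p collapses to a uniform distribution and the encoder reduces to (a smoothed version of) the bag-of-words encoder, which illustrates the relationship noted above.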
Consider a training set with J sentence-summary pairs ((x^(1), y^(1)), . . . , (x^(J), y^(J))). To train the model and learn the various parameters, the negative log-likelihood objective function of equation 24 is used. To generate the summary sentence, a beam search algorithm is proposed for the decoder, since the complexity of exact decoding is exponential in the window size.

4.4 Result and Analysis

The authors compare their results using the ROUGE measure with DUC-2004 as the evaluation data set. They compare their approach with TOPIARY, the top-performing system on DUC-2004, and MOSES+, a phrase-based statistical MT system. Table 7 shows the actual results of the evaluation.

Model Rouge-1 Rouge-2 Rouge-L
TOPIARY 25.12 6.46 20.12
MOSES+ 26.5 8.13 22.85
ABS 26.55 7.06 22.05
ABS+ 28.18 8.49 23.81

Table 7: Comparison of Results of ABS with Various Summarization Systems on the DUC-2004 Dataset (Rush et al., 2015)

5 Summarization with Pointer-Generator Network (See et al., 2017)

The summarization approaches discussed so far try to generate summaries irrespective of the correctness of factual data and without considering the novelty of information in the produced summary. The abstractive summarization proposed in the paper (See et al., 2017) tries to overcome these shortcomings along with the handling of OOVs. The authors discuss three components: (1) a baseline model (section 5.1), (2) a pointer-generator model (section 5.2), and (3) a coverage mechanism (section 5.3). The rest of this section discusses these approaches in detail.

5.1 Sequence-to-Sequence Attention Model

Figure 8: Baseline Sequence-to-Sequence Attention Model for Abstractive Text Summarization (See et al., 2017)

The paper (See et al., 2017) discusses a baseline model similar to the one we have discussed in section 3.1. The proposed baseline uses a single bidirectional LSTM as the encoder and a single-layer unidirectional LSTM as the decoder. This baseline model is depicted in figure 8, from which it is clear that the word "beat" gets generated based on the present context of the sentence. Let the encoder hidden states be h_i and the decoder hidden state be s_t; then the attention distribution at time-step t can be formulated as shown in equations 25 and 26, where v, W_h, W_s and b_attn are learnable parameters.

e_i^t = v^T tanh(W_h h_i + W_s s_t + b_attn)   (25)

a^t = softmax(e^t)   (26)

Attention can be considered as indicating the location to produce the next word from. Attention is used to obtain a weighted sum of the encoder hidden states, which represents an overall hidden state h*_t. This hidden state, along with the hidden state of the decoder, is then used to produce a probability distribution P_vocab over all words in the vocabulary. Equation 27 captures the calculation required to generate this distribution.

P_vocab = softmax(V'(V [s_t, h*_t] + b) + b')   (27)
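Equations 25 through 27 amount to additive attention followed by a two-layer output projection. A minimal numpy sketch, with parameter names and shapes assumed for illustration:

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def decoder_step(H, s_t, prm):
    """One attention + vocabulary-distribution step (equations 25-27).

    H   : encoder hidden states, shape (N, d)
    s_t : decoder hidden state at step t, shape (d,)
    prm : dict of learnable parameters with illustrative shapes
    """
    # Equation 25: e_i^t = v^T tanh(W_h h_i + W_s s_t + b_attn), for every i.
    scores = np.tanh(H @ prm["W_h"].T + prm["W_s"] @ s_t + prm["b_attn"]) @ prm["v"]
    a_t = softmax(scores)                 # equation 26: attention distribution
    h_star = a_t @ H                      # context: attention-weighted sum
    # Equation 27: two linear layers over [s_t, h*_t], then a softmax.
    inner = prm["V1"] @ np.concatenate([s_t, h_star]) + prm["b1"]
    return a_t, softmax(prm["V2"] @ inner + prm["b2"])

rng = np.random.default_rng(2)
N, d, vocab = 6, 4, 12
prm = {"W_h": rng.normal(size=(d, d)), "W_s": rng.normal(size=(d, d)),
       "v": rng.normal(size=d), "b_attn": rng.normal(size=d),
       "V1": rng.normal(size=(d, 2 * d)), "b1": rng.normal(size=d),
       "V2": rng.normal(size=(vocab, d)), "b2": rng.normal(size=vocab)}
a_t, P_vocab = decoder_step(rng.normal(size=(N, d)), rng.normal(size=d), prm)
```

Both outputs are proper distributions: a_t sums to one over the N source positions, and P_vocab sums to one over the vocabulary.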
Figure 9: Pointer-Generator Model for Abstractive Text Summarization (See et al., 2017)

where V, V', b, b' are learnable parameters. Negative log-likelihood is used to train the model and learn the parameters.

5.2 Pointer-Generator Network

This is a hybrid model which combines the baseline model with the pointer network proposed in Vinyals et al., 2015. The pointer-generator model tries to handle OOVs either by copying from the input text or by generating from the decoder vocabulary. Figure 9 describes the working of the pointer-generator model. The generation probability is calculated as shown in equation 28, where the w's and b_ptr are learnable parameters.

p_gen = σ(w_h^T h*_t + w_s^T s_t + w_x^T x_t + b_ptr)   (28)

Unlike the approach proposed in section 3.1, where the generator/pointer switch is a binary variable, in the proposed model the switch is modelled as a continuous variable in the range [0, 1]. The authors call this a soft switch, and the σ function is used to decide between the generation and pointer mechanisms. The notion of an extended vocabulary is introduced, which is the union of the vocabulary and all words appearing in the input text. Equation 29 gives the probability distribution over extended-vocabulary words.

P(w) = p_gen P_vocab(w) + (1 − p_gen) Σ_{i: w_i = w} a_i^t   (29)

When w happens to be OOV, P_vocab is zero, and when w does not appear in the source document, the attention term is zero. Negative log-likelihood is used as the loss function to train the model and learn the parameters.

5.3 Coverage Mechanism

The main purpose of the coverage mechanism is to avoid repetition in the generated summary. To achieve this, the paper (See et al., 2017) suggests maintaining a coverage vector c^t, which is the sum of attention distributions over all previous decoder time-steps, as in equation 30.

c^t = Σ_{t'=0}^{t-1} a^{t'}   (30)

The updated form of equation 25, after considering the coverage vector, is shown in equation 31:

e_i^t = v^T tanh(W_h h_i + W_s s_t + w_c c_i^t + b_attn)   (31)

The authors also suggest adding a coverage loss to the negative log-likelihood; equation 32 then describes the overall loss for learning the parameters, where λ is also learnt.

loss_t = − log P(w*_t) + λ Σ_i min(a_i^t, c_i^t)   (32)

Taking the min of attention and coverage is useful for penalizing only the overlapping part.

5.4 Result and Analysis

The authors compare their model with the abstractive model presented in section 3.1 and with sequence-to-sequence (s2s) baselines trained with 150k and 50k vocabulary words. Table 8 shows the results of the evaluation on the basis of the ROUGE measure on the CNN/Daily Mail dataset, where the proposed model makes use of 256-dimensional hidden states and 128-dimensional word embeddings.
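The final distribution of equation 29 and the coverage penalty of equation 32 can be sketched as follows. This is an illustrative numpy sketch with a simplified extended-vocabulary indexing scheme; the names are made up:

```python
import numpy as np

def final_distribution(p_gen, P_vocab, attn, src_ids, ext_size):
    """Equation 29: P(w) = p_gen * P_vocab(w) + (1 - p_gen) * sum of attn over w.

    P_vocab  : distribution over the fixed vocabulary
    attn     : attention distribution a^t over source positions
    src_ids  : extended-vocabulary id of each source token
    ext_size : fixed vocabulary size + number of source-only (OOV) words
    """
    P = np.zeros(ext_size)
    P[:len(P_vocab)] = p_gen * P_vocab
    for pos, w in enumerate(src_ids):          # add copy-probability mass
        P[w] += (1.0 - p_gen) * attn[pos]
    return P

def coverage_penalty(attn, coverage):
    """Second term of equation 32: sum_i min(a_i^t, c_i^t)."""
    return np.minimum(attn, coverage).sum()

vocab = 5
P_vocab = np.array([0.4, 0.3, 0.2, 0.1, 0.0])
attn = np.array([0.5, 0.25, 0.25])
src_ids = [1, 5, 2]    # the token at position 1 is an OOV with extended id 5
P = final_distribution(0.8, P_vocab, attn, src_ids, ext_size=6)
cov = coverage_penalty(attn, np.array([0.1, 0.9, 0.0]))
```

Note that the OOV word (extended id 5) receives probability mass only from the copy term, exactly the case described above where P_vocab is zero, and the mixture still sums to one.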
Model Rouge-1 Rouge-2 Rouge-L
Abstractive Model (Nallapati et al., 2016) 35.46 13.3 32.65
s2s 150k vocab 30.49 11.17 28.08
s2s 50k vocab 31.33 11.81 28.83
Pointer Generator 36.44 15.66 33.42
Pointer Generator + Coverage 39.53 17.28 36.38

Table 8: Comparison of Results of Models Suggested in the Paper (See et al., 2017) with the Basic Sequence-to-Sequence Model Proposed in the Paper (Nallapati et al., 2016)

The authors conclude that fully abstractive summarization is hard to achieve with the pointer-generator model, since the generation probability rises from 0.3 to 0.53 during training but gets stuck around 0.17 at test time.

6 Conclusion

In this survey we have categorized summarization methods into traditional approaches, machine learning based approaches, and recent approaches which use deep neural networks for generating summaries. We have also described various types of summarization, such as abstractive vs. extractive, multilingual vs. monolingual, supervised vs. unsupervised, etc. Some summary evaluation measures, such as ROUGE, BLEU and DEPVAL, are also described.

Recently, due to advances in computational power, sophisticated models based on neural networks, joint learning, reinforcement learning, etc. have been proposed, and year by year more accurate and acceptable summaries are being produced. Various evaluation measures such as ROUGE, BLEU and METEOR are used to judge the quality of generated text.

After this exploration, we can conclude that text summarization is a vastly studied topic in the field of AI and NLP, and research is still ongoing to achieve human-level excellence in producing summaries. As there is no exact measure to declare a summary good or bad, and as a reader's perception changes with domain knowledge, the topic of text summarization remains open for researchers.

References

Yllias Chali, Sadid A Hasan, and Shafiq R Joty. 2009. A SVM-based ensemble approach to multi-document summarization. In Canadian Conference on Artificial Intelligence, pages 199–202. Springer.

Mahak Gambhir and Vishal Gupta. 2017. Recent automatic text summarization techniques: a survey. Artificial Intelligence Review, 47(1):1–66.

Sujian Li, You Ouyang, Wei Wang, and Bin Sun. 2007. Multi-document summarization using support vector regression. In Proceedings of DUC. Citeseer.

Daniel Marcu. 1997. From discourse structures to text summaries. Intelligent Scalable Text Summarization.

Rada Mihalcea. 2004. Graph-based ranking algorithms for sentence extraction, applied to text summarization. In Proceedings of the ACL Interactive Poster and Demonstration Sessions.

Ramesh Nallapati, Bowen Zhou, Caglar Gulcehre, Bing Xiang, et al. 2016. Abstractive text summarization using sequence-to-sequence RNNs and beyond. arXiv preprint arXiv:1602.06023.

Alexander M Rush, Sumit Chopra, and Jason Weston. 2015. A neural attention model for abstractive sentence summarization. arXiv preprint arXiv:1509.00685.

Abigail See, Peter J Liu, and Christopher D Manning. 2017. Get to the point: Summarization with pointer-generator networks. arXiv preprint arXiv:1704.04368.

Dou Shen, Jian-Tao Sun, Hua Li, Qiang Yang, and Zheng Chen. 2007. Document summarization using conditional random fields. In IJCAI, volume 7, pages 2862–2867.

Xiaojun Wan, Jianwu Yang, and Jianguo Xiao. 2007. Manifold-ranking based topic-focused multi-document summarization. In IJCAI, volume 7, pages 2903–2908.