An Approach To Abstractive Text Summarization

This document describes an approach to abstractive text summarization based on discourse rules, syntactic constraints, and word graphs. It proposes using discourse rules and syntactic constraints to generate sentences from keywords, and using a word graph to represent word relations and combine multiple sentences. The approach aims to address issues with generating incorrect meanings from existing word graph methods by separating the process into sentence reduction and sentence combination stages.
Article · March 2015


DOI: 10.1109/SOCPAR.2013.7054161

An approach to Abstractive Text Summarization

Huong Thanh Le, Tien Manh Le
Hanoi University of Science and Technology, Hanoi, Vietnam

Abstract: Abstractive summarization is the technique of generating a summary of a text from its main ideas, rather than by copying the most salient sentences verbatim. This is an important and challenging task in natural language processing. In this paper, we propose an approach to abstractive text summarization based on discourse rules, syntactic constraints, and a word graph. Discourse rules and syntactic constraints are used in the process of generating sentences from keywords. The word graph is used in the sentence combination process to represent word relations in the text and to combine several sentences into one. Experimental results show that our approach is promising in solving the abstractive summarization task.

Keywords: abstractive text summarization, discourse relation, word graph

I. INTRODUCTION

Automatic text summarization is the technique of automatically creating an abstract or summary of a text. It has gained widespread interest due to the overwhelming amount of textual information available in electronic format. Text summarization techniques can be broadly grouped into abstractive summarization (AS) and extractive summarization (ES). Most research on text summarization concerns ES [2,11], since it is easier and faster than AS. ES extracts the most salient sentences verbatim from the text. AS, in contrast, relies on Natural Language Processing (NLP) techniques to copy and paste sentence fragments from the input document and possibly combine the selected content with extra linguistic information in order to generate the final summary. There are two main problems with ES. First, textual coherence is not guaranteed, since anaphora resolution receives no attention in this approach. Second, redundant phrases still exist in the summary. AS can solve these problems by applying NLP techniques to post-process the output of ES, such as sentence truncation, aggregation, generalization, reference adjustment, and rewording [4,6,8]. However, AS is still a major challenge for the NLP community despite some work on sub-sentential modification [4,6].

Recent approaches in AS use word graphs to represent a document [3,9]. These graphs are then used to produce document abstracts, allowing the algorithm to compress and merge information. Representing documents by word graphs is a new and promising approach for generating abstractive summaries. However, this approach still has many problems, as discussed in Section 2. In this paper, we concentrate on rhetorical structure and the word graph to generate an abstractive summary. Several strategies are proposed to solve existing problems with word graph based approaches. The input of our system is an extractive summary after anaphora resolution. That means all pronouns have been replaced by the corresponding nouns or noun phrases (NPs).

The rest of this paper is organized as follows. Section 2 analyzes existing problems with the word graph and proposes our strategies to deal with them. Our sentence reduction method is introduced in Section 3. Section 4 presents our method of merging sentences using the word graph. Experimental results are discussed in Section 5. Finally, Section 6 concludes the paper and gives some directions for future work.

II. CONSTRUCTING GRAPH

A word graph consists of nodes and edges. Existing approaches to AS [3,9] use nodes to store information about words and their POS tags, and edges to represent adjacency relations between word pairs. A new sentence is generated by connecting all words along a path of the word graph.

Approaches that use a word graph for single document summarization still have problems, as many sentences with incorrect meaning can be generated. This is because the generation algorithms find paths among words in the graph regardless of syntactic correctness and of the original text. An example of sentences with incorrect meaning is shown below.

Example 1: Mẹ(mother) Bách(Bach) mua(buy) thuốc(medicine) về(back) cho(for) uống(drink). Sau_khi(after) uống(drink), Bách(Bach) có(has) biểu_hiện(symptom) đỏ(red) môi(lip) và(and) nổi(appear) bọng(bubble) nước(water) ở(at) tay(hand) và(and) chân(leg).

Figure 1. The word graph representation for Example 1

In Fig. 1, each small circle represents a word of the text; the symbol ⊗ means the end of a sentence. Each arrow is created by connecting two adjacent words in a sentence. The above word graph can generate the sentences “Bách(Bach) mua(buy)
thuốc(medicine) về(back) cho(for) uống(drink)” and “Mẹ(mother) Bách(Bach) có(has) biểu_hiện(symptom) đỏ(red) môi(lip) và(and) nổi(appear) bọng(bubble) nước(water) ở(at) tay(hand) và(and) chân(leg)”, which do not reflect the correct meaning of the text. In addition, “đỏ môi và chân | red lip and leg” should not be generated, since it does not reflect the correct meaning of the original text. Neither the PageRank method nor the sentence position information used in [3] and [9] can deal with this problem.

The problem of incorrect meaning in a new phrase occurs not only with NPs but also with other phrases such as verb phrases (VPs) and adjective phrases (AdjPs). To solve this problem, instead of finding paths containing keywords using scores or shortest paths as in [3] and [9], our abstractive summary generation process is separated into two stages: sentence reduction and sentence combination. The sentence reduction step is based on the input sentences, the keywords of the original text, and syntactic constraints. The word graph is used only in the sentence combination stage.

The problem of incorrect meaning in a new phrase is solved in the sentence reduction stage using two strategies. First, all basic phrases¹ (including basic NPs, basic VPs, and basic AdjPs) from the extractive sentences that contain keywords are used as essential materials for the sentence reduction stage. A new sentence is created by connecting the first phrase to the last one in the original sentence and then expanding its left and right sides to satisfy syntactic constraints. A detailed description of this procedure is given in Section 3.

¹ The Vietnamese chunker, created by Nguyen Le Minh and Cao Hoang Tru as part of the VLSP project (http://vlsp.vietlp.org:8080/demo/?page=home), is used for extracting basic phrases from sentences.

In the above example, to generate a new sentence from the original sentence “Mẹ(mother) Bách(Bach) mua(buy) thuốc(medicine) về(back) cho(for) uống(drink).”, the basic NP in this sentence that contains the keyword “Bách” is “Mẹ Bách”. Therefore, “Mẹ Bách” (not “Bách”) is used as the subject of this sentence. To generate a new sentence from the original sentence “Sau_khi(after) uống(drink), Bách(Bach) có(has) biểu_hiện(symptom) đỏ(red) môi(lip) và(and) nổi(appear) bọng(bubble) nước(water) ở(at) tay(hand) và(and) chân(leg).”, “Mẹ Bách” cannot be the subject of the new sentence, since it does not appear in the original sentence.

The second strategy for the first problem is to treat stop words, prepositions, numerals, auxiliary words, and negative words (e.g., “không(not)”, “chẳng(never)”) as separate nodes. Otherwise, the real meaning of the sentence could be changed when generating new sentences.

The second problem of existing word graph approaches is that the word graph can generate ungrammatical sentences. Let us consider the example below.

Example 2: Hà_Giang(HaGiang) trở_thành(becomes) điểm(place) hấp_dẫn(attractive) khách(guest) du_lịch(tourist) trong(inside) và(and) ngoài(outside) nước(country). Sở(department) Văn_hóa(culture) và(and) Du_lịch(tourist) Hà_Giang(HaGiang) đã(has) ký(sign) hợp_tác(cooperation) phát_triển(develop) du_lịch(tourist) với(with) Sở(department) Văn_hóa(culture) và(and) Du_lịch(tourist) của(of) nhiều(many) thành_phố(city).

Figure 2. The word graph representation for Example 2

The sentence “Sở(department) văn_hóa(culture) và(and) ngoài(outside) nước(country).” is generated by the above word graph. However, this is only an NP, not a sentence. Moreover, this NP is an incorrect name of a Vietnamese department. This is because the word “và(and)” in the middle of the original department name has two outgoing branches in the word graph: one to the rest of the department name and one to another branch. The sentence is created by following the second branch from “và(and)”. Such a sentence never appears in our system, since this case of incorrect meaning is handled by the two strategies of the sentence reduction stage mentioned above.

Another drawback of existing word graph based approaches is that they do not take word/phrase meaning into account. Different words/phrases that refer to the same concept are represented as different nodes in their graphs. As a result, sentences that contain these nodes cannot be merged to create a new sentence with richer information than the old ones. To solve this problem, an anaphora resolution module² has been integrated into our ES system. The output of our ES system is then used as the input of our abstractive summarizer. This differs from [3] and [9], in which anaphora resolution is not considered. With this type of input, all nodes that refer to a concept are grouped into one. That is, the text field of a node stores multiple values, as illustrated in Fig. 3. If the original sentence uses one value in this text field, the system can use any value in the group to generate a new sentence. We consider two cases: (i) synonymous words; and (ii) different expressions that refer to the same concept.

² Due to the scope of this paper, the anaphora resolution step is not described here.

To deal with the first case, a synonym dictionary is used. For example, “phát_biểu(say)” and “tuyên_bố(declare)” are synonyms, so they are treated as one node in the graph.

The second case is solved by using coreference resolution. For example, if the original sentence is “Vũ_Dư(VuDu) dùng(use) thuốc(medicine) Biseptol.” and coreference resolution tells us that “bệnh_nhân(patient)”, “Vũ_Dư(VuDu)”, “Dư(Du)”, and “bệnh_nhân(patient) Vũ_Dư(VuDu)” refer to the same object, the original word graph in Fig. 3a can be expanded as in Fig. 3b. Such coreference
resolution rules are proposed by us and are integrated into our system.

Figure 3. The graph representation for the sentence “Vũ_Dư dùng thuốc Biseptol.”
(a) Vũ_Dư → dùng → thuốc → Biseptol → .
(b) [Bệnh_nhân | Vũ_Dư | Dư | Bệnh_nhân Vũ_Dư | Bệnh_nhân Dư] → dùng → thuốc → Biseptol → .

If the next sentence involves “Vũ_Dư”, as in “Bệnh_nhân(patient) bị(suffer from) biến_chứng(side_effect) nặng(heavy)”, the word “bệnh_nhân(patient)” is also mapped to the first node of the graph in Fig. 3b.

Our graph representing the input text is organized as follows. The graph G = (V, E) consists of a set of vertices (nodes) V and a set of edges E. A vertex keeps four kinds of information, including:
• a text field, which stores the words or phrases that refer to a concept;
• a POS field, which stores the grammatical role of the text field. If the text field has several values, the largest POS tag is assigned.

An edge connects two vertices in the graph. Two vertices are connected if their texts are adjacent in the input text.

The input of the graph construction algorithm is the original document and its extractive summary. The extractive summary has been tokenized and POS tagged, its coreferences have been resolved, and its unsplittable phrases have been identified. The words and unsplittable phrases are called the textual units of the input text. The output of the algorithm is a graph G = (V, E) that represents the extractive summary. The steps to generate this graph are as follows:
• Detect all phrases that refer to a proper name in the original text (e.g., “bệnh_nhân(patient)”, “Vũ_Dư(VuDu)”, “Dư(Du)”, and “bệnh_nhân(patient) Vũ_Dư(VuDu)”). These phrases are called unsplittable phrases.
• For each textual unit from the sentences in the extractive summary:
  • Add a vertex v_i corresponding to this textual unit when: (i) the textual unit is a stop word, preposition, numeral, or negative word; or (ii) the textual unit with its POS does not yet exist in the graph. If the textual unit is a proper name or an unsplittable phrase, add all coreferent words/phrases of this textual unit to the text field of the new node.
  • Create a directed edge from the vertex corresponding to the previous textual unit to the vertex corresponding to the current textual unit.

As mentioned earlier in Section 2, our process of generating an abstractive summary is divided into two stages: sentence reduction and sentence combination. The stage of sentence reduction is introduced next.

III. SENTENCE REDUCTION

The input of the sentence reduction module is the extractive summary of the document and the keywords of the original document. Keywords are extracted from the original document by computing the tf-isf (term frequency - inverse sentence frequency) of its terms and taking the top k percent of terms with the highest tf-isf. The optimal value of k in our system, determined experimentally, is 15%. The output of this module is another version of the summary that is shorter than its input text.

By studying how humans write summaries, Jing and McKeown [6] found that professional abstractors often reuse the text of an original document and then edit the extracted sentences to produce the summary. Applying this idea, instead of creating new sentences from keywords, we locate important phrases in the original sentences (basic phrases in the original sentences that contain keywords) and use them as essential materials for generating an abstractive summary. This method lets us reduce ungrammatical phrases and produce sentences whose meaning is close to that of the original sentences.

To create a new sentence from the important phrases of the original sentence, the fragment that spans from the first important phrase to the last one in the original sentence is generated. This fragment is considered the essential part of the original sentence. Then other words of the original sentence are added to the beginning and end of this fragment to create a syntactically correct sentence that still retains the meaning of the original sentence.

The process of generating a sentence from the essential fragment of a sentence is divided into two steps: (i) completing the end of the sentence; and (ii) completing the beginning of the sentence.

A. Completing the end of a sentence

The input of this process is the original sentence, which has been divided into basic phrases (NP, VP, etc.), and the essential fragment of the original sentence. To investigate grammatical problems with fragments generated by our system, we carried out an experiment using a data set of 200 documents collected from online newspapers. Experimental results showed that a fragment ending in one of the following ways cannot be the end of a syntactically correct sentence:
• The fragment ends with an NP that can be the subject of a clause/sentence, or the object of the main VP of the original sentence.
• The fragment ends with a VP that is followed by an NP or an AdjP in the original sentence; or the VP ends with a verb (V) and is followed by a prepositional phrase in the original sentence.

Based on our observations, the process of filling the end of a sentence is as follows:
• If the fragment ends with an NP and there is an AdjP or a VP right after it in the original sentence, connect that AdjP or VP to the end of the fragment.
• If the fragment ends with a VP and there is an NP, an AdjP, or a VP right after it in the original sentence,
connect that NP, AdjP, or VP to the end of the fragment.
• If the fragment ends with a VP that ends with a verb and is followed by a prepositional phrase in the original sentence, connect the prepositional phrase to the end of the fragment.

B. Completing the beginning of a sentence

The input of this process is the original sentence, which has been divided into basic phrases (NP, VP, etc.), and the essential fragment of the original sentence after its end has been completed. By investigating the fragments returned in our experiments, we found that the NP and the VP at the beginning of a fragment may not be the main NP or the main VP of the original sentence. This is because the keywords of the sentence may be in the object of the main verb of the sentence, or in a prepositional phrase of the sentence.

Finding the subject or the main verb of the sentence by locating the first NP or the first VP of the sentence, respectively, is not always correct, since these phrases can lie within an adverbial phrase (adVP) of the sentence. Finding these phrases by locating the NP or the VP right before these phrases is not always correct either. Therefore, filling the beginning of the fragment is more complicated than filling the end of the fragment.

By studying written text, we found that the important parts of a sentence are often located at the beginning of the sentence. Therefore, when creating a new sentence from the essential fragment of the original sentence, the beginning of the fragment is expanded to the beginning of the original sentence. After that, some rules are applied to remove unimportant parts at the beginning of the new fragment. To recognize these unimportant parts, rules to detect discourse relations at the sentence level [8] are applied. To explain these rules, Rhetorical Structure Theory (RST) is introduced next.

1) Rhetorical Structure Theory

Rhetorical Structure Theory (RST) [10] is a method of representing the coherence of text. It models the rhetorical structure of a text by a hierarchical tree that labels discourse relations between spans. This hierarchical tree diagram is called a “rhetorical tree” or “discourse tree”. The leaves of an RST tree correspond to elementary discourse units (edus), which are clauses or clause-like units with independent functional integrity, whereas the internal tree nodes correspond to larger spans.

Fig. 4 represents the discourse tree of Example 3. Instead of displaying the full text of each tree node, we cite the first and last edus that contribute to it (e.g., “3.1-3.2”, “3.1-3.3”). An internal tree node contains one or several names (e.g., elaboration, explanation) of the discourse relations that hold between adjacent, non-overlapping spans. A span that participates in a discourse relation is either a nucleus (N) or a satellite (S). The nucleus plays a more important role than the satellite with respect to the writer’s intention. If both spans have equal roles, they are both considered nuclei in the relation.

Example 3: [You should meet Thanh today](3.1) [after you finish this work](3.2). [He will go to Saigon tomorrow.](3.3)

Figure 4. The Discourse Tree of Example 3 (the root span 3.1-3.3 is labeled EXPLANATION and covers span 3.1-3.2 and edu 3.3; span 3.1-3.2 is labeled CIRCUMSTANCE and covers edus 3.1 and 3.2; the label S marks satellites)

To construct the discourse structure of a text, the following tasks should be performed: (i) segmenting the text into edus; (ii) recognizing discourse relations between spans; and (iii) constructing a discourse tree that represents the discourse structure of the text.

Most research on RST for English relies on cue phrases such as because, but, although, etc. to segment text [12]. For example, the sentence “We cannot be sure the product is safe although we have tested it.” can be split into two edus, “We cannot be sure the product is safe” and “although we have tested it.”, based on the cue phrase although. In addition to cue phrases, syntactic information is also used in [8] to segment text into edus.

Researchers have defined many discourse relations such as list, sequence, elaboration, cause, result, evidence, etc. These relations are divided into three types: N-N, N-S, and S-N. In this research, since we are only concerned with removing unimportant parts at the beginning of sentences, only S-N relations are considered. Identifying the names of discourse relations and constructing the discourse tree of the text are out of the scope of this research.

The next section introduces our method of recognizing S-N relations in sentences and removing the unimportant part at the beginning of a sentence.

2) Removing unimportant part at the beginning of a sentence

The first step of this stage is to recognize S-N relations in sentences. Based on this, sentence reduction is done by keeping the N part in the summary.

As mentioned in Section 3.2.1, text segmentation can be done by using cue phrases [12] and syntactic information [8]. As far as we know, there is no Vietnamese syntactic parser whose accuracy is higher than 90%. Therefore, it is not reliable to use the output of a syntactic parser for the text segmentation task. The segmentation process for Vietnamese cannot rely simply on cue phrases either, as analyzed below.

Since Vietnamese is a monosyllabic language, a cue phrase may be recognized incorrectly as a part of another word. A word may also be recognized incorrectly as a cue phrase. Let us consider Example 4 below:

Example 4:
a. Tôi(I) rất(very) buồn(sad) khi(when) em(you) không(did not) đến(come).
b. Chẳng_mấy_khi(rarely) anh(you) đến(come to) nhà(house) tôi(my).

In Example 4a, the word “khi(when)” is a cue phrase. In Example 4b, “khi” is a part of the word “chẳng_mấy_khi(rarely)”, and it is not a cue phrase. To deal with this problem, information about cue phrases is
combined with information about words and their POS tags to detect cue phrases in a given sentence³.

³ The software packages vnTokenizer and vnTagger, created by Le Hong Phuong (at http://mim.hus.vnu.edu.vn/phuonglh/softwares), are used for segmenting a Vietnamese text into words and tagging POS.

The list of cue phrases is created from our empirical research on Vietnamese text and by inheriting cue phrases and their templates from [5,8,12]. Examples of our templates using cue phrases are:
Bởi_vì(since) S nên(therefore) N.
Nếu(if) S thì(then) N.

In general, the strategy of sentence reduction is language independent. However, the process of filling the end of a sentence is language dependent, since each language has its own grammar principles.

IV. SENTENCE COMBINATION

After sentence reduction, the process of sentence combination is carried out. By studying how humans write summaries, we found that the following cases can be merged to create a new sentence with richer information:

a. Two short and consecutive sentences with the same subject:
<sentence 1> = <noun|NP> <VP 1>
<sentence 2> = <noun|NP> <VP 2>
Two sentences are considered consecutive if they are adjacent in the extractive summary. Two sentences have the same subject (i.e., <noun|NP>) if they start from the same node in the graph and the POS of that node is a noun or an NP. The merged sentence in this case is
<new sentence> = <noun|NP> <VP 1> và(and) <VP 2>

b. A sentence has a component that provides more detailed information for a noun or an NP of the previous sentence.
This sentence always starts with a phrase that refers to the previous sentence, such as “đó là(this is)” or “điều đó(this problem)”. The list of such phrases is created manually from our empirical research.
<sentence 1> = <left text 1> <noun|NP1> <right text 1>
<sentence 2> = <a phrase referring to the previous sentence> <left text 2> <NP2> <right text 2>
in which <NP2> starts with <noun|NP1> and contains a proper name in its remaining part. In this case, an edge is created from the node corresponding to <NP2> to the node corresponding to <right text 1> in the graph. If all keywords of <sentence 2> are in <NP2> only, the two sentences are merged into one:
<new sentence> = <left text 1> <NP2> <right text 1>

Example 5: Các_em(they) chỉ(only) ăn(eat) cơm(rice) với(with) muối(salt). Đó(this) là(is) tình_cảnh(situation) cuộc_sống(life) của(of) các_em(they) học_sinh(pupil) Trường(school) Mường_Lý(MuongLy).

The graph representation for Example 5 is shown in Fig. 5. The second sentence in Example 5 starts with the phrase “đó là(this is)” and contains the NP “các_em(they) học_sinh(pupil) Trường(school) Mường_Lý(MuongLy)”, which is a detailed description of the noun “Các_em(they)” in the previous sentence. Since all keywords of the second sentence are in the NP “các_em(they) học_sinh(pupil) Trường(school) Mường_Lý(MuongLy)”, the two sentences in Example 5 are combined to create the new sentence “các_em(they) học_sinh(pupil) Trường(school) Mường_Lý(MuongLy) chỉ(only) ăn(eat) cơm(rice) với(with) muối(salt)”.

Figure 5. The graph representation for Example 5

c. A sentence has a component that provides more detailed information for a clause of the previous sentence.
<sentence 1> = <left text 1> <clause 1>
<sentence 2> = <left text 2> <component 2>
<component 2> in <sentence 2> starts with a phrase with a meaning similar to <clause 1> in <sentence 1>. Notice that two consecutive sentences rarely use the same words to express a meaning; synonyms are used instead. A synonym dictionary created by us is used to detect such cases. If all keywords of <sentence 2> are in <component 2>, the two sentences are merged into one:
<new sentence> = <left text 1> <component 2>

Example 6: “Mỹ(U.S.) đã(has) bày_tỏ(expressed) lo_ngại(concern) về(about) mối đe_dọa(threat) xâm_nhập(intrusion) mạng(Internet) ngày_càng(day by day) gia_tăng(increasing)”, ông(Mr.) Hagel phát_biểu(said). Điều(the problem) đáng(worth) chú_ý(attention) là(is) ông(Mr.) Hagel đã(has) đưa(issue) ra tuyên_bố(statement) ngay trước mặt(in front of) các đại_diện(representatives) của(of) chính_phủ(government) Trung_Quốc(Chinese) tại(in) Đối_thoại(dialogue) Shangri-La.

Figure 6. The graph representation for Example 6

In Example 6, the clause “ông(Mr.) Hagel phát_biểu(said)” in the first sentence has the same meaning as “ông(Mr.) Hagel đã(has) đưa(issue) ra tuyên_bố(statement)” in the second sentence. Therefore, these two sentences are combined to create a new sentence:
“Mỹ(U.S.) đã(has) bày_tỏ(expressed) lo_ngại(concern) về(about) mối đe_dọa(threat) xâm_nhập(intrusion) mạng(Internet) ngày_càng(day by day) gia_tăng(increasing)”, ông(Mr.) Hagel đã(has) đưa(issue) ra tuyên_bố(statement) ngay trước mặt(in front of) các đại_diện(representatives) của(of) chính_phủ(government) Trung_Quốc(Chinese) tại(in) Đối_thoại(dialogue) Shangri-La.

The strategy of sentence combination is language independent.
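Merge case (a) of Section IV can be illustrated with a minimal Python sketch. It is not the authors' implementation: it assumes each sentence has already been split into a subject (<noun|NP>) and a predicate (<VP>), and that coreference resolution has already unified subjects that map to the same graph node. The names Sentence, merge_same_subject, and combine, and the max_words threshold that stands in for "short", are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Sentence:
    subject: str    # leading <noun|NP>, assumed already unified by coreference resolution
    predicate: str  # the remaining <VP>

    def word_count(self) -> int:
        # Underscore-joined tokens (e.g. "biểu_hiện") count as one word.
        return len(f"{self.subject} {self.predicate}".split())


def merge_same_subject(first: Sentence, second: Sentence,
                       max_words: int = 12, conj: str = "và") -> Optional[Sentence]:
    """Return <noun|NP> <VP 1> và <VP 2>, or None when case (a) does not apply."""
    same_subject = first.subject == second.subject
    both_short = first.word_count() <= max_words and second.word_count() <= max_words
    if not (same_subject and both_short):
        return None
    return Sentence(first.subject, f"{first.predicate} {conj} {second.predicate}")


def combine(sentences: List[Sentence], max_words: int = 12) -> List[Sentence]:
    """Merge adjacent pairs left to right; a merged sentence is not merged again."""
    result: List[Sentence] = []
    i = 0
    while i < len(sentences):
        merged = None
        if i + 1 < len(sentences):
            merged = merge_same_subject(sentences[i], sentences[i + 1], max_words)
        if merged is not None:
            result.append(merged)
            i += 2
        else:
            result.append(sentences[i])
            i += 1
    return result


# Illustrative input assembled from words of Example 1; not taken from the authors' corpus.
s1 = Sentence("Bách", "uống thuốc")
s2 = Sentence("Bách", "có biểu_hiện đỏ môi")
print(combine([s1, s2])[0])
```

Comparing subject strings directly is a simplification of "start from the same node in the graph"; in a graph-based version the comparison would be between node identities, so that coreferent expressions such as “Vũ_Dư” and “bệnh_nhân” count as the same subject.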
V. EXPERIMENTAL RESULTS AND DISCUSSION

As far as we know, there is no abstractive summarization corpus for Vietnamese. Therefore, to carry out experiments with the summarizing system, we had to create a corpus ourselves. Our corpus consists of 50 documents collected from several Vietnamese newspaper websites (e.g., Dantri, VnExpress, etc.) and belonging to two categories: economy and culture. The lengths of the documents vary from 300 words to 1000 words. Each document has 22 sentences on average. The abstractive summaries were created manually (one summary per document), with approximately 100 words in length.

The input of our abstractive summarizer is the output of our extractive one, which generates summaries of approximately 120 words in length. The output of our abstractive summarizer contains 100 words on average. Among the 433 sentences generated by our abstractive summarizer, 95% are syntactically correct; 72% of those sentences are complete in meaning, with unimportant parts at the end of the sentences removed. Most cases of incomplete sentences are due to the process of completing the end of a new sentence in the sentence reduction phase. The reasons for this problem are:
• When an elaborative clause is situated between the main NP and the main VP of a sentence, the system misrecognizes the VP of the clause as the main VP of the sentence.
• The basic phrases of a sentence are detected incorrectly by the Vietnamese chunker, whereas information about basic phrases is the key point in completing the end of a new sentence.

The abstractive summaries generated by our system are also compared with the summaries in the corpus, using the ROUGE (Recall-Oriented Understudy for Gisting Evaluation) measurement [7]. The ROUGE measures count the number of overlapping units, such as n-grams, word sequences, and word pairs, between the computer-generated summary and the ideal summaries created by humans. In our experiments, since each document has only one summary, we only compare a candidate summary with one reference summary.

Using the ROUGE-N measure defined in [7], we get Rouge-1 and Rouge-2 values of 0.2513 and 0.1344, respectively. Since there is no other work on generating abstractive summaries using the same corpus as ours, we cannot compare our experimental results with other research. However, according to [3], the Rouge-1 and Rouge-2 values when comparing abstractive summaries created by two people are 0.3088 and 0.1069, respectively⁴. This indicates that our approach is promising in solving the text summarization task. However, since text generation in general and automatic abstractive text summarization in particular are still challenging tasks, more work should be done to improve the quality of the system.

⁴ The corpus used in [3] is different from ours.

VI. CONCLUSIONS AND FUTURE WORK

This paper has introduced an approach to abstractive text summarization, which consists of two stages: sentence reduction and sentence combination. The sentence reduction stage is based on discourse rules to remove redundant clauses at the beginning of a sentence, and on syntactic constraints to complete the end of the reduced sentence. The sentence combination stage is based on a word graph that represents relations among words, clauses, and sentences from the input text. New sentences that combine information from several sentences are generated by using the word graph. Experimental results show that our approach is promising in solving the AS task.

To improve the system, our future work includes: (i) proposing methods to improve the meaning completeness of sentences generated in the sentence reduction phase; (ii) proposing methods to further compress sentences; and (iii) investigating strategies to efficiently combine sentences in the summary.

ACKNOWLEDGMENTS

This work was supported by the Vietnam Ministry project, under Grant B2012-01-24.

REFERENCES

[1] Dijkstra, E. W. 1959. A note on two problems in connexion with graphs. Numerische Mathematik, vol. 1, pp. 269–271.
[2] Erkan, G. and Radev, D. R. 2004. LexRank: Graph-based lexical centrality as salience in text summarization. J. Artif. Int. Res., 22(1):457–479.
[3] Ganesan, K., Zhai, C., and Han, J. 2010. Opinosis: A Graph-Based Approach to Abstractive Summarization of Highly Redundant Opinions. In Proc. of Coling 2010, pages 340–348.
[4] Knight, K. and Marcu, D. 2000. Statistics-based summarization - step one: sentence compression. In Proc. of AAAI 2000.
[5] Hoang, T. P. 1980. Vietnamese grammar. Publisher of professional school.
[6] Jing, H. and McKeown, K. R. 2000. Cut and paste based text summarization. In Proc. of NAACL 2000.
[7] Lin, C. Y. 2004. ROUGE: A Package for Automatic Evaluation of Summaries. In Proc. of NTCIR Workshop 2004.
[8] Le, H. T., Abeysinghe, G., and Huyck, C. 2004. Generating Discourse Structures for Written Texts. In Proc. of COLING 2004, Switzerland.
[9] Lloret, E. and Palomar, M. 2011. Analyzing the Use of Word Graphs for Abstractive Text Summarization. In Proc. of IMMM 2011.
[10] Mann, W. C. and Thompson, S. A. 1988. Rhetorical Structure Theory: Toward a Functional Theory of Text Organization. Text, vol. 8(3), 243–281.
[11] Mihalcea, R. and Tarau, P. 2004. TextRank: Bringing order into texts. In Proc. of EMNLP-04.
[12] Marcu, D. 1997. The Rhetorical Parsing, Summarization, and Generation of Natural Language Texts. PhD Thesis, Department of Computer Science, University of Toronto.
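As a supplementary illustration of the ROUGE-N evaluation in Section V, the following minimal Python sketch computes n-gram recall of a candidate summary against a single reference, which matches the one-reference setting described there. It is a simplified sketch, not the ROUGE package of [7]: the function names are hypothetical, and whitespace tokenization is an assumption (the system described above segments Vietnamese text with vnTokenizer).

```python
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(candidate: str, reference: str, n: int = 1) -> float:
    """N-gram recall: matched reference n-grams divided by all reference n-grams."""
    cand = ngrams(candidate.split(), n)
    ref = ngrams(reference.split(), n)
    total = sum(ref.values())
    if total == 0:
        return 0.0
    # Clip each match by the number of times the n-gram occurs in the candidate.
    overlap = sum(min(count, cand[gram]) for gram, count in ref.items())
    return overlap / total


print(rouge_n("the cat sat", "the cat sat down", n=1))  # 3 of 4 unigrams matched: 0.75
```

The clipping step prevents a candidate from being rewarded for repeating a matching n-gram more often than it occurs in the reference.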
