Semantic Graph Reduction Approach for Abstractive Text Summarization

Ibrahim F. Moawad
Information Systems Dept.
Faculty of Computer and Information Sciences
Ain Shams University
Cairo, Egypt
[email protected]

Mostafa Aref
Computer Science Dept.
Faculty of Computer and Information Sciences
Ain Shams University
Cairo, Egypt
[email protected]
Abstract— One of the important Natural Language Processing applications is Text Summarization, which helps users manage the vast amount of available information by condensing documents' content and extracting the most relevant facts or topics. Text Summarization can be classified according to the type of summary: extractive and abstractive. An extractive summary is produced by identifying important sections of the text and reproducing them verbatim, while an abstractive summary aims to convey the important material in a new, generalized form. In this paper, a novel approach is presented to create an abstractive summary for a single document using a rich semantic graph reducing technique. The approach summarizes the input document by creating a rich semantic graph for the original document, reducing the generated graph, and then generating the abstractive summary from the reduced graph. Besides, a simulated case study is presented to show how the original text was minimized to fifty percent.

Keywords- Text Summarization; Abstractive Summary; Semantic Representation; Rich Semantic Graph; Semantic Graph

In this paper, a novel approach is presented to generate an abstractive summary automatically for the input text using a semantic graph reducing technique. This approach exploits a new semantic graph called the Rich Semantic Graph (RSG) [3, 4]. The RSG is an ontology-based representation developed to be used as an intermediate representation for Natural Language Processing (NLP) applications. The new approach consists of three phases: creating a rich semantic graph for the source document, reducing the generated rich semantic graph to a more abstracted graph, and finally generating the abstractive summary from the abstracted rich semantic graph.

The paper is organized as follows. A brief background and related work are presented in Section II. Section III presents the proposed approach architecture, while Section IV describes and explains the approach phases. To illustrate how the approach works, and what its expected utility is, a simulated case study called "Graduate Students" is presented in Section V. Finally, Section VI concludes the paper.
semantic graph and then using the document and graph features … sentences) that have been retrieved from the text. For each generated triplet, they assign a set of features comprising linguistic, document, and graph attributes. They then train a linear Support Vector Machine classifier to determine the triplets that are useful for extracting the sentences which later compose the summary.

In their approach, Leskovec et al. aimed to create an extractive summary from the source document only, and hence they did not consider the abstractive summary. Besides, they use the semantic graph in its ordinary form to represent the input document; therefore, the generated graph will be very large, because the graph granularity level is high.

In this paper, a novel approach is presented to create an abstractive text summary automatically. This approach summarizes the input document by creating a rich semantic graph for the original document. The generated rich semantic graph enriches the traditional semantic graph by associating attributes with the graph nodes. After that, the approach reduces the generated rich semantic graph to a more abstracted graph, and then it generates the abstractive summary from the abstracted rich semantic graph.

[Figure: the proposed approach architecture, comprising Rich Semantic Graph Creation, Rich Semantic Graph Reduction (using the Domain Ontology and WordNet), and Summarized Text Generation.]
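The three-phase flow described above (Creation, Reduction, Generation) can be sketched as a toy pipeline. The triple-based graph, the single merge rule, and all names below are illustrative simplifications for exposition, not the authors' implementation:

```python
# Toy sketch of the three-phase pipeline (Creation -> Reduction -> Generation).
# The triple-based graph, the merge rule, and all names below are illustrative
# simplifications, not the authors' implementation.

def create_rsg(sentences):
    """Phase 1: represent each sentence as a (subject, verb, object) node triple."""
    return [tuple(s) for s in sentences]

def reduce_rsg(graph):
    """Phase 2: merge triples that share the same verb and object
    (a stand-in for the paper's heuristic reduction rules)."""
    merged = {}
    for subj, verb, obj in graph:
        merged.setdefault((verb, obj), []).append(subj)
    return [(" and ".join(subjs), verb, obj)
            for (verb, obj), subjs in merged.items()]

def generate_summary(graph):
    """Phase 3: realize each reduced triple as a surface sentence."""
    return " ".join(f"{s} {v} {o}." for s, v, o in graph)

rsg = create_rsg([("Angle Chris", "is", "a graduate student"),
                  ("John Michel", "is", "a graduate student")])
summary = generate_summary(reduce_rsg(rsg))
# summary: "Angle Chris and John Michel is a graduate student."
# (no subject-verb agreement fixing in this toy; in the approach itself,
#  surface realization handles grammar)
```

The point of the sketch is only the shape of the data flow: two sentence-level structures collapse into one abstracted structure, which is then realized as text.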
[Figure: a sample news passage about the U.N. trade embargo on Iraq, shown as example input text.]
module is responsible for accepting the input text and converting it to preprocessed sentences. The Rich Semantic Sub-graphs Generation module is responsible for transforming each preprocessed sentence into a set of ranked rich semantic sub-graphs. Finally, the Rich Semantic Graph Generation module is responsible for generating a set of ranked RSGs from the ranked semantic sub-graphs of the input. These RSGs represent different semantic representations of the whole document, of which the top-ranked RSG is considered.

1) The Preprocessing module: It consists of four main processes: named entity recognition, morphological and syntactic analysis, co-reference resolution, and pronominal resolution. The named entity recognition process locates atomic elements and assigns them to predefined categories such as person names, organizations, etc. In morphological analysis, each word is divided into morphemes and its grammatical categories are determined; the syntactic analysis parses the whole sentence to describe each word's syntactic function and build the parse tree; and typed dependencies express syntactic knowledge in terms of direct relationships between words. Finally, the co-reference and pronominal resolution processes identify co-referent named entities and resolve pronominal references in the whole input text. The preprocessing module has two main objectives: resolving the syntactic ambiguity and then retrieving both the set of tags (syntactic and morphological) and the typed dependency relations between words for the input text. For example, Fig. 3 shows the syntactic and morphological tags, and the typed dependency relations, for the "Sara is a graduate student." sentence using a syntactic analyzer tool called "lingsoft" [15] and a parser tool built by the Stanford University Natural Language Group [16].

2) The Rich Semantic Sub-graphs Generation module: The main objective of the Rich Semantic Sub-graphs Generation module is to generate multiple rich semantic sub-graphs for each input preprocessed sentence. Each preprocessed sentence is composed of a sequence of words: Si = [Wi1, Wi2, … Win], where Wij is word j belonging to sentence i. Each word is represented as a triple Wij = [St, T, D], where St represents the word stem, T represents the set of tags (morphological and syntactic), and D represents the set of typed dependency relations. This module includes three processes: Word Senses Instantiation, Concepts Validation, and Semantic Sentences Ranking.
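The per-word representation Wij = [St, T, D] can be sketched with a small data class. The dataclass shape is an illustrative assumption; the tag and dependency values are taken from the paper's Fig. 3 example for "Sara is a graduate student.":

```python
from dataclasses import dataclass, field

# Sketch of the per-word representation W_ij = [St, T, D]. The dataclass shape
# is an illustrative assumption; the tag and dependency values come from the
# paper's Fig. 3 example for "Sara is a graduate student."

@dataclass
class Word:
    """W_ij = [St, T, D]: word stem, tag set, and typed-dependency set."""
    stem: str
    tags: set = field(default_factory=set)   # morphological + syntactic tags
    deps: set = field(default_factory=set)   # typed dependencies involving the word

sentence = [  # S_i = [W_i1, ..., W_in]
    Word("sara",     {"N", "NOM", "SG", "@SUBJ"},     {"nsubj(student, Sara)"}),
    Word("be",       {"V", "PRES", "SG3", "VFIN"},    {"cop(student, is)"}),
    Word("a",        {"DET", "ART", "SG"},            {"det(student, a)"}),
    Word("graduate", {"A", "ABS"},                    {"amod(student, graduate)"}),
    Word("student",  {"N", "NOM", "SG", "@PCOMPL-S"}, set()),
]
```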
Morphological and syntactic tags:
• "sara" <*> <Proper> N NOM SG @SUBJ
• "be" <SV> <SVC/N> <SVC/A> V PRES SG3 VFIN @+FMAINV
• "a" <Indef> DET CENTRAL ART SG @DN>
• "graduate" A ABS @AN>
• "student" N NOM SG @PCOMPL-S

Typed dependencies:
• nsubj(student, Sara)
• cop(student, is)
• det(student, a)
• amod(student, graduate)

Figure 3. Example of syntactic and morphological tags, and typed dependency relations

• Semantic Sentences Ranking process: It aims to rank the rich semantic sub-graphs of each sentence and to threshold the highest-ranked ones. To generate a single rich semantic graph and to keep the semantic consistency of the whole sentence, the process considers only the first-ranked rich semantic sub-graph. The ranking method derives the average weight of each concept (word sense) and the average weight of the whole sentence's concepts using (1) and (2) respectively. The weight of a word concept is derived according to its usage popularity (WordNet usage popularity) [17]. In (1), n represents the WordNet usage popularity of the concept C, and N is the total number of senses of the concept's word. In (2), M represents the total number of concepts in a sentence. For example, in the "Sally is specialized in computer-science." sentence, the word "Sally" has only one concept (one sense), so its weight equals 10. The word "specialized" has three concepts (senses), whose weights equal 10, 7, and 6. The word "computer-science" has only one concept, and its weight equals 10. Based on these values, the output rank values of the sentence's rich semantic sub-graphs are 10, 9, and 8.6.

C_weight = 10 − 5 × ((n − 1) / N)        (1)

S_weight = (Σ_{m=1}^{M} C_weight_m) / M        (2)

3) The Rich Semantic Graph Generation module: Finally, the Rich Semantic Graph Generation module is responsible for generating the final rich semantic graphs of the whole input document from the highest-ranked rich semantic sub-graphs of the document sentences. The semantic sub-graphs of the input document are merged to form the final rich semantic graph.

B. The Rich Semantic Graph Reduction Phase

This phase aims to reduce the generated rich semantic graph of the original document to a more abstracted graph. In this phase, a set of heuristic rules is applied on the generated rich semantic graph to reduce it by merging, deleting, or consolidating the graph nodes. These rules exploit the WordNet semantic relations: hypernym, holonym, and entailment. Many rules can be derived based on many factors: the semantic relation, the graph node type (noun or verb), the similarity or dissimilarity between graph nodes, etc. Table I presents a set of heuristic rule examples that can be applied on the graph nodes of two simple sentences: Sen1 = [SN1, MV1, ON1] and Sen2 = [SN2, MV2, ON2]. Each sentence is composed of three nodes: a Subject Noun (SN) node, a Main Verb (MV) node, and an Object Noun (ON) node. For example, in rule 1, both main verbs (MV1 and MV2) are merged and both sentence objects (ON1 and ON2) are merged if the two sentence subjects are instances of the same noun (N), both sentence verbs are similar, and both sentence objects are similar.

TABLE I. REDUCTION HEURISTIC RULE EXAMPLES
[The rule conditions and actions in the table body are not recoverable from the extraction.]

C. The Summarized Text Generation Phase

This phase aims to generate the abstractive summary from the reduced Rich Semantic Graph (RSG) [18]. To achieve its task, the phase accesses the domain ontology, which contains the information needed, in the same domain as the RSG, to generate the final texts.
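The sentence-level ranking of equation (2) can be sketched directly: a sub-graph's weight is the average of its concept weights. The concept weights below are the ones reported in the paper's "Sally is specialized in computer-science." example; equation (1), which maps WordNet usage popularity to a concept weight, is applied upstream, so here the resulting weights are taken as given:

```python
# Sentence ranking per equation (2): a sub-graph's weight is the average of
# its concept weights. Concept weights are taken from the paper's
# "Sally is specialized in computer-science." example.

def sentence_weight(concept_weights):
    """S_weight = (sum of C_weight over M concepts) / M."""
    return sum(concept_weights) / len(concept_weights)

sally = 10.0                     # "Sally": a single sense
computer_science = 10.0          # "computer-science": a single sense
specialized = [10.0, 7.0, 6.0]   # "specialized": three senses

# One candidate sub-graph per sense of "specialized":
ranks = [sentence_weight([sally, w, computer_science]) for w in specialized]
# ranks[0] == 10.0, ranks[1] == 9.0, ranks[2] ≈ 8.67 (reported as 8.6 in the paper)
```

Only the top-ranked sub-graph (here the first, with weight 10) is kept for the sentence, which preserves semantic consistency across the document graph.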
Besides, the WordNet ontology is accessed to generate multiple texts according to the word synonyms. The generated multiple texts are evaluated and ranked, and the top-ranked text is considered. The text evaluation is performed according to two criteria: the most frequently used words and the discourse relations between sentences.

Fig. 4 shows the main modules composing the Summarized Text Generation phase, namely the Text Planning, the Sentence Planning, the Surface Realization, and the Evaluation modules. Firstly, the Text Planning module selects the appropriate content material to be expressed in the final text. Secondly, the Sentence Planning module specifies the sentence boundaries, and generates and orders intermediate paragraphs. Thirdly, the Surface Realization module generates grammatically corrected paragraphs. Finally, because multiple texts are generated, the Evaluation module evaluates the final texts based on the most frequently used words (using the WordNet ontology) and the relations between sentences.

1) The Text Planning module: It decides what information should be included in the generated text. In our approach, to preserve all the semantic information embedded in the input semantic representation (the Rich Semantic Graph), all graph objects (noun and verb objects) are passed to the Sentence Planning module.

2) The Sentence Planning module: It improves the fluency and understandability of the text. To achieve this objective, the words of the text should be related to each other, the clauses should exhibit no unintentional redundancy, and the different sentences with the same subject should be aggregated. The Sentence Planning module receives noun and verb objects and generates semi-paragraphs. The sentence planning consists of four main processes: Lexicalization, Discourse Structuring, Aggregation, and Referring Expression.

• Lexicalization Process: In this process, for each verb/noun object, its synonyms are selected by accessing the WordNet ontology to generate the target content. To select the most appropriate synonyms, a weight W is assigned to every synonym. This weight is calculated using (3), where E is the existence probability of the synonym in the input rich semantic graph, NR represents the synonym's WordNet rank, RT represents the total value of all synonym ranks, NGS represents the WordNet group-by-similarity of the synonym, and TG represents the total number of groups by similarity for all synonyms. According to experimental tests, the best weight values start from 8, so this value has been considered as a threshold: only word synonyms with a weight greater than or equal to 8 are selected.

W = ((E + (1 − NR/RT) + NGS/TG) / 3) × 10        (3)
[Figure 4: the Summarized Text Generation phase. Text Planning → (Set of Objects) → Sentence Planning → (Semi-Paragraphs) → Surface Realization (using the Domain Ontology and WordNet) → (Paragraphs) → Evaluation.]

• Discourse Structuring Process: It builds a suitable structure to contain the selected object synonyms in the form of pseudo-sentences (the first form of the generated sentences). Initially, the noun objects are sorted in descending order according to their number of attributes. For each noun object, a pseudo-sentence is composed for each attribute, and a pseudo-sentence is composed for each verb related to that object.

• Aggregation Process: It decides how pseudo-sentences are combined into semi-paragraphs. Two processes are applied: subject grouping and predicate grouping [19]. The subject grouping process is responsible for grouping clauses with common elements that share the same subject, while the predicate grouping process is responsible for grouping clauses with the same predicate. Using the domain ontology, the discourse relations are retrieved. The module uses the PDTB (Penn Discourse TreeBank) relations [20]. The discourse role of an object is defined in the input semantic graph as the discourse relation type and the argument span in which the object is located. Then, the relations retrieved from the domain ontology connect the pseudo-sentences with each other. The details of this process are very application dependent.
• Referring Expression Process: … into a list, and the process starts replacing the subject with the appropriate pronoun after leaving the first pseudo-sentence subject unchanged. The process restricts the replacement of the subject to every three pseudo-sentences, and then it starts again.

3) The Surface Realization module: This module aims to transform the enhanced semi-paragraphs into paragraphs by correcting them grammatically (inflecting words for tense, etc.) and adding the required punctuation (capitalization, adding semicolons, etc.). In the proposed approach, the techniques of SimpleNLG (simple natural language generation) [21] can be exploited to achieve these objectives.

4) The Evaluation module: The main objective of this module is to evaluate and then rank the paragraphs according to two factors: the coherence between paragraph sentences, and the most frequently used paragraph word synonyms. According to experimental tests, we have found that the coherence measure generates very close results, so the most frequently used paragraph word synonyms are used as an additional evaluation factor. Firstly, text coherence evaluation is applied to assess whether the paragraphs are coherent or not [22]. Therefore, each paragraph is evaluated and ranked according to the number of coherence relations between its sentences. Secondly, the most frequently used paragraph word synonyms are aggregated by accessing the WordNet rank. Finally, the final paragraphs can be sorted according to the coherence evaluation rank and then by the most frequently used paragraph word synonyms rank.

V. GRADUATE STUDENTS CASE STUDY

To show how the proposed approach works, a simulated case study called "Graduate Students" is presented in detail. Fig. 5 shows the input text, which consists of a single paragraph talking about two graduate students (Angle Chris and John Michel). It consists of 7 sentences and contains 53 words. After applying the Preprocessing, Rich Semantic Sub-graphs Generation, and Rich Semantic Graph Generation modules of the Rich Semantic Graph Creation phase, a rich semantic graph is created as shown in Fig. 6. The rich semantic graph nodes represent the instantiated objects of the domain ontology classes for the input text nouns and verbs. It contains 8 noun nodes representing the sentence subjects and objects, and 5 verb nodes (with gray background color) representing the sentence main verbs. For example, the "Mrs. Chris is specialized in Machine learning field." sentence is represented with the "Student 1", "Specialize 1", and "Field 1" nodes.

By applying the reduction heuristic rules on the rich semantic graph generated from the input text, the reduced graph shown in Fig. 7 is obtained. Initially, rule number 1 was fired and applied on the original semantic graph, where both "Student 1" and "Student 2" are instances of the Student noun class, both "Publish 1" and "Publish 2" are similar, and both "Research 1" and "Research 2" are similar. Therefore, both "Publish 1" and "Publish 2" were merged into "Publish 1", and both "Research 1" and "Research 2" were merged into "Research 1". After that, rule number 4 was fired and applied, where both "Student 1" and "Student 2" are instances of the Student noun class, both "Specialize 1" and "Specialize 2" verbs are similar, and both "Field 1" and "Field 2" objects are instances of subclasses of the same super-class. Therefore, both "Specialize 1" and "Specialize 2" verbs were merged into "Specialize 1", and both "Field 1" and "Field 2" objects were replaced and merged into "Field 3", which has the more abstracted value ("Artificial Intelligence").

[Figure 5 input text:]
"Angle Chris is a graduate student. Mrs. Chris is specialized in Machine learning field. John Michel is a graduate student. He is specialized in Intelligent Agents field. During his study, Mr. Michel passed the preparatory courses. Angle Chris published two papers in international conferences. Also, John Michel published two papers in international conferences."

Figure 5. The original text of the graduate students example

[Figure 6 nodes:]
• Student 1 (Name: Angle Chris; Level: Graduate; Type: Singular)
• Specialize 1 (Tense: Present; Agent: Student 1; Object: Field 1)
• Field 1 (Value: Machine learning; Type: Singular)
• Publish 1 (Tense: Past; Agent: Student 1; Object: Research 1; Location: international conferences)
• Research 1 (Value: Scientific Paper; Adjective: Two; Type: Plural)
• Student 2 (Name: John Michel; Level: Graduate; Type: Singular)
• Specialize 2 (Tense: Present; Agent: Student 2; Object: Field 2)
• Field 2 (Value: Intelligent Agents; Type: Singular)
• Pass 1 (Tense: Past; Agent: Student 2; Object: Course 2; Time: during study)
• Course 2 (Value: Preparatory; Type: Plural)
• Publish 2 (Tense: Past; Agent: Student 2; Object: Research 2; Location: international conference)
• Research 2 (Value: Scientific Paper; Adjective: Two; Type: Plural)

Figure 6. The rich semantic graph of the graduate students original text
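The rule firing described in this case study, where similar verbs are merged and their objects are generalized to a shared super-class, can be sketched as follows. The mini-ontology and the triple shapes below are illustrative assumptions, not the paper's data structures:

```python
# Toy sketch of reduction rule 4 as fired in the case study: when two subjects
# are instances of the same noun class, the verbs are similar, and the objects
# fall under a common super-class, merge the verbs and replace the objects with
# the abstracted value. The mini-ontology below is an illustrative assumption.

SUPER_CLASS = {
    "Machine learning": "Artificial Intelligence",
    "Intelligent Agents": "Artificial Intelligence",
}

def rule4(sen1, sen2, noun_class_of):
    (sn1, mv1, on1), (sn2, mv2, on2) = sen1, sen2
    sup1, sup2 = SUPER_CLASS.get(on1), SUPER_CLASS.get(on2)
    if (noun_class_of[sn1] == noun_class_of[sn2]    # same noun class
            and mv1 == mv2                          # similar verbs
            and sup1 is not None and sup1 == sup2): # common super-class
        return (sn1 + ", " + sn2, mv1, sup1)        # merged, abstracted triple
    return None

noun_class = {"Student 1": "Student", "Student 2": "Student"}
merged = rule4(("Student 1", "Specialize", "Machine learning"),
               ("Student 2", "Specialize", "Intelligent Agents"),
               noun_class)
# merged -> ("Student 1, Student 2", "Specialize", "Artificial Intelligence")
```

Rule 1 from the case study (merging similar Publish/Research nodes) has the same shape, except that the objects are required to be similar rather than siblings under a super-class.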
Finally, the Text Planning, Sentence Planning, Surface Realization, and Evaluation modules of the Summarized Text Generation phase are applied on the reduced Rich Semantic Graph to generate the abstractive summary shown in Fig. 8. As shown in the summary, the "Angle Chris and John Michel are graduate students." sentence was composed from both the "Student 1" and "Student 2" nodes, and the "They are specialized in Artificial Intelligence field." sentence was composed from the "Student 1", "Student 2", "Specialize 1", and "Field 3" nodes. The final summary consists of 4 sentences and contains 29 words. The final abstractive text represents about 50% of the original text.

[Figure 7 nodes:]
• Student 1 (Name: Angle Chris; Level: Graduate; Type: Singular)
• Specialize 1 (Tense: Present; Agent: Student 1, 2; Object: Field 3)
• Field 3 (Value: Artificial Intelligence; Type: Singular)
• Publish 1 (Tense: Past; Agent: Student 1, 2; Object: Research 1; Location: international conferences)
• Research 1 (Value: Scientific Paper; Adjective: Two; Type: Plural)
• Student 2 (Name: John Michel; Level: Graduate; Type: Singular)
• Pass 1 (Tense: Past; Agent: Student 2; Object: Course 2; Time: during study)
• Course 2 (Value: Preparatory; Type: Plural)

Figure 7. The reduced rich semantic graph

[Figure 8 summary text:]
"Angle Chris and John Michel are graduate students. They are specialized in Artificial Intelligence field. They published two papers in international conferences. During study, John Michel passed Preparatory courses."

Figure 8. The graduate students summarized text

VI. CONCLUSION

In conclusion, a novel approach to create an abstractive summary for a single document using a semantic graph reducing technique was presented in this paper. The approach summarizes the source document by creating a semantic graph, called the Rich Semantic Graph, for the original document, reducing the generated semantic graph to a more abstracted graph, and generating the abstractive summary from the reduced graph. A case study showed that the proposed approach succeeded in minimizing the original text to fifty percent. In future work, we are going to develop a prototype to conduct several more case studies using documents of different sizes, and hence assess the results of our work properly.

REFERENCES
[1] E. Lloret, M. Palomar, "Text summarisation in progress: a literature review", Artificial Intelligence Review, Vol. 37, No. 1, pp. 1-41, 2012.
[2] D. Das, A. Martins, "A Survey on Automatic Text Summarization", unpublished literature survey for Language and Statistics II, Carnegie Mellon University, 2007.
[3] M. Aref, I. Moawad, S. Ibrahim, "Rich Semantic Graph Generation System Prototype", The Tenth Conference on Language Engineering, Cairo, Egypt, 2010.
[4] I. Moawad, M. Aref, S. Ibrahim, "Ontology-based Model for Generating Text Semantic Representation", International Journal of Intelligent Computing and Information Sciences (IJICIS), Vol. 11, No. 1, pp. 117-128, January 2011.
[5] D. Radev, E. Hovy, K. McKeown, "Introduction to the Special Issue on Summarization", Computational Linguistics, Vol. 28, No. 4, pp. 399-408, 2002.
[6] K. Svore, L. Vanderwende, C. Burges, "Enhancing single-document summarization by combining RankNet and third-party sources", in Proceedings of EMNLP-CoNLL, pp. 448-457, 2007.
[7] D. Evans, K. McKeown, J. Klavans, "Similarity-based Multilingual Multi-Document Summarization", Technical Report CUCS-014-05, Department of Computer Science, Columbia University, April 2005.
[8] A. Stergos, K. Vangelis, S. Panagiotis, "Summarization from medical documents: a survey", Artificial Intelligence in Medicine, Vol. 33, No. 2, pp. 157-177, 2005.
[9] J. Leskovec, M. Grobelnik, N. Milic-Frayling, "Learning Sub-structures of Document Semantic Graphs for Document Summarization", in KDD 2004 Workshop on Link Analysis, 2004.
[10] J. Leskovec, M. Grobelnik, N. Milic-Frayling, "Learning Semantic Graph Mapping for Document Summarization", 2000.
[11] J. Leskovec, M. Grobelnik, N. Milic-Frayling, "Extracting Summary Sentences Based on the Document Semantic Graph", Microsoft Research, 2005.
[12] D. Rusu, B. Fortuna, M. Grobelnik, D. Mladenić, "Semantic Graphs Derived From Triplets With Application In Document Summarization", International Journal of Computing and Informatics, Vol. 33, No. 3, 2009.
[13] C. Fellbaum, "WordNet: An Electronic Lexical Database", MIT Press, 1998.
[14] G. Miller, R. Beckwith, C. Fellbaum, D. Gross, K. Miller, "Five Papers on WordNet", Cognitive Science Laboratory, Princeton University, Princeton, 1990.
[15] ENGCG: Constraint Grammar Parser of English, http://www2.lingsoft.fi/cgi-bin/engcg, accessed June 15, 2012.
[16] Stanford Parser, http://nlp.stanford.edu:8080/parser/index.jsp, accessed June 15, 2012.
[17] A. Sharaf, "An Object-Oriented Model for Semantic Analysis of Natural Languages", Master's Thesis, Information and Computer Science Dept., King Fahd University of Petroleum and Minerals, Saudi Arabia, January 2001.
[18] I. Fathy, D. Fadl, M. Aref, "Rich Semantic Representation Based Approach for Text Generation", The 8th International Conference on Informatics and Systems (INFOS 2012), Egypt, 2012.
[19] H. Dalianis, E. Hovy, "Aggregation in Natural Language Generation", in Proceedings of the 4th European Workshop on Natural Language Generation (EWNLG-93), Pisa, Italy, 1993.
[20] R. Prasad, N. Dinesh, A. Lee, E. Miltsakaki, L. Robaldo, A. Joshi, B. Webber, "The Penn Discourse TreeBank 2.0", in Proceedings of the 6th International Conference on Language Resources and Evaluation (LREC 2008), Morocco, 2008.
[21] A. Gangemi, R. Navigli, P. Velardi, "The OntoWordNet Project: Extension and Axiomatization of Conceptual Relations in WordNet", in Proc. of the International Conference on Ontologies, Databases and Applications of Semantics (ODBASE 2003), Catania, Italy, pp. 820-838, 2003.
[22] Z. Lin, H. Ng, M. Kan, "Automatically Evaluating Text Coherence Using Discourse Relations", in Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies (ACL-HLT 2011), Portland, Oregon, USA, 2011.