TABLE I
SEVEN FEATURES OF A DOCUMENT

Feature   Description
f1        Paragraph follows title
f2        Paragraph location in document
f3        Sentence location in paragraph
f4        First sentence in paragraph
f5        Sentence length
f6        Number of thematic words in the sentence
f7        Number of title words in the sentence

Features f1 to f4 represent the location of the sentence within the document, or within its paragraph. It is expected that in structured documents such as news articles, these features would contribute to selecting summary sentences. Brandow et al. in [11] have shown that summaries consisting of leading sentences outperform most other methods in this domain, and Baxendale in [4] demonstrated that sentences located at the beginning and end of paragraphs are likely to be good summary sentences.

Feature f5, sentence length, is useful for filtering out short sentences such as datelines and author names commonly found in news articles. We also anticipate that short sentences are unlikely to be included in summaries [3].

Feature f6, the number of thematic words, indicates how many thematic words appear in the sentence, relative to the maximum possible. It is obtained as follows: from each document, we remove all prepositions and reduce the remaining words to their morphological roots [12]. The resultant content words in the document are counted for occurrence, and the top 10 most frequent content words are considered thematic words. This feature determines the ratio of thematic words to content words in a sentence. It is expected to be important because terms that occur frequently in a document are probably related to its topic [6]; therefore, we expect a high occurrence of thematic words in salient sentences.

Finally, feature f7 indicates the number of title words in the sentence, relative to the maximum possible. It is obtained by counting the number of matches between the content words in a sentence and the words in the title, and normalizing this value by the maximum number of matches. This feature is expected to be important because the salience of a sentence may be affected by the number of its words that also appear in the title.

These features may be changed, or new features may be added. The selection of features plays an important role in determining the type of sentences that will be selected as part of the summary and, therefore, would influence the performance of the neural network.
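To make the frequency-based features concrete, the sketch below computes f5, f6, and f7 for a single sentence. It is a minimal illustration rather than the original implementation: tokenization and the reduction of words to their morphological roots (for example, with a Porter stemmer [12]) are assumed to have been done upstream, and helper names such as thematic_words, sentence_features, and max_title_matches are ours, not the paper's.

```python
from collections import Counter

def thematic_words(document_content_words, top_n=10):
    """Top-N most frequent content words of a document (assumed already stemmed)."""
    return {word for word, _ in Counter(document_content_words).most_common(top_n)}

def sentence_features(sentence_words, title_words, thematic, max_title_matches):
    """f5, f6, and f7 for one sentence, following the definitions above.

    f6 is the ratio of thematic words to content words in the sentence;
    f7 is the number of title-word matches, normalized by the maximum
    number of matches (passed in here as max_title_matches).
    """
    f5 = len(sentence_words)                                            # sentence length
    f6 = sum(w in thematic for w in sentence_words) / max(len(sentence_words), 1)
    f7 = sum(w in title_words for w in sentence_words) / max(max_title_matches, 1)
    return f5, f6, f7
```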
IV. TEXT SUMMARIZATION PROCESS

There are three phases in our process: neural network training, feature fusion, and sentence selection. The first step involves training a neural network to recognize the type of sentences that should be included in the summary. The second step, feature fusion, prunes the neural network and collapses hidden layer unit activations into discrete values with identified frequencies. This step generalizes the important features that must exist in the summary sentences by fusing the features and finding trends in the summary sentences. The third step, sentence selection, uses the modified neural network to filter the text and to select only the highly ranked sentences. This step controls the selection of the summary sentences in terms of their importance. These three steps are explained in detail in the next three sections.

A. Phase I: Neural Network Training

The first phase of the process involves training the neural network to learn the types of sentences that should be included in the summary. This is accomplished by training the network with sentences in several test paragraphs, where each sentence is identified as to whether it should be included in the summary or not. This labeling is done by a human reader. The neural network learns the patterns inherent in sentences that should be included in the summary and those that should not be. We use a three-layered feedforward neural network, which has been proven to be a universal function approximator. It can discover the patterns and approximate the inherent function of any data to an accuracy of 100% as long as there are no contradictions in the data set.

We use a gradient method for training the network, where the energy function is a combination of an error function and a penalty function. The goal of training is to search for the global minima of the energy function. The total energy function to be minimized during the training process is:

e(w, v) = E(w, v) + P(w, v)    (1)

The error function to be minimized is the mean squared error. The addition of the penalty function drives the associated weights of unnecessary connections to very small values while strengthening the rest of the connections. Therefore, the unnecessary connections and neurons can be pruned without affecting the performance of the network. The penalty function is defined as:

P(w, v) = ρ_decay (P1(w, v) + P2(w, v))    (3)
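As a rough sketch of this training objective, the code below evaluates an energy of the form of equation (1) for a three-layered feedforward network with a single output neuron. The mean squared error term follows the text; the penalty term here is only an assumption (a simple sum of squared weights standing in for the unspecified P1 and P2), and the names forward, energy, and rho_decay are ours.

```python
import numpy as np

def forward(x, w, v):
    """Three-layered feedforward network: input -> hidden (sigmoid) -> one sigmoid output."""
    h = 1.0 / (1.0 + np.exp(-(x @ w)))   # hidden-layer activations
    o = 1.0 / (1.0 + np.exp(-(h @ v)))   # output near 1 = summary sentence, near 0 = unimportant
    return h, o

def energy(x, targets, w, v, rho_decay=1e-4):
    """Total energy e(w, v) = E(w, v) + P(w, v) as in equation (1).

    E(w, v) is the mean squared error over the training sentences; the penalty
    P(w, v) is modeled as a weight-decay term, a stand-in for P1 + P2.
    """
    _, o = forward(x, w, v)
    E = np.mean((targets - o.ravel()) ** 2)
    P = rho_decay * (np.sum(w ** 2) + np.sum(v ** 2))
    return E + P
```

Gradient descent on an energy of this kind plays the role of the training procedure described above; it is the penalty term that pushes unnecessary weights toward zero so that they can be pruned in Phase II.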
B. Phase II: Feature Fusion

Once the network has learned the features that must exist in the summary sentences, we need to discover the trends and relationships among the features that are inherent in the majority of sentences. This is accomplished by the feature fusion phase, which consists of two steps: 1) eliminating uncommon features; and 2) collapsing the effects of common features.

1) Eliminating Uncommon Features

After the training phase, the connections having very small weights can be pruned without affecting the performance of the network. For each input-to-hidden layer connection w_ij, if max_k |v_jk w_ij| < 0.1, remove w_ij; and for each hidden-to-output layer connection v_jk, if |v_jk| < 0.1, remove v_jk.
As a result, any input or hidden layer neuron having no emanating connections can be safely removed from the network. In addition, any hidden layer neuron having no abutting connections can be removed. This corresponds to eliminating uncommon features from the network.

Once the pruning step is complete, the network is trained with the same data set as in phase one to ensure that the recall accuracy of the network has not diminished significantly. If the recall accuracy of the network drops by more than 2%, the pruned connections and neurons are restored and a stepwise pruning approach is pursued. In the stepwise pruning approach, the incoming and outgoing connections of the hidden layer neurons are pruned and the network is re-trained and tested for recall accuracy, one hidden layer neuron at a time.
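A minimal sketch of this pruning rule is given below, assuming w is the input-to-hidden weight matrix and v the hidden-to-output weight matrix of the trained network. The 0.1 threshold is the one quoted above, while the function name and the use of NumPy are ours; the retraining step and the 2% fallback to stepwise pruning are omitted.

```python
import numpy as np

def prune(w, v, threshold=0.1):
    """Remove weak connections, then drop hidden neurons left unconnected.

    w: input-to-hidden weights, shape (inputs, hidden)
    v: hidden-to-output weights, shape (hidden, outputs)
    """
    w = w.copy()
    v = v.copy()
    # Remove w_ij if max_k |v_jk * w_ij| < threshold.
    weak = np.max(np.abs(w[:, :, None] * v[None, :, :]), axis=2) < threshold
    w[weak] = 0.0
    # Remove v_jk if |v_jk| < threshold.
    v[np.abs(v) < threshold] = 0.0
    # A hidden neuron with no surviving incoming or outgoing connections can be dropped.
    keep = (np.abs(w).sum(axis=0) > 0) & (np.abs(v).sum(axis=1) > 0)
    return w[:, keep], v[keep, :]
```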
2) Collapsing the Effects of Common Features

After pruning the network, the hidden layer activation values for each hidden layer neuron are clustered utilizing an adaptive clustering technique, where G_c is the centroid of cluster c, a is the activation value being clustered, and U is the set of existing clusters:

min ( Dist(G_c, a) < r_c )  ∀ c ∈ U    (4)

The clustering algorithm is adaptive; that is, the clusters are created dynamically as activation values are added into the cluster space. Therefore, the number of clusters and the number of activation values in each cluster are not known a priori. The centroid of each cluster represents the mean of the activation values in the cluster and can be used as the representative value of the cluster, while the frequency of each cluster represents the number of activation values in that cluster. By using the centroids of the clusters, each hidden layer neuron has a minimal set of activations, which helps with producing generalized outputs at the output layer. This corresponds to collapsing the effects of common features. In the sentence selection phase, the activation value of each hidden layer neuron is replaced by the centroid of the cluster to which the activation value belongs. The performance of the network is not compromised as long as the cluster radius r_c is less than an upper bound determined by the error tolerance δ, which is usually set to a value less than 0.01.

Since dynamic clustering is order sensitive, the activation values are re-clustered. The radius of the new clusters is set to one-half of the original radius. The benefits of re-clustering are twofold: 1) due to the order sensitivity of dynamic clustering, some of the activation values may be misclassified, and re-clustering alleviates this deficiency by classifying the activation values in appropriate clusters; 2) re-clustering with one-half of the original radius eliminates any possible overlaps among clusters. The combination of these two steps corresponds to generalizing the effects of sentence features. Each cluster is identified by its centroid and frequency. The feature fusion phase provides control parameters, which can be used for sentence ranking.
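The clustering step can be sketched roughly as follows: each activation value joins the nearest existing cluster if that cluster's centroid lies within the radius, and otherwise starts a new cluster. This is a minimal reading of equation (4), assuming a one-dimensional activation space per hidden neuron; the class and function names are ours, and the re-clustering pass at half the radius is shown only as a second call over the same values.

```python
class Cluster:
    """One activation cluster, identified by its centroid and its frequency."""
    def __init__(self, value):
        self.centroid = value
        self.frequency = 1

    def add(self, value):
        # Keep the centroid equal to the running mean of the cluster's members.
        self.frequency += 1
        self.centroid += (value - self.centroid) / self.frequency

def cluster_activations(values, radius):
    """Dynamically cluster one hidden neuron's activation values (cf. equation (4))."""
    clusters = []
    for a in values:
        nearest = min(clusters, key=lambda c: abs(c.centroid - a), default=None)
        if nearest is not None and abs(nearest.centroid - a) < radius:
            nearest.add(a)
        else:
            clusters.append(Cluster(a))
    return clusters

def recluster(values, radius):
    # Second pass with one-half of the original radius, as described above.
    return cluster_activations(values, radius / 2.0)
```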
C. Phase III: Sentence Selection

Once the network has been trained, pruned, and generalized, it can be used as a tool to determine whether or not each sentence should be included in the summary. This phase is accomplished by providing control parameters for the desired radius and frequency of hidden layer activation clusters in order to select highly ranked sentences. The sentence ranking is directly proportional to cluster frequency and inversely proportional to cluster radius. Only sentences that satisfy the required cluster boundary and frequency of all hidden layer neurons are selected as high-ranking summary sentences. The selected sentences possess the common features inherent in the majority of summary sentences.
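A hedged sketch of this selection rule: a sentence is accepted only if the activation of every hidden neuron falls into a cluster that meets the frequency and radius requirements, and accepted sentences are ordered by a score that grows with cluster frequency and shrinks with cluster radius. The score below (frequency divided by radius) is our reading of the proportionality statement above, not a formula given in the text, and the data layout is assumed.

```python
def select_sentences(candidates, min_frequency, max_radius):
    """Rank sentences that satisfy the cluster constraints at every hidden neuron.

    candidates: list of (sentence_text, hits), where hits is a list with one
                (cluster, radius) pair per hidden neuron for that sentence.
    """
    ranked = []
    for text, hits in candidates:
        if all(c.frequency >= min_frequency and r <= max_radius for c, r in hits):
            # Assumed ranking: higher frequency and tighter clusters rank higher.
            score = sum(c.frequency / max(r, 1e-9) for c, r in hits)
            ranked.append((score, text))
    ranked.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in ranked]
```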
V. RESULTS AND ANALYSIS

We used 85 news articles from the Internet with various topics such as technology, sports, and world news to train the network. Each article consists of 19 to 56 sentences, with an average of 34 sentences. The entire set consists of 2,835 sentences. Every sentence, which is represented by a feature vector, is labeled as either a summary sentence or an unimportant sentence. A human reader performed the labeling of the sentences. A total of 163 sentences were labeled as summary sentences, with an average of 12 sentences per article. The text summarization process can be applied to both real-valued and binary-valued input vectors. Therefore, we trained three different neural networks, one for real-valued inputs and two for binary-valued inputs.

A. Real-Valued Feature Vectors

For the real-valued feature input, we trained a neural network (N1) with seven input layer neurons, twelve hidden layer neurons, and one output layer neuron. The input to each input layer neuron, which represents one of the seven sentence features, is a real value. The value of the output neuron is either one (summary sentence) or zero (unimportant sentence). The network was trained for 10,000 cycles and achieved a recall accuracy of 99%. After the feature fusion phase, f1 (paragraph follows title) and f4 (first sentence in the paragraph) were removed. In addition, three hidden layer neurons were removed. The removal of the f1 feature is understandable, since most of the articles did not have sub-titles or section headings; therefore, the only paragraph following a title would be the first paragraph, and this information is already contained in feature f2 (paragraph location in document). The removal of feature f4 indicates that the first sentence in the paragraph is not always selected to be included in the summary. We then used the same 85 news articles as a test set for the modified network. The summaries compiled by the network were compared with the summaries compiled by the human reader.
The accuracy of the modified network ranged from 90% to 95%, with an average accuracy of 93.6%, when compared to the desired results obtained from the human reader. The network selected one to five sentences in 30.5% of the articles (26 articles) that were not selected as summary sentences by the human reader. This can be attributed to over-generalization by the network. The network did not select one to three sentences in 12.9% of the articles (11 articles) that were selected by the human reader.

B. Binary-Valued Feature Vectors

For the binary-valued feature input, we discretized the seven sentence features. Each feature is represented as a sequence of binary values, where each binary value represents a range of real values for the sentence feature. We used two different approaches to discretize the real values. In the first approach, we grouped the real numbers into intervals. In the second approach, we discretized the real numbers into single real values.

1) Discretized Real Values into Intervals

TABLE II
DISCRETE INTERVALS

Feature   Neurons   Intervals
f1        2         [0], [1]
f2        4         [1-2], [3-4], [5-9], 10+
f3        4         [1-2], [3-6], [7-9], 10+
f4        2         [0], [1]
f5        4         [1-4], [5-7], [8-9], 10+
f6        3         [0-4], [5-9], 10+
f7        4         [0-3], [4-6], [7-9], 10+

We trained a neural network (N2) with twenty-three input layer neurons, thirty-five hidden layer neurons, and one output layer neuron. Each feature is represented by a binary vector. For example, f6 (number of thematic words in the sentence) can be represented as [0 1 0], which implies that there are five to nine thematic words in the sentence. The network was trained for 10,000 cycles and achieved a recall accuracy of 99%. Once again, after the feature fusion phase, f1 (paragraph follows title) and f4 (first sentence in the paragraph) were removed. In addition, seven hidden layer neurons were removed. The accuracy of the modified network ranged from 94% to 97%, with an average accuracy of 96.2%, when compared to the desired results for the same 85 news articles. The network selected one to three sentences in 14.1% of the articles (12 articles) that were not selected as summary sentences by the human reader, and did not select one to two sentences in 5.8% of the articles (5 articles) that were selected by the human reader.
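To illustrate the interval encoding of Table II, the sketch below one-hot encodes a raw count into its interval bin; the example reproduces the f6 case quoted above ([0 1 0] for five to nine thematic words). The bin boundaries are taken from the table, while the function name is ours.

```python
def encode_interval(value, upper_bounds):
    """One-hot encode a count into interval bins; the last bin is open-ended (10+)."""
    vector = [0] * (len(upper_bounds) + 1)
    for i, upper in enumerate(upper_bounds):
        if value <= upper:
            vector[i] = 1
            return vector
    vector[-1] = 1
    return vector

# f6 uses 3 neurons with intervals [0-4], [5-9], 10+ (Table II):
print(encode_interval(7, (4, 9)))   # -> [0, 1, 0], i.e. five to nine thematic words
```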
2) Discretized Real Values into Single Values

We discretized the real numbers up to 10 into individual numbers, where each real number is represented by a binary value. In this case, we trained a neural network (N3) consisting of fifty-four input layer neurons, seventy hidden layer neurons, and one output layer neuron. Each feature, except for f1 and f4, is represented by a 10-element binary vector. The network was trained for 10,000 cycles and achieved a recall accuracy of 99%. Once again, after the feature fusion phase, f1 (paragraph follows title) and f4 (first sentence in the paragraph) were removed. In addition, fourteen hidden layer neurons were removed. The accuracy of the modified network ranged from 97% to 99%, with an average accuracy of 98.6%, when compared to the desired results for the same 85 news articles. The network selected all sentences selected by the human reader. However, for 5.8% of the articles (5 articles), the network selected one to two sentences that were not selected as summary sentences by the human reader. Table III represents the features and their discrete values.
TABLE III
DISCRETE VALUES

Feature   Neurons   Intervals
f1        2         [0], [1]
f2        10        one value per neuron
f3        10        one value per neuron
f4        2         [0], [1]
f5        10        one value per neuron
f6        10        one value per neuron
f7        10        one value per neuron
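A minimal sketch of this single-value encoding follows, assuming the counts run from 1 to 10 with the last neuron shared by values of 10 or more; the exact handling of the endpoints is not stated above, and the function name is ours.

```python
def encode_single_value(value, size=10):
    """10-element binary vector with a single 1 marking the (clipped) count."""
    vector = [0] * size
    index = min(max(int(value), 1), size) - 1   # assumed range 1..10, with 10+ sharing the last neuron
    vector[index] = 1
    return vector

print(encode_single_value(3))   # -> [0, 0, 1, 0, 0, 0, 0, 0, 0, 0]
```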
C. Assessing the Accuracy of the Networks

In order to assess the accuracy of all three neural networks, we selected 25 different news articles. The human reader and all three modified networks summarized the 25 news articles independently. The average accuracy of the real-valued neural network (N1) was 93%, the average accuracy of the neural network with real values discretized into intervals (N2) was 96%, and the average accuracy of the neural network with real values discretized into single values (N3) was 99% when compared with the human reader's summaries.

VI. CONCLUSIONS

The performance of the text summarization process depends predominantly on the style of the human reader. The selection of features, as well as the selection of summary sentences by the human reader from the training paragraphs, plays an important role in the performance of the network. The network is trained according to the style of the human reader and to which sentences the human reader deems to be important in a paragraph. This, in fact, is an advantage our approach provides: individual readers can train the neural network according to their own styles. In addition, the selected features can be modified to reflect the reader's needs and requirements.

REFERENCES

[1] I. Mani, Automatic Summarization, John Benjamins Publishing Company, pp. 129-165, 2001.
[2] W.T. Chuang and J. Yang, "Extracting sentence segments for text summarization: a machine learning approach", Proceedings of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Athens, Greece, pp. 152-159, 2000.
[3] J. Kupiec, J. Pedersen and F. Chen, "A Trainable Document Summarizer", Proceedings of the 18th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Seattle, Washington, pp. 68-73, 1995.
[4] P.B. Baxendale, "Machine-Made Index for Technical Literature: An Experiment", IBM Journal of Research and Development, vol. 2, no. 4, pp. 354-361, 1958.
[5] C.Y. Lin and E. Hovy, "Identifying Topics by Position", Proceedings of the 5th Conference on Applied Natural Language Processing (ANLP-97), Seattle, Washington, pp. 283-290, 1997.
[6] H.P. Luhn, "The Automatic Creation of Literature Abstracts", IBM Journal of Research and Development, vol. 2, no. 2, pp. 159-165, 1958.
[7] K. Sparck Jones, "A Statistical Interpretation of Term Specificity and Its Application in Retrieval", Journal of Documentation, vol. 28, no. 1, pp. 11-21, 1972.
[8] H.P. Edmundson, "New Methods in Automatic Extracting", Journal of the ACM, vol. 16, no. 2, pp. 264-285, 1969.
[9] C.D. Paice, "The Automatic Generation of Literature Abstracts: An Approach Based on Self-Indicating Phrases", Information Retrieval Research, Proceedings of the Joint ACM/BCS Symposium in Information Storage and Retrieval, Cambridge, England, pp. 172-191, 1980.
[10] D. Marcu, "From Discourse Structures to Text Summaries", Proceedings of the ACL/EACL Workshop on Intelligent Scalable Text Summarization, Madrid, Spain, pp. 256-262, 1997.
[11] R. Brandow, K. Mitze and L. Rau, "Automatic condensation of electronic publications by sentence selection", Information Processing and Management, vol. 31, no. 5, pp. 675-685, 1995.
[12] M. Porter, "An algorithm for suffix stripping", Program, vol. 14, no. 3, pp. 130-137, 1980.