IJRET: International Journal of Research in Engineering and Technology | eISSN: 2319-1163 | pISSN: 2321-7308
Volume: 04, Issue: 11, Nov-2015 | Available @ http://www.ijret.org
A TEMPLATE BASED ALGORITHM FOR AUTOMATIC
SUMMARIZATION AND DIALOGUE MANAGEMENT FOR TEXT
DOCUMENTS
Prashant G. Desai¹, Sarojadevi H.², Niranjan N. Chiplunkar³
¹Lecturer, Department of Computer Science & Engineering, N.R.A.M.P., Nitte, Karnataka, India
²Professor & Head, Department of Computer Science, N.M.A.M.I.T., Nitte, Karnataka, India
³Principal, N.M.A.M.I.T., Nitte, Karnataka, India
Abstract
This paper describes an automated approach for extracting significant and useful events from unstructured text. The goal of the research is to arrive at a methodology that extracts important events such as dates, places, and subjects of interest, and that can also present the user with a shorter version of the text containing all non-trivial information. We also discuss our implementation of algorithms that perform exactly these tasks.
Key Words: Cosine Similarity, Information, Natural Language, Summarization, Text Mining
--------------------------------------------------------------------***----------------------------------------------------------------------
1. INTRODUCTION
The world is witnessing a large quantity of digital information today. This information is available in different ways and from different sources, with the World Wide Web being the dominant one. Digital information is presented to users mainly in two formats: structured and unstructured. Understanding and processing structured information is comparatively easy because the information is arranged in a specific format. The same is not true of unstructured information, where no specific arrangement exists: the information, however significant, is spread across the whole document. It is therefore a challenging task to understand, process and extract useful and significant information from unstructured documents such as emails, blogs, and research documents available in doc, pdf and txt formats. The ultimate goal of text mining is to discover useful information and build knowledge from it.
This paper describes a natural language processing approach
for identifying and extracting the important information
from the unstructured text.
2. RELATED WORK
The work presented in [6] describes a model for answering a question input by the user, by referring to a large database of question-answer sets. The words in the submitted question are expanded using their synonyms; the researchers concentrate mainly on obtaining synonyms of nouns, verbs and adjectives. The database search for the answer is performed after the question has been expanded.
The researchers of [18] present a two-phase process for text summarization. In the first phase, the original document is analyzed for its topic and passages are extracted. In the next phase, the extracted text is interpreted with the help of WordNet and FrameNet, based on which text fragments are selected and joined together before being presented to the user as output.
[23] is a research work on extracting answers from on-line psychological counseling websites. The information available from these websites is used as a knowledge base, and an index is created to make keyword search efficient. Using this index-based search, candidate answers fetched for a user query are generated as the response.
Information from three attributes is used for summarizing a document in [1]: field, associated terms and attribute grammars. The results of the experiments conducted by the authors show that a high level of accuracy is attained with their approach.
The work illustrated in [5] formulates a set of rules for summarizing a text document. To start with, the algorithm analyses the structure of the natural language text; the rules formulated by the researchers are then applied to generate the summary.
[17] discusses a summarization technique based on fuzzy logic. Before the fuzzy logic is applied, sentences from the original text document are selected based on eight pre-defined features such as title feature, sentence length, term weight, sentence position, sentence-to-sentence similarity, proper noun, thematic word, and numerical data. These are implemented as simple IF-THEN rules in an inference engine module of the algorithm.
[9] illustrates the significance of artificial neural networks for finding key phrases in text. The authors compare the key words extracted by neural network, Naïve Bayes and decision tree algorithms; the key words from all of these are then compared with the publicly available key phrase extraction algorithm KEA. The authors conclude that neural networks perform better for extracting keywords.
A text mining approach is discussed in [22] for creating an ontology from text belonging to a specific domain. The methodology adopted for building the ontology comprises data pre-processing, natural language analysis, concept extraction, semantic relation extraction and ontology building.
The study in [14] finds that the quality of an automated summary depends on key phrases; the author therefore focuses on locating high-quality key words in the text. The procedure adopted for summarizing is pre-processing, then key phrase extraction, then sentence selection.
[16] provides guidelines, implemented by the authors, for achieving text mining independent of the language of the original text. The important points made by the authors are: formulate the rules independently of the language; minimize the use of language-specific resources; if the use of such resources becomes inevitable, keep them in a modular representation so that adding new ones becomes easier; and, for cross-lingual applications, represent the document in a neutral language to overcome the complexity of dealing with multiple languages.
3. METHODOLOGY
The methodology is implemented in two phases: 1. Text Pre-Processing and 2. Information Extraction. As soon as a document is chosen by the user, it goes to the file classification module, which identifies whether the input document is a PDF or an MS-Word document before reading its contents.
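As a minimal illustration of this step (not the authors' code), a document can be classified by its leading bytes: PDF files begin with "%PDF", while .docx files are ZIP containers beginning with "PK".

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Illustrative sketch of a file-classification step based on the file's magic bytes.
public class FileClassifier {
    public static String classify(Path document) throws IOException {
        try (InputStream in = Files.newInputStream(document)) {
            byte[] header = new byte[4];
            int read = in.read(header);
            String magic = read > 0 ? new String(header, 0, read, StandardCharsets.US_ASCII) : "";
            if (magic.startsWith("%PDF")) return "pdf";
            if (magic.startsWith("PK"))   return "ms-word";   // .docx is a ZIP container
            return "unknown";
        }
    }
}
```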
Phase 1: Text Pre-Processing
This part of the implementation includes the modules shown
below:
1. Syntactic analysis
2. Tokenization
3. Semantic analysis
4. Stop word removal
5. Stemming
Phase 2: Information extraction
This part of the implementation includes the modules shown
below:
1. Training the dialogue control
2. Knowledge base discovery
3. Dialogue Management
4. Template based summarization
5. Domain intelligence suites
Figure 1 and Figure 2 show the building blocks of our
system architecture.
Phase 1: Text Pre-Processing
3.1 Syntactic Analysis
Unstructured text has no pre-defined format in which data is organized, so it is necessary to decide where a particular piece of data begins and where exactly it ends. The task of the syntactic analysis module is exactly this: to decide the beginning and end of each sentence in a document. At present we assume that the full stop symbol marks the end of a sentence, and any string of characters up to a full stop is treated as one sentence. We do this with the help of the Java StringTokenizer class.
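A minimal sketch of this sentence-splitting step, using java.util.StringTokenizer and the simple full-stop rule described above (illustrative only):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

// Splits raw text into sentences, treating every run of characters up to a full stop
// as one sentence (the simplifying assumption stated in Section 3.1).
public class SentenceSplitter {
    public static List<String> split(String text) {
        List<String> sentences = new ArrayList<>();
        StringTokenizer tokenizer = new StringTokenizer(text, ".");
        while (tokenizer.hasMoreTokens()) {
            String sentence = tokenizer.nextToken().trim();
            if (!sentence.isEmpty()) {
                sentences.add(sentence);
            }
        }
        return sentences;
    }
}
```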
[Figure 1 block diagram: Text Document → File classification → Text Pre-processing modules (Syntactic analysis, Semantic analysis, Tokenization, Stop word removal, Stemming) → Processed text]
Fig -1: Text Pre-Processing modules
3.2 Tokenization
The task of the tokenizer is to break a sentence into tokens. The resulting pieces of a sentence may include words, numbers and punctuation marks. This step is essential for the next level of processing of the text [15].
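A small, regular-expression-based sketch of such a tokenizer (an illustration, not the authors' exact implementation) that separates numbers, words and punctuation marks:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Breaks a sentence into word, number and punctuation tokens.
public class Tokenizer {
    private static final Pattern TOKEN = Pattern.compile("\\d+|\\w+|[^\\s\\w]");

    public static List<String> tokenize(String sentence) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(sentence);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }
}
```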
3.3 Semantic Analysis
The semantic analysis module determines the part played by each word in a sentence and then assigns a tag to every word, such as noun, verb, adjective, adverb and so on. This process of understanding the part played by each word and assigning an appropriate tag is called Part-Of-Speech (POS) tagging. The Stanford University POS tagger [19] is used to implement this module.
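Tagging a sentence with the Stanford tagger can be sketched as below; the model file name is an assumption and must point to a tagger model shipped with the Stanford POS tagger distribution:

```java
import edu.stanford.nlp.tagger.maxent.MaxentTagger;

// Sketch of POS tagging with the Stanford tagger [19].
public class PosTagging {
    public static void main(String[] args) {
        // Assumed model path; replace with the model bundled with the tagger release.
        MaxentTagger tagger = new MaxentTagger("models/english-left3words-distsim.tagger");
        String tagged = tagger.tagString("The system extracts important dates and places.");
        System.out.println(tagged);   // e.g. "The_DT system_NN extracts_VBZ ..."
    }
}
```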
3.4 Stop Word Removal
Some words appear in natural language text frequently but contribute very little to the overall meaning of a sentence. Such words are termed stop words, and they are removed before the document is considered for the next level of processing [10].
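A minimal stop-word filter might look like the following sketch; the stop-word list shown here is a tiny illustrative sample rather than the list actually used by the system:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

// Removes high-frequency, low-content words before further processing.
public class StopWordFilter {
    private static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList(
            "a", "an", "the", "is", "are", "of", "in", "on", "and", "to"));

    public static List<String> removeStopWords(List<String> tokens) {
        return tokens.stream()
                     .filter(t -> !STOP_WORDS.contains(t.toLowerCase()))
                     .collect(Collectors.toList());
    }
}
```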
3.5 Stemming
Stemming is the process of finding the base form of a word in the text document [4, 3]. A text document may contain the same word repeated in different grammatical forms, such as past tense, present tense or present continuous tense. Stemming is performed so that these forms of the same word are not treated as different words. At present the Porter stemmer [11] is used for stemming.
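Assuming the reference Java implementation of the Porter stemmer [11] (a Stemmer class exposing add, stem and toString methods), the stemming step can be sketched as:

```java
// Reduces a single word to its base form using the reference Porter Stemmer class [11];
// the add/stem/toString API is assumed from the tartarus.org Java port.
public class Stemming {
    public static String stem(String word) {
        Stemmer stemmer = new Stemmer();          // external class from the Porter stemmer port
        for (char ch : word.toLowerCase().toCharArray()) {
            stemmer.add(ch);                      // feed the word one character at a time
        }
        stemmer.stem();                           // run the Porter algorithm
        return stemmer.toString();                // e.g. "running" -> "run"
    }
}
```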
The text, after passing through all these modules, is fed into the second phase, the Information Extraction module.
Phase 2: Information Extraction
This module takes the pre-processed text and discovers important information using machine learning algorithms designed and implemented by us. The process begins with training the model.
3.6 Training The Dialogue Control
Here the administrator of the module trains the system. During training, the system learns important or index terms and named entities such as names of persons, places, and temporal formats, according to rules defined and discussed in our previous paper [12]. The intelligence of the algorithm increases with every training session. The concepts learnt during training are stored in the knowledge base of the system.
3.7 Knowledge Base Discovery
Knowledge base discovery is the task of creating intelligence from structured sources (e.g., databases, XML) and unstructured text (e.g., documents, blogs, emails) [8, 7]. Our research is focused purely on processing unstructured text, so knowledge discovery here means deriving intelligent information from unstructured text and storing it.

The knowledge base stores important terms such as index terms, names of persons and names of locations in an efficient way. By calling the knowledge base design efficient, we mean that the structure is formed in such a way that all terms are stored together yet remain distinguishable by category. This removes the need for multiple storage structures for different categories of terms, reduces search time and thus improves the overall performance of the algorithm.
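One possible reading of this single, category-aware store is sketched below; the data structure is our assumption for illustration, not the authors' actual schema:

```java
import java.util.HashMap;
import java.util.Map;

// A single map holds every learnt term, tagged with its category, so one lookup
// structure serves index terms, persons, locations and dates alike.
public class KnowledgeBase {
    public enum Category { INDEX_TERM, PERSON, LOCATION, DATE }

    private final Map<String, Category> terms = new HashMap<>();

    public void learn(String term, Category category) {
        terms.put(term.toLowerCase(), category);
    }

    // One search path for all categories is what keeps the lookup time low.
    public Category lookup(String term) {
        return terms.get(term.toLowerCase());
    }
}
```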
[Figure 2 block diagram: Pre-Processed Text → Information Extraction modules (Training, Knowledge Discovery, Dialogue Control, Template Based Summary), backed by the Domain Intelligence Suites database]
Fig -2: Information Extraction modules
3.8 Dialogue Management
A dialogue is an interaction between two parties: either between two humans or between a human and a machine [21]. This paper discusses the implementation of an algorithm for a dialogue between human and machine. The dialogue management module is a program which offers interaction between the human and the computer; with this module, the end user can request information using natural language text. The dialogue control module, which is built upon the training model, accepts the user request, interprets it, consults the knowledge base and then produces the answers which possibly contain the sought information. The algorithm below illustrates the functioning of dialogue management; a minimal code sketch of the answer-scoring step follows the listed steps.
1. Choose the input document.
2. The user is presented with two choices:
   1) Event-related options
   2) Writing a customized query
3. With choice 1, events such as important dates, locations and subjects of interest available in the document are extracted.
4. With choice 2, the user can write a customized query.
5. Tokenize the query given in natural language text.
6. Identify and extract the index terms (also called index words, cue words, key words and so on).
7. Search the knowledge base using the index terms.
   7.1 While searching for answers, the algorithm considers the synonyms of every index word.
   7.2 WordNet [13] is used for collecting the synonyms of a term.
8. Answers which contain the index terms are short-listed.
9. The sequence of appearance of index terms in the answer and in the query is then matched.
10. Answers are presented to the user along with a score for every answer.
11. The score of an answer is the count of its index terms appearing in the sequence in which they appear in the query.
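As referenced above, the sketch below illustrates the shortlisting and scoring of steps 8-11; it is our illustration of the stated rule, and WordNet synonym expansion (step 7) is omitted for brevity:

```java
import java.util.Arrays;
import java.util.List;

// Scores a candidate answer by how many of the query's index terms it contains
// in the same order in which they appear in the query (steps 8-11).
public class AnswerScorer {
    public static int score(List<String> queryIndexTerms, String answer) {
        String lower = answer.toLowerCase();
        int score = 0;
        int searchFrom = 0;
        for (String term : queryIndexTerms) {
            int pos = lower.indexOf(term.toLowerCase(), searchFrom);
            if (pos < 0) break;                  // sequence broken: stop counting
            score++;
            searchFrom = pos + term.length();    // continue matching after this term
        }
        return score;
    }

    public static void main(String[] args) {
        List<String> terms = Arrays.asList("submission", "deadline");
        System.out.println(score(terms, "The submission deadline is 10 Dec 2015.")); // prints 2
    }
}
```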
3.9 Template Based Summarization
Text summarization is the process of putting the meaningful text present in a document together in a condensed format. Template based summarization in our system is designed to perform this operation.

Here the user has the freedom to choose what should be present in the summary. In other words, the user prepares a template on which the summary is based. The template can include POS tags such as Noun, Verb and Adverb, and the sequence in which the user wants them to appear in the document. The user is not confined to one pattern of this kind but can define as many patterns as needed. Once the POS patterns are finalized, the user can decide whether named entities and dates should be included in the expected summary. This completes the step of preparing the template. The template based summary module takes all these requirements of the user into account while it generates the summary.
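Assuming the word_TAG output format of the Stanford tagger, the sentence-selection core of this module could be sketched as follows (an interpretation for illustration, not the authors' code):

```java
import java.util.ArrayList;
import java.util.List;

// Keeps a tagged sentence ("word_TAG word_TAG ...") if the tags of any run of
// consecutive words match one of the user's POS patterns, e.g. {"DT", "NNS"}.
public class TemplateSummarizer {
    public static List<String> summarize(List<String> taggedSentences, List<String[]> patterns) {
        List<String> summary = new ArrayList<>();
        for (String sentence : taggedSentences) {
            if (patterns.stream().anyMatch(p -> containsPattern(sentence, p))) {
                summary.add(sentence);
            }
        }
        return summary;
    }

    private static boolean containsPattern(String taggedSentence, String[] pattern) {
        String[] tokens = taggedSentence.trim().split("\\s+");
        outer:
        for (int i = 0; i + pattern.length <= tokens.length; i++) {
            for (int j = 0; j < pattern.length; j++) {
                String[] parts = tokens[i + j].split("_");
                String tag = parts[parts.length - 1];   // tag is the part after the last underscore
                if (!tag.equals(pattern[j])) {
                    continue outer;
                }
            }
            return true;                                // whole pattern matched in sequence
        }
        return false;
    }
}
```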
4. EXPERIMENTS AND RESULTS
4.1 Evaluation Metrics
There are many similarity measures for deciding the closeness of two text documents [2], such as cosine similarity, Euclidean distance, Jaccard coefficient, Pearson correlation coefficient and averaged Kullback-Leibler divergence. The experimental study carried out in [20] concludes that cosine similarity is better suited for text documents, so the cosine similarity measure is used to evaluate the automatically obtained summaries, while Precision, Recall and F-Measure are used to evaluate the dialogue management interface.
4.2 Cosine Similarity
Let D = {d1, . . . , dn} be a set of documents and T = {t1, . . . , tm} the set of distinct terms appearing in D. Each document d is represented by an m-dimensional term-frequency vector, where tf(d, t) denotes the frequency of term t ∈ T in document d ∈ D:

\vec{t}_d = (tf(d, t_1), \ldots, tf(d, t_m))

Given two documents a and b with term vectors \vec{t}_a and \vec{t}_b, their cosine similarity is calculated by the formula

SIM_C(\vec{t}_a, \vec{t}_b) = \frac{\vec{t}_a \cdot \vec{t}_b}{\lvert \vec{t}_a \rvert \times \lvert \vec{t}_b \rvert}
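In code, this measure reduces to a few lines; the sketch below assumes both term-frequency vectors are built over the same term ordering t1 . . . tm:

```java
// Cosine similarity between two term-frequency vectors, following the formula above.
public class CosineSimilarity {
    public static double similarity(double[] tfA, double[] tfB) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (int i = 0; i < tfA.length; i++) {
            dot   += tfA[i] * tfB[i];
            normA += tfA[i] * tfA[i];
            normB += tfB[i] * tfB[i];
        }
        if (normA == 0.0 || normB == 0.0) return 0.0;   // guard against empty documents
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }
}
```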
4.3 Precision
Precision measures the number of correctly identified items as a percentage of the number of items identified. It is formally defined as

P = \frac{\mathrm{Correct} + \frac{1}{2}\,\mathrm{Partial}}{\mathrm{Correct} + \mathrm{Spurious} + \mathrm{Partial}}
4.4 Recall
Recall measures the number of correctly identified items as a percentage of the total number of correct items. It is formally defined as

R = \frac{\mathrm{Correct} + \frac{1}{2}\,\mathrm{Partial}}{\mathrm{Correct} + \mathrm{Missing} + \mathrm{Partial}}
4.5 F-Measure
The F-measure is often used in conjunction with Precision and Recall as a weighted average of the two. If P and R are given equal weights, we use the equation

F_1 = \frac{P \cdot R}{0.5\,(P + R)}
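The three metrics can be computed directly from the Correct, Partial, Spurious and Missing counts reported in Table 2; for example, its first row gives P = (23 + 4)/(23 + 8 + 8) ≈ 0.69 and R = (23 + 4)/(23 + 0 + 8) ≈ 0.87. A direct translation into code:

```java
// Precision, Recall and F1 computed from Correct/Partial/Spurious/Missing counts,
// following the definitions in Sections 4.3-4.5.
public class DialogueMetrics {
    public static double precision(int correct, int partial, int spurious) {
        return (correct + 0.5 * partial) / (correct + spurious + partial);
    }

    public static double recall(int correct, int partial, int missing) {
        return (correct + 0.5 * partial) / (correct + missing + partial);
    }

    public static double f1(double p, double r) {
        return (p * r) / (0.5 * (p + r));   // the weighted form of Section 4.5
    }
}
```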
4.6 Experimental Setup
Ten text documents, mainly research documents and conference papers (CFP), are chosen for evaluating the results. We manually generated a summary for each of these ten documents; these manually created summaries are also called reference summaries. The reference summaries for the research papers are generated on the basis of content containing important parts of speech (POS) such as nouns, verbs, adverbs and proper nouns. The reference summaries for the conference papers are chosen on the basis of text containing important dates, subjects of interest and names of locations. The templates for generating the automated summaries are prepared so that they contain patterns with the same POS sequences used for preparing the reference summaries. This is done mainly to verify the accuracy of the automated summary: the human and the algorithm prepare their summaries with the same references. We then compared the template based summary generated by the system with the respective reference summary using the cosine similarity measure, since both summaries are stored as text documents. Table 1 displays the results in terms of cosine similarity values.
Table -1: Cosine Similarity Indices

Length | Document Category            | Manual Summary Length | System Summary Length | Cosine Similarity Value
168    | Technical/Research Document  | 114                   | 52                    | 0.534
215    | CFP                          | 114                   | 246                   | 0.696
221    | Technical/Research Document  | 168                   | 140                   | 0.81
236    | Technical/Research Document  | 156                   | 35                    | 0.56
311    | Technical/Research Document  | 172                   | 226                   | 0.844
325    | Technical/Research Document  | 85                    | 72                    | 0.788
348    | CFP                          | 174                   | 295                   | 0.6666
512    | Technical/Research Document  | 265                   | 246                   | 0.785
676    | Technical/Research Document  | 165                   | 379                   | 0.84
1328   | Technical/Research Document  | 151                   | 571                   | 0.656
Table 2 presents the results for the dialogue management interface, tested on two technical documents and two conference papers from the chosen dataset.
4.7 Screen Shots
Figure 3 shows the template prepared and the summary generated by the system. The template is prepared with two patterns:
1. A "Determiner" POS tag followed by a "Plural Noun"
2. A "Base form of Verb" POS tag followed by a "Singular Noun"
The original document has 6 sentences. After processing the original document with reference to the template prepared by the user, the system produces an automated summary which contains only the 3 sentences of significance. This is shown in Figure 3.
Fig -3: Summary generated based on a template
Fig -4: Cosine similarity index between system and reference summaries
Table -2: Result Analysis of Dialogue Management Interface

Index Words Used for Information Extraction | Submitted Document Category | Expected Information | Information Extracted | Correct | Partial | Spurious | Missing | Precision | Recall | F-Measure
Text, Summarization | Technical research paper | Information from the document on Text, Summarization | All the information present in the document having the input words | 23 | 8 | 8 | 0 | 0.69 | 0.87 | 0.19
Dates | CFP | Dates embedded inside the text | Dates present in the document | 1 | 0 | 0 | 0 | 1.00 | 1.00 | 0.25
Places |  | Name of places | Places present in the text | 2 | 0 | 0 | 1 | 1.00 | 0.67 | 0.20
Subjects of Interest |  | Technical subjects | Areas of interest | 11 | 1 | 6 | 0 | 0.64 | 0.96 | 0.19
5. CONCLUSIONS
Text mining is a trending research area at present, the simple reason being the vast availability of information in the form of natural language. In this paper, we presented algorithms for automatic text summarization and dialogue management. The automatic text summarization algorithm focuses on the requirements of the user before the original text is summarized, while the dialogue management algorithm establishes interaction between the user and the computer. The main contribution of the work reported here is the template based algorithm for text summarization and dialogue management. The experiments conducted during the implementation of the work have produced encouraging results, indicating that the accuracy of both the automatic summarization and the dialogue management algorithms is more than 70%. Hence, it can be concluded that the performance of our algorithmic implementation is satisfactory.
REFERENCES
[1] Abdunabi Ubul, El-Sayed Atlam, Kazuhiro Morita, Masao Fuketa, Junichi Aoe, "A Method for Generating Document Summary using Field Association Knowledge and Subjectively Information", Proceedings of IEEE, 2010.
[2] Anna Huang, "Similarity Measures for Text Document Clustering", Proceedings of NZCSRSC, 2008, pp. 49-56.
[3] E. Brill, "A Simple Rule-Based Part-of-Speech Tagger", Proceedings of the 3rd Conference on Applied Natural Language Processing, 1992, pp. 152-155.
[4] K. W. Church, "A Stochastic Parts Program and Noun Phrase Parser for Unrestricted Text", Proceedings of the 1st Conference on Applied Natural Language Processing (ANLP), ACL, 1988, pp. 136-143.
[5] C. Lakshmi Devasena, M. Hemalatha, "Automatic Text Categorization and Summarization using Rule Reduction", Proceedings of ICAESM, 2012, pp. 594-598.
[6] Dang Truong Son, Dao Tien Dung, "Apply a Mapping Question Approach in Building the Question Answering System for Vietnamese Language", Journal of Engineering Technology and Education, 2012, pp. 380-386.
[7] en.wikipedia.org/wiki/Knowledge_extraction
[8] Jantima Polpinij, "Ontology-based Knowledge Discovery from Unstructured Text", Proceedings of IJIPM, Volume 4, Number 4, 2013, pp. 21-31.
[9] Kamal Sarkar, Mita Nasipuri, Suranjan Ghose, "Machine Learning Based Keyphrase Extraction: Comparing Decision Trees, Naïve Bayes, and Artificial Neural Networks", Proceedings of JIPS, 2012, pp. 693-712.
[10] M. Suneetha, S. Sameen Fatima, "Corpus based Automatic Text Summarization System with HMM Tagger", Proceedings of IJSCE, ISSN: 2231-2307, Volume 1, Issue 3, 2011, pp. 118-123.
[11] Porter stemmer, http://www.tartarus.org/~martin/PorterStemmer
[12] Prashant G. Desai, Sarojadevi H., Niranjan N. Chiplunkar, "Rule-Knowledge Based Algorithm for Event Extraction", Proceedings of IJARCCE, ISSN (Online): 2278-1021, ISSN (Print): 2319-5940, Volume 4, Issue 1, 2015, pp. 79-85.
[13] Princeton University, WordNet, http://wordnet.princeton.edu/
[14] Rafeeq Al-Hashemi, "Text Summarization Extraction System (TSES) Using Extracted Keywords", International Arab Journal of e-Technology, Volume 1, Number 4, 2010, pp. 164-168.
[15] Raju Barskar, Gulfishan Firdose Ahmed, Nepal Barska, "An Approach for Extracting Exact Answers to Question Answering (QA) System for English Sentences", Proceedings of ICCTSD, 2012, pp. 1187-1194.
[16] Ralf Steinberger, Bruno Pouliquen, Camelia Ignat, "Using Language-Independent Rules to Achieve High Multilinguality in Text Mining", Mining Massive Data Sets for Security, 2008, pp. 217-240.
[17] Rucha S. Dixit, S. S. Apte, "Improvement of Text Summarization using Fuzzy Logic Based Method", Proceedings of IOSRJCE, ISSN: 2278-0661, ISBN: 2278-8727, Volume 5, Issue 6, 2012, pp. 05-10.
[18] Shuhua Liu, "Enhancing E-Business-Intelligence-Service: A Topic-Guided Text Summarization Framework", Proceedings of the Seventh IEEE International Conference on E-Commerce Technology, 2005.
[19] Stanford University POS Tagger.
[20] Subhashini, Kumar, "Evaluating the Performance of Similarity Measures Used in Document Clustering and Information Retrieval", Proceedings of ICIIC, 2010, pp. 27-31.
[21] Trung H. Bui, "Multimodal Dialogue Management - State of the Art", January 3, 2006.
[22] Xing Jiang, Ah-Hwee Tan, "Mining Ontological Knowledge from Domain-Specific Text Documents", Proceedings of the Fifth IEEE International Conference on Data Mining, 2005.
[23] Yuanchao Liu, Ming Liu, Zhimao Lu, Mingkai Song, "Extracting Knowledge from On-Line Forums for Non-Obstructive Psychological Counseling Q&A System", International Journal of Intelligence Science, 2012, pp. 40-48.