
Multi Document Summarisation Using Rule Engine

Virendra P. Yadav, Anshul M. Gedam, Sanskar Korekar, Pranay There, Sujal Bitle, Tarang Bhaisare
Department of Computer Science and Engineering, P.C.E., Nagpur, India

1. Abstract:

In today's digital generation, we have access to an enormous amount of information. However, it can be a time-consuming task for users to read, sort, and summarise all this information. To address this challenge, a method called multi-document summarisation has been developed using a rule engine.

Multi-document summarisation involves the extraction of valuable and relevant information from a collection of uploaded documents. It aims to create concise and coherent summaries of the content within these documents. This approach is essential in various applications, including market reviews, search engines, and business analysis.

Summarising multiple documents in this manner enables users to quickly obtain the necessary information from the entire set of referenced documents. The specific approach used in multi-document summarisation is extractive, which means it selects and combines existing content from the source documents to create a summary.

The goal of this research paper is to employ a rule engine to summarise text from multiple documents using an extractive approach. This will result in a coherent and meaningful output, benefiting users in managing and extracting information from a large volume of documents.

Keywords: document summarisation, extractive approach

2. Introduction:

In contemporary academic and research contexts, the increasing volume and diversity of scholarly publications and information sources present formidable challenges for researchers in terms of information assimilation and synthesis. In response to these challenges, the field of text summarisation and document summarisation has emerged as a potential solution. The primary goal is to contribute to the advancement of document summarisation through the utilisation of rule-based engines. Specifically, the research seeks to develop a system that automates the generation of concise document summaries from a diverse range of documents accessible through a web application. These summaries will be of a generic nature, encompassing the entire content of the uploaded documents and presenting it in a condensed form.

Types of text summarisation used in this work:

• On the basis of scale: multi-document
• On the basis of approach: extractive
• On the basis of domain dependency: domain oriented
A. On the basis of scale: Multi Documents -

The process of extracting data and information from multiple uploaded documents to create a comprehensive summary is commonly referred to as "multi-document summarisation." It is important to remember that multi-document summarisation is notably more challenging than its counterpart, single-document summarisation, which is designed to extract summaries from individual documents. Single-document summarisation is inherently more straightforward in terms of document arrangement and sequence due to its isolated focus.

The underlying rationale behind the pursuit of multi-document summarisation lies in its ability to serve the needs of researchers, search engines, market analysts, and other pertinent stakeholders in the efficient analysis of substantial volumes of data. This process enables the conversion of vast datasets into concise and succinct summaries, all in reference to the multiple documents that have been uploaded and aggregated for analysis.

B. On the basis of approach: Extractive Summarisation -

The methodology employed for multi-document summarisation utilising a rule-based engine follows the "Extractive-Based Approach." This approach is a key feature of the web application designed for document summarisation, enabling the extraction of relevant content from the entirety of the data contained in the multiple uploaded documents.

In the extractive-based approach, the document summariser operates by extracting information based on several key factors. These factors include a sentence's importance, the frequency of sentence repetition, the distinctions among different sentences, and a comprehensive analysis to identify which content is the most pertinent, informative, and valuable.

In essence, the extractive-based approach aims to condense extensive volumes of information derived from various user-uploaded documents into concise and relevant summaries. This process is designed to save time and effort, providing users with a streamlined means of obtaining essential information from their documents.
C. On the basis of domain dependency: Domain Oriented

In the context of this research paper, the approach to multi-document summarisation is oriented toward specific domains. This implies that the domain of the resulting summary will be inherently linked to the content of the uploaded documents themselves. The fundamental purpose of this domain-oriented approach is to help users comprehend and review the information contained within the uploaded documents through concise and relevant summaries that pertain to the subject matter of the documents.

It is crucial to emphasise that the domain is intimately connected to the specified topic delineated within the uploaded documents. As an illustration, if the multiple documents contain information from the domains of medicine, research, and law, the multi-document summariser will generate summaries within the respective domains of medicine, research, and law. This domain-oriented approach is poised to significantly enhance the usability and relevance of the summarisation process, aligning the summaries with the domain-specific content of the uploaded documents.

Literature Review:

1) Extractive Text Summarisation:

An extractive summariser extracts the most frequently repeated words and decides which sentences and points are more important depending upon their usage in the uploaded documents.

2) Term Frequency - Inverse Document Frequency (TF-IDF) approach:

TF-IDF is used to weigh the frequency of the words in the uploaded documents. The higher a word's weighted frequency, the greater the chance that the sentences and words containing it are selected for the summary.
3) Clustering based approach:

The main purpose of the clustering approach is to determine the topics and summarise the text from the multiple uploaded documents in proper sequences and subtopics. The clustering approach forms the topics for the domain and forms sentences in a proper format for each topic.

4) Rule based Method (Engine rule):

In multi-document summarisation, the rule-based method is used to apply rules to the sentences extracted from the uploaded documents. The rules are used to arrange the sentences in proper sequence and to generate paragraphs on the required topic.

5) Domain-oriented Summarisation:

The domain-oriented summarisation approach is applied so that the output is obtained in the particular domain of the multiple uploaded documents. The domain is the field to which the title is related. For example, if the domain of the uploaded documents is medical, then the output summary will be in the medical domain only, without any external domain content.

6) Applications of Multi-document Summarisers:

The application of multi-document summarisation in the current era is to summarise the data from multiple documents in order to save the user's time and effort, as it is very time consuming to read, understand, and then summarise data from multiple sources. It is a time-saving and advanced technology for researchers.

Methodology:

Rule Engine:

Rule engines, in the context of this study, operate through the establishment of
a structured and intelligible framework for rule definition. These rules are
conventionally formulated as "if-then" statements, wherein conditions undergo
assessment, and corresponding actions are executed upon their satisfaction.
Rule engines serve as a means to automate intricate decision-making
processes, oversee workflows, and enforce organisational policies.
Within the scope of this research, the rule engine is harnessed to mechanise
the process of generating pertinent summaries from a collection of uploaded
documents. The rule engine employs a rule-based approach for the extraction
of sentences, paragraphs, or key phrases, with a focus on content relevance
within the uploaded documents.
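
To make the rule-definition framework concrete, the toy sketch below illustrates the if-then structure described above. It is an illustrative assumption rather than the authors' actual engine: the Rule class, the run_rules helper, and the example threshold are all hypothetical.

from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class Rule:
    condition: Callable[[Dict], bool]  # the "if" part, evaluated against the document state
    action: Callable[[Dict], None]     # the "then" part, executed when the condition holds

def run_rules(state: Dict, rules: List[Rule]) -> None:
    # Evaluate every rule and fire the actions of the rules whose conditions are satisfied.
    for rule in rules:
        if rule.condition(state):
            rule.action(state)

# Hypothetical rule: if the uploaded documents are long, allow a longer summary.
rules = [Rule(condition=lambda s: s["total_sentences"] > 500,
              action=lambda s: s.update(summary_length=20))]

state = {"total_sentences": 800, "summary_length": 10}
run_rules(state, rules)
print(state["summary_length"])  # prints 20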

The central objective of the rule engine lies in the curation of significant and
valuable information within the resulting summary. In the context of this
research paper, the rule engine leverages cross-document coherence rules to
discern connections among the concepts, events, and entities present in the
provided documents. It also dynamically adjusts the length and structure of the
summary in response to the complexity of the source material.

The incorporation of the rule engine into the summarisation process has
contributed to its systematisation, flexibility, and adaptability in handling multi-
document summarisation tasks. This paper delves into the utilisation of the rule
engine to enhance the efficiency and effectiveness of multi-document
summarisation by harnessing rule-based techniques and cross-document
coherence rules to capture the essence of the source documents, thereby
facilitating a more refined and coherent summarisation process.

In essence, the rule engine autonomously determines the most suitable arrangement of tasks, adapting to the specific requirements of the situation. This distinctive characteristic signifies a departure from the conventional linear sequence of operations, as it introduces a dynamic and rule-driven approach to task orchestration within the multi-document summarisation process.

This research underscores the significance of this innovative approach, which allows the rule engine to optimise the summarisation process by rearranging the order of Clustering, TF-IDF, Topic Modelling, and Cosine Similarity operations based on the inherent needs of the documents and the rules it employs. This adaptive and rule-centric methodology contributes to the rule engine's ability to enhance the efficiency and effectiveness of multi-document summarisation, offering a novel perspective in the field of automated document summarisation.
PDF Text Extraction using PyPDF2:

Library: PyPDF2 is a Python library used for working with PDF files. It allows you to extract text from and manipulate PDF documents.

Functionality: In the code provided, the PdfFileReader class from PyPDF2 is used to extract text from PDF files. Here's how it works:

The PdfFileReader is initialised with the path to a PDF file using PdfFileReader(open(file_path, "rb")). The "rb" mode indicates that the file should be opened in binary read mode.

The getNumPages() method is used to determine the total number of pages in the PDF.

A loop is used to iterate through each page of the PDF with pdf.getPage(page). For each page, extractText() is called to extract the text content. The extracted text is then concatenated to create a complete text representation of the PDF.

Usage: This technique is essential for extracting text content from PDF
documents, which may contain text, images, and various other elements.
Extracted text can then be processed, analysed, and summarised as
needed.
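
A minimal sketch of this extraction step is shown below, assuming the legacy PyPDF2 PdfFileReader API described above (newer PyPDF2 releases expose PdfReader instead); the file path and helper name are placeholders.

from PyPDF2 import PdfFileReader

def extract_pdf_text(file_path):
    # Open the PDF in binary read mode, as required by PdfFileReader.
    pdf = PdfFileReader(open(file_path, "rb"))
    text = ""
    # Iterate through every page and concatenate its extracted text.
    for page in range(pdf.getNumPages()):
        text += pdf.getPage(page).extractText()
    return text

text = extract_pdf_text("example.pdf")  # placeholder path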

Text Extraction from DOC and DOCX using textract:

Library: textract is a Python library that simplifies the extraction of text from various document formats, including DOC and DOCX files.

Functionality: In the code, textract is used to extract text from Microsoft Word documents (both .doc and .docx). Here's how it works:

textract.process(file_path) is used to extract text from the specified document file. The file_path is the path to the document file you want to process.

The extracted text is further processed and decoded using UTF-8 encoding to ensure that it is in a usable text format.
Usage: This method is employed for extracting text from Microsoft Word
documents, which are commonly used for various types of reports,
articles, and documents. Extracting text from these formats allows the
code to work with textual content from a wide range of sources, enabling
summarisation and analysis.
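
A short sketch of this step, assuming the textract call described above; the file path and helper name are placeholders.

import textract

def extract_doc_text(file_path):
    # textract.process returns raw bytes, so decode them into UTF-8 text.
    raw = textract.process(file_path)
    return raw.decode("utf-8")

text = extract_doc_text("example.docx")  # placeholder path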

Text Preprocessing :

Text preprocessing is a critical step in natural language processing and text analysis tasks. It involves cleaning and transforming raw text data to make it more suitable for further analysis, such as text classification, sentiment analysis, or information retrieval. Here's a more detailed explanation of the mentioned text preprocessing techniques:

• Lowercasing:
• Purpose: Convert all text to lowercase to ensure uniformity and
reduce the impact of letter case on text analysis. This makes
"Hello" and "hello" identical for analysis.

• Removing Special Characters and Punctuation:
• Purpose: Eliminate special characters and punctuation marks, such as !@#$%^&*()_+[]{}|:;"'<>,.?/~, which may not contribute significantly to the sentence meaning. Removing them helps simplify the data and focuses on the core text.

• Removing Numbers:
• Purpose: Exclude numerical digits from the text. This is useful
for tasks where numerical values are not relevant, such as
sentiment analysis or topic modelling.

• Removing URLs and Email Addresses:
• Purpose: Omit web links (URLs) and email addresses, as they are typically not informative for many text analysis tasks. Removing them helps reduce noise in the data.

• Handling Non-Breaking Spaces and Extra Spaces:
• Purpose: Replace non-breaking spaces with regular spaces and remove extra spaces to standardise text formatting. This ensures that spaces are consistent and do not impact text analysis.

• Removing HTML/JSON Tags:
• Purpose: Strip away HTML and JSON tags to obtain plain text from web pages or structured data. This is essential when dealing with text extracted from web content or JSON files.

• Removing Non-Alphabetic Characters:
• Purpose: Eliminate characters that are not letters of the alphabet, such as digits, symbols, or other non-alphabetic characters. This step helps to focus on the textual content.

• Handling Emojis:
• Purpose: Depending on the analysis task, you can choose to remove, replace, or retain emojis. Emojis can convey sentiment and add meaning to text, so handling them depends on your specific goals.

• Handling Non-ASCII Characters:
• Purpose: Remove or replace non-ASCII characters to handle text in different character encodings. This is important for working with diverse text data in various languages and scripts.

• Removing Diacritics:
• Purpose: Use the "unicodedata" library to remove accents and diacritical marks from characters. This simplifies the text and ensures that words with and without diacritics are treated the same way.

• Handling Repeated Characters:
• Purpose: Reduce repeated characters to their single occurrence to avoid skewing the analysis. For example, "coooool" becomes "cool."

• Using Regular Expressions (re library):
• Purpose: Regular expressions are powerful tools for pattern matching and text manipulation. They are used to find and replace specific patterns or characters in the text. This is especially helpful for complex text transformations.
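
The sketch below strings several of the listed steps together using the re and unicodedata libraries mentioned above; the exact patterns, their order, and the decision to drop all non-alphabetic characters are illustrative assumptions.

import re
import unicodedata

def preprocess(text):
    text = text.lower()                                  # lowercasing
    text = re.sub(r"https?://\S+|\S+@\S+", " ", text)    # remove URLs and email addresses
    text = re.sub(r"<[^>]+>", " ", text)                 # strip HTML-style tags
    # Remove diacritics via Unicode normalisation.
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = re.sub(r"(.)\1{2,}", r"\1\1", text)           # reduce repeated characters
    text = re.sub(r"[^a-z\s]", " ", text)                # keep only alphabetic characters
    text = re.sub(r"\s+", " ", text).strip()             # collapse extra spaces
    return text

print(preprocess("Heeeello!!! Visit https://example.com for 100% café deals"))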

Topic Modelling:

1. Latent Dirichlet Allocation (LDA):

• Purpose: LDA is a probabilistic model used for topic modelling in text data. It assumes that documents are mixtures of topics, and topics are mixtures of words. LDA aims to uncover these underlying topics and their distributions within a given corpus of text.

How it works:

• LDA views the entire set of documents as a mixture of topics, in which each topic is characterised by a distribution of words and sentences.
• It assigns words in such a way that they form a proper paragraph for the domain of the uploaded documents.
• The result is a set of topics, where each topic is represented by sentences and each sentence is associated with a topic.

Applications: LDA is used in various NLP tasks, such as document clustering, information retrieval, content recommendation, and content summarisation.

2. Gensim:

• Purpose: Gensim is an open-source Python library designed for topic modelling and document similarity analysis. It provides efficient tools for working with large text corpora and applying LDA and other topic modelling algorithms.

Key Features:

• Implementation of LDA: Gensim provides a user-friendly API for training LDA models on text data.

• Scalability: It can handle large corpora efficiently, making it suitable for real-world applications.

• Word Embeddings: Gensim also includes tools for training Word2Vec models for word embeddings.

• Usage: Gensim is widely used in academia and industry for various NLP tasks, including document clustering, content recommendation, and information retrieval.
Typical Steps in Using LDA with Gensim:

• Preprocessing: Prepare your text data by tokenising, lowercasing, and removing stop words and unwanted characters. This ensures that the text data is clean and ready for modelling.

• Creating a Dictionary and Corpus:
• Use Gensim to create a dictionary that maps words to numerical IDs.
• Create a corpus that represents each document as a bag-of-words, where the word IDs are associated with their frequency in the document.

• Training the LDA Model:
• Specify the number of topics you want to discover.
• Train an LDA model on the prepared corpus, specifying the dictionary and the number of topics.
• The LDA model will learn topic-word distributions and document-topic distributions.

• Interpreting Topics:
• Examine the top words associated with each topic to understand the themes it represents.
• Analyse the document-topic distributions to see which topics are prevalent in each document.

• Visualisation and Evaluation:
• Visualise the results, often using tools like pyLDAvis or Matplotlib, to gain insights into topic distributions and relationships.
• Evaluate the quality of topics using metrics like coherence scores.
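
The following brief sketch walks through the steps above with Gensim; the toy documents, topic count, and pass count are illustrative assumptions.

from gensim import corpora
from gensim.models import LdaModel

# Toy, pre-tokenised documents (illustrative assumption).
docs = [["summarisation", "rule", "engine", "documents"],
        ["medical", "domain", "summary", "documents"]]

dictionary = corpora.Dictionary(docs)               # map each word to a numerical ID
corpus = [dictionary.doc2bow(doc) for doc in docs]  # bag-of-words representation per document

# Train an LDA model with an assumed topic count of 2.
lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2, passes=10)

# Inspect the top words of each discovered topic.
for topic_id, words in lda.print_topics(num_words=4):
    print(topic_id, words)
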
TF-IDF Analysis:
1. Term Frequency-Inverse Document Frequency (TF-IDF)
Analysis:

Purpose: TF-IDF, the abbreviation of Term Frequency-Inverse Document Frequency, is a critical technique in NLP used to assess how useful the words in a document are relative to a collection of documents (corpus). It is employed to identify significant terms in a document while down-weighting repetitive words of little importance to the summary.

How it works:

• Term Frequency (TF): It calculates the frequency of occurrence of sentences and words in the uploaded documents and assigns ranking scores to these frequencies. The more often a sentence or word is repeated, the higher its score, which helps decide the importance of the sentence in the summarisation.
• Inverse Document Frequency (IDF): IDF accounts for how common or rare a term is across the entire corpus. Rare terms are given higher scores, emphasising their uniqueness.
• Combining TF and IDF: The TF and IDF values are multiplied to obtain the TF-IDF score for each term in a document. This score reflects the importance of the term in that document relative to the entire corpus.
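
Expressed compactly (a standard TF-IDF formulation, assumed here for illustration): tfidf(t, d) = tf(t, d) × log(N / df(t)), where tf(t, d) is the frequency of term t in document d, N is the total number of documents, and df(t) is the number of documents containing t.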

Applications: TF-IDF is widely used in information retrieval, text mining, and document classification. It helps identify keywords, assess document similarity, and improve the performance of various NLP tasks.
2. TfidfVectorizer (scikit-learn):

Purpose: The TfidfVectorizer is a component of the scikit-learn library, a popular Python library for machine learning and data analysis. TfidfVectorizer is specifically designed to convert a collection of text documents into a matrix of TF-IDF features.

Key Features:

• Document-Term Matrix: TfidfVectorizer takes a collection of text documents and transforms it into a document-term matrix, where each row represents a document, and each column represents a unique term.

• Customisation: Users can customise various parameters, such as adjusting TF-IDF weightings, handling stop words, and specifying n-grams (word combinations).

• Efficiency: The scikit-learn implementation of TfidfVectorizer is efficient and can handle large corpora with ease.

Usage: TfidfVectorizer is employed in NLP for tasks like text classification, clustering, and information retrieval. It transforms textual data into a numerical format that can be used as input for machine learning models.
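
A minimal sketch of building the document-term matrix with TfidfVectorizer is given below; the documents and parameter choices are illustrative assumptions.

from sklearn.feature_extraction.text import TfidfVectorizer

# Toy documents (illustrative assumption).
documents = ["the rule engine summarises documents",
             "documents are summarised by the rule engine",
             "cosine similarity compares document vectors"]

vectorizer = TfidfVectorizer(stop_words="english", ngram_range=(1, 2))
tfidf_matrix = vectorizer.fit_transform(documents)  # document-term matrix

print(tfidf_matrix.shape)                           # (number of documents, number of terms)
print(vectorizer.get_feature_names_out()[:5])       # first few terms (scikit-learn >= 1.0)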

3. Cosine Similarity (scikit-learn):

Purpose: Cosine similarity is a technique used to recognise sentences that are similar across the multiple uploaded documents. Using cosine similarity, the content selected for the summary becomes more accurate and precise.
How it works:

• For each pair of documents, their TF-IDF vectors are treated as vectors in a high-dimensional space.
• The cosine of the angle between these vectors is computed. A smaller angle (cosine value closer to 1) indicates higher similarity, while a larger angle (cosine value closer to -1) suggests dissimilarity.
• Cosine similarity values range from -1 (perfect dissimilarity) to 1 (perfect similarity); for non-negative TF-IDF vectors, the values fall between 0 and 1.

Applications: Cosine similarity is widely used for tasks like document retrieval, recommendation systems, and clustering. It allows for the measurement of how closely documents are related based on their content.
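
A short sketch combining TF-IDF vectors with scikit-learn's cosine_similarity, continuing the assumed toy documents; values near 1 mark near-duplicate documents.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy documents (illustrative assumption).
documents = ["the rule engine summarises documents",
             "documents are summarised by the rule engine",
             "cosine similarity compares document vectors"]

tfidf = TfidfVectorizer().fit_transform(documents)
similarity = cosine_similarity(tfidf)  # square matrix of pairwise similarities

# Values close to 1 indicate very similar documents.
print(similarity.round(2))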

Clustering:

1. K-Means Clustering:

Purpose: K-Means clustering is a fundamental unsupervised machine learning algorithm used to group similar data points into clusters. In the context of text analysis, K-Means can be applied to group similar documents based on their content.

How it works:

• Initialisation: K-Means begins by selecting a specified number of clusters (K) and initialising the cluster centroids randomly or using other methods.

• Assignment: Each document is assigned to the nearest cluster based on a similarity metric, such as cosine similarity.

• Update: The cluster centroids are recalculated as the mean of the documents within each cluster.

• Iteration: The assignment and update steps are repeated until convergence, where data points no longer change clusters significantly.

• Result: The result is K clusters, each containing a group of documents with high similarity within the cluster.

Calculating the Number of Clusters: Selecting the optimal number of clusters (K) is a crucial step. Common techniques, such as the elbow method or domain knowledge, can be used. In this work, K is determined based on the square root of the number of documents, but other approaches may also be suitable, depending on the specific use case.

Applications: K-Means clustering is widely used in document clustering, image segmentation, customer segmentation, and other unsupervised learning tasks. In text analysis, it can be employed to group similar documents, which is useful for information retrieval, recommendation systems, and content organisation.
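
The sketch below clusters TF-IDF document vectors with K-Means and sets K to the square root of the document count, as described above; the documents themselves are illustrative assumptions.

import math
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# Toy documents (illustrative assumption).
documents = ["medical report on patient treatment",
             "legal analysis of the contract dispute",
             "clinical study of the new medicine",
             "court ruling on the property case"]

tfidf = TfidfVectorizer(stop_words="english").fit_transform(documents)

k = max(1, round(math.sqrt(len(documents))))  # K as the square root of the document count
kmeans = KMeans(n_clusters=k, n_init=10, random_state=0).fit(tfidf)

# Each document is assigned to the nearest cluster centroid.
print(kmeans.labels_)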

2. NumPy:

Purpose: NumPy is a fundamental Python library for numerical and array operations. It is essential for various mathematical and statistical computations, making it valuable for tasks like clustering and data analysis.
Key Features:

• N-Dimensional Arrays: NumPy provides support for N-dimensional arrays, allowing for efficient storage and manipulation of large datasets.

• Mathematical Functions: NumPy offers a wide range of mathematical functions and operations, such as mean, maximum, minimum, and more.

• Linear Algebra: It includes functions for linear algebra operations, making it a valuable tool for mathematical and statistical analysis.

• Interoperability: NumPy is compatible with many other libraries and tools commonly used in the scientific and data analysis communities.

Usage: In the context of document clustering, NumPy can be used for various tasks, including calculating the similarity between documents (e.g., cosine similarity), determining cluster centroids, and performing numerical operations to assess cluster quality, such as calculating the maximum similarity within clusters.

NumPy is particularly helpful when working with large datasets or when performing complex mathematical operations within the clustering process. It is often integrated with other libraries like scikit-learn for a comprehensive analysis of textual data.
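
A small sketch of the NumPy operations referred to above (cosine similarity between two document vectors and a cluster centroid); the vectors are illustrative assumptions.

import numpy as np

# Toy TF-IDF-like vectors for two documents (illustrative assumption).
doc_a = np.array([0.1, 0.0, 0.7, 0.2])
doc_b = np.array([0.2, 0.1, 0.6, 0.1])

# Cosine similarity as the normalised dot product of the two vectors.
cosine = np.dot(doc_a, doc_b) / (np.linalg.norm(doc_a) * np.linalg.norm(doc_b))

# Cluster centroid as the element-wise mean of the member vectors.
cluster = np.vstack([doc_a, doc_b])
centroid = cluster.mean(axis=0)

print(round(float(cosine), 3), centroid)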

Extractive Summarisation:

1. Extractive Summarisation:

Purpose: Extractive summarisation is a technique used in natural language processing to create a summary of a longer document by selecting and extracting sentences or passages that are deemed the most important or representative of the content. Unlike abstractive summarisation, which generates summaries in a more human-like manner, extractive summarisation directly extracts and arranges existing sentences from the uploaded document.

How it works:

• Sentence Scoring: In extractive summarisation, sentences within the source document are scored based on their relevance, importance, or informativeness. Various algorithms and features can be used to determine these scores.

• Selection: The sentences with the highest scores, such as those with the highest frequency of repetition, are selected for inclusion in the summary. These selected sentences are then arranged in a coherent manner to form the extractive summary.

• Content Retention: Extractive summarisation aims to retain the most crucial information from the uploaded documents while maintaining the original wording.

Applications: Extractive summarisation is widely used for summarising long documents, news articles, research papers, and other text sources. It simplifies content consumption and provides users with concise overviews of lengthy documents.

2. LexRank Algorithm and LexRankSummarizer (sumy):

Purpose: LexRank is an extractive summarisation algorithm designed to rank sentences based on their similarity and importance. The LexRankSummarizer, implemented in the "sumy" library, leverages the LexRank algorithm to create extractive summaries.
How it works:

• Graph-Based Approach: LexRank is a graph-based algorithm. It first constructs a similarity graph in which nodes represent sentences and edges represent the similarity between sentences.
• Sentence Similarity: Sentence similarity is typically computed using techniques like cosine similarity, Jaccard similarity, or other measures.
• Ranking Sentences: LexRank applies the PageRank algorithm, which was originally developed for ranking web pages, to rank the sentences in the graph. This ranking reflects the importance and necessity of each sentence in the document.
• Selection: Sentences with the highest LexRank scores are
selected for inclusion in the extractive summary.

Applications: LexRank, and consequently the LexRankSummarizer, is particularly useful for summarising documents where sentence importance is determined by their relationships and similarities to other sentences in the document. It has applications in content summarisation, information retrieval, and document summarisation tasks.
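
A brief sketch of producing an extractive summary with sumy's LexRankSummarizer, as referenced above; the input text and the two-sentence summary length are illustrative assumptions.

from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.lex_rank import LexRankSummarizer

# Toy input text (illustrative assumption); Tokenizer("english") relies on NLTK data.
text = ("Multi-document summarisation condenses several documents into one summary. "
        "Extractive methods select the most important sentences from the source text. "
        "A rule engine can then arrange the selected sentences coherently.")

parser = PlaintextParser.from_string(text, Tokenizer("english"))
summarizer = LexRankSummarizer()

# Request a two-sentence extractive summary.
for sentence in summarizer(parser.document, 2):
    print(sentence)
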
Figure: Summarisation of multiple documents. Each document (D1, D2, ..., Dn) yields a summary (S1, S2, ..., Sn); useful sentences are extracted and sequenced, and then combined into the final multi-document summary. The figure outlines the sequence of steps involved in summarising multiple documents.
Results & Analysis:

Results:

1. Precision and Relevance: Rule-based systems tend to provide precise summaries by focusing on specific rules. If the rules are well-defined and accurately capture relevant information, the summaries are likely to be highly relevant to the source documents.

2. Coverage: The coverage of the summarisation system refers to its ability to include important information from various parts of the input documents. A good rule engine should cover a wide range of topics and details from the source documents.

3. Redundancy: Redundancy in summaries occurs when similar or identical information is repeated. Careful rule design can minimise redundancy and ensure that the summary contains diverse and valuable content.

4. Coherence: Coherence assesses the logical flow and smooth transition between sentences in the summary. While extractive approaches preserve the original sentences, crafting rules that ensure coherent sentences in the obtained summary is crucial for readability.

Analysis:

1. Rule Effectiveness: Evaluate the effectiveness of the rules in capturing relevant information. Analyse which rules contribute most significantly to the quality of the summaries and refine or modify them as needed.

2. Comparative Analysis: Compare the rule-based approach with other summarisation techniques, such as abstractive methods or machine learning-based models. Determine the advantages and disadvantages of the rule-based approach in different scenarios.

3. Scalability: Assess how well the rule engine scales with an increasing number of source documents. Evaluate its performance with small-scale and large-scale document sets to ensure efficiency and accuracy in different contexts.

4. Limitations: Identify the limitations of the rule-based approach, such as its inability to handle ambiguous language or complex contexts. Understanding these limitations can guide future improvements and research directions.

Future Scope:

Enhancing Rule Engine Capabilities: Improving the rule engine itself is crucial. Researchers can work on creating more sophisticated and adaptable rule-based systems that can handle a wider range of document types, languages, and domains. These systems could employ machine learning techniques to automatically generate or refine rules.

Scalability: Making rule-based multi-document summarisation systems scalable to handle large volumes of documents is essential. Future research can focus on optimising performance and efficiency to process and summarise large datasets in real-time or near-real-time.

Customisation and User-Defined Rules: Allowing users to define their own rules or preferences for summarisation can enhance the usability of such systems. Future research can explore ways to make summarisation engines more customisable, so they can generate summaries tailored to specific user needs.
Hybrid Approaches: Combining rule-based summarisation with
other techniques like extractive or abstractive methods,
reinforcement learning, or deep learning can lead to more robust
and effective summarisation systems.

Conclusion:

The incorporation of a rule engine into the process of extractive multi-document summarisation has demonstrated its remarkable effectiveness and versatility in managing a wide array of intricate textual content. This rule engine functions independently of any predefined sequence, instead dynamically restructuring the order of operations in response to its own decision-making processes. This unique attribute endows the rule engine with the capacity to autonomously determine the most appropriate arrangement of tasks, tailored to the specific demands of the situation.

Through the utilisation of the rule engine technique and cross-document coherence rules, the rule engine has yielded substantial improvements in the efficiency and efficacy of multi-document summarisation. It excels in distilling the core essence of source documents, resulting in more polished and coherent summarisation.

In contrast to sequential methodologies such as Clustering, TF-IDF, Topic Modelling, and Cosine Similarity, the rule engine's adaptability in orchestrating the sequence of operations introduces a dynamic and rule-driven approach. This innovative approach optimises the summarisation process by reordering the operations based on the inherent requirements of the documents and the rules it employs. This flexibility proves particularly advantageous in scenarios characterised by significant variations in content complexity and structural composition.
References:

1. Ramya, R. S., Shahina Parveen, M., Hiremath, S., Pugalia, I., Manjula, S. H., & Venugopal, K. R. (2023). A Survey on Automatic Text Summarization and its Techniques. International Journal of Intelligent Systems and Applications in Engineering (IJISAE), 11(1s), 63–71.

2. Zhang, Y., Ni, A., Mao, Z., Wu, C. H., Zhu, C., Deb, B.,
Awadallah, A. H., Radev, D., & Zhang, R. (2022). SUMMN: A
Multi-Stage Summarization Framework for Long Input Dialogues
and Documents. In Proceedings of the 60th Annual Meeting of
the Association for Computational Linguistics (Vol. 1: Long
Papers, pp. 1592-1604). Association for Computational
Linguistics.

3. Joshi, A., Fidalgo, E., Alegre, E., & Fernández-Robles, L. (2019). SummCoder: An unsupervised framework for extractive text summarization based on deep auto-encoders. Expert Systems with Applications, 129, 200-215.

4. Kurian, S., & Mathew, S. (2020). Survey of scientific document summarization methods. Computer Science, 21(2), 3356. https://doi.org/10.7494/csci.2020.21.2.3356

5. Satre, S. M., Patil, M., & Raju, S. (2019). Multi-Document Summarization using Fuzzy and Hierarchical Approach. International Research Journal of Engineering and Technology (IRJET), 06(04), 2607. https://www.irjet.net/ISSN: 2395-0056.

6. Agarwal, R., & Chatterjee, N. (2022). Improvements in Multi-Document Abstractive Summarization using Multi Sentence Compression with Word Graph and Node Alignment. Expert Systems with Applications, 190, 116154.

7. Asa, A. S., Akter, S., Uddin, M. P., Hossain, M. D., Roy, S. K., &
Afjal, M. I. (2017). A Comprehensive Survey on Extractive Text
Summarization Techniques. American Journal of Engineering
Research (AJER), 6(1), 226-239. https://www.ajer.org/ISSN:
2320-0847 | p-ISSN: 2320-0936

8. Liu, P. J., Saleh, M., Pot, E., Goodrich, B., Sepassi, R., Kaiser,
Ł., & Shazeer, N. (2018). Generating Wikipedia by Summarizing
Long Sequences. In Proceedings of the International
Conference on Learning Representations (ICLR),
arXiv:1801.10198v1 [cs.CL].

9. Tomer, M., & Kumar, M. (2021). Multi-document extractive text summarization based on firefly algorithm. Journal of King Saud University – Computer and Information Sciences, Advance online publication. https://doi.org/10.1016/j.jksuci.2021.04.004

10. White, C. T., Molino, N. P., Yang, J. S., & Conroy, J. M. (2022). occams: A Text Summarization Package. Analytics, 2, 546–559. https://doi.org/10.3390/analytics2030030

11. Afsharizadeh, M., Ebrahimpour-Komleh, H., Bagheri, A., & Chrupala, G. (2022). A Survey on Multi-document Summarization and Domain-Oriented Approaches. Journal of Information Systems and Telecommunication, 10(1), 68-79. https://doi.org/10.52547/jist.16245.10.37.68

12. Koh, H. Y., Ju, J., Liu, M., & Pan, S. (2022). An Empirical Survey on Long Document Summarization: Datasets, Models and Metrics. arXiv preprint arXiv:2207.00939v1 [cs.CL].
13. P, Keerthana. (2021). Automatic Text Summarization Using
Deep Learning. EPRA International Journal of Multidisciplinary
Research (IJMR), 7(4). https://doi.org/10.36713/epra2013

14. Uckan, T., & Karci, A. (2020). Extractive Multi-Document Text Summarization Based on Graph Independent Sets. Egyptian Informatics Journal, 21(3), 145-157.

15. White, C. T., Molino, N. P., Yang, J. S., & Conroy, J. M. (Year of
publication). occams: A Text Summarization Package. Analytics,
2, 546-559. https://doi.org/10.3390/analytics2030030.

16. Rao, P. R. K., & Devi, S. L. (2018). Enhancing Multi-Document Summarization Using Concepts. AU-KBC Research Centre, MIT Campus of Anna University, Chennai, India. [Published online: 10 March 2018]

17. Prabhala, B. (2014). Scalable Multi-Document Summarization Using Natural Language Processing. Master's thesis, Rochester Institute of Technology, B. Thomas Golisano College of Computing and Information Sciences, Rochester, New York. Supervised by Dr. Rajendra K. Raj.
