Text Summarization Using NLP
NLP
BY:
SOUNDARAJULU R – 21MDT1069
GUIDED BY: DR. SOMNATH BERA
CONTENTS
• Background
• Project objective
• About dataset
• About algorithms
• Algorithms implemented
• Comparisons
• Conclusion
• Future work
• References
INTRODUCTION TO NATURAL LANGUAGE PROCESSING
• NLP stands for Natural Language Processing, which is used to help machines understand and interpret human language.
• Essentially, it is the automatic manipulation of natural language, such as speech and text, by software for further analysis to extract the required information.
• This enables machines to process human language in the form of text or voice data.
• It cannot be applied to a single paragraph alone, because it requires a larger amount of text data.
LITERATURE REVIEW
• Vishnu Preethi K, Vijaya MS, 16 April 2018, "Text Summarizers for Education News Articles": In this paper, a performance analysis was carried out, finding that the well-known TextRank algorithm, which had not been used much in text summarization research, produces improved results for both datasets.
• Termite: Visualization Techniques for Assessing Textual Topic Models, Jason Chuang, Christopher D. Manning, Jeffrey Heer, Advanced Visual Interfaces, 2012: In this paper, they examine document-topic probabilities, and also focus on understanding terms and term-topic distributions.
• Carson Sievert and Kenneth Shirley. 2014. LDAvis: A method for visualizing and interpreting topics. In Proceedings of the Workshop on Interactive Language Learning, Visualization, and Interfaces, pages 63–70, Baltimore, Maryland, USA. Association for Computational Linguistics: In this paper, they explain LDAvis, a web-based interactive visual representation for examining topic-term relations, implemented as the R package "LDAvis".
PROJECT OBJECTIVE
• To summarize text documents using NLP and compare the results of the algorithms implemented.
DATASET
ALGORITHMS IMPLEMENTED
• TextRank Algorithm
• LexRank Algorithm
TEXT SUMMARIZATION
• Text summarization is the process of shortening the number of sentences and words in a document without changing its meaning.
• There are various methods to extract information from raw text data and use it for a summarization model; generally they can be categorized as extractive and abstractive.
STEPS IN TEXT SUMMARIZATION
• Text cleaning
• Sentence tokenization
• Word tokenization
• Summarization
TEXT CLEANING
• Removing Punctuations
• Removing Numbers, Extra Cases
• Removing HTML Tags
• Removing & Finding URL
• Removing & Finding Email ID
• Removing Stop Words
• Spell Check
• Remove the less frequent words
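A minimal sketch of these cleaning steps in Python (assumptions: NLTK's stopword list has been downloaded; the regular expressions are illustrative, not exhaustive; spell checking and rare-word filtering are omitted for brevity):

import re
import string
from nltk.corpus import stopwords  # needs nltk.download("stopwords") once

def clean_text(text):
    text = re.sub(r"<[^>]+>", " ", text)                # remove HTML tags
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # find & remove URLs
    text = re.sub(r"\S+@\S+", " ", text)                # find & remove email IDs
    text = text.translate(str.maketrans("", "", string.punctuation))  # remove punctuation
    text = re.sub(r"\d+", " ", text)                    # remove numbers
    words = text.lower().split()                        # normalize case, drop extra spaces
    stop_words = set(stopwords.words("english"))
    words = [w for w in words if w not in stop_words]   # remove stop words
    return " ".join(words)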
SENTENCE AND WORD TOKENIZATION
• WORD TOKENIZATION:-
• Word tokenization is the process of splitting a large sample of text into words.
• Each word is captured and subjected to additional analysis, such as classifying and counting words for a specific sentiment and so on.
• SENTENCE TOKENIZATION:-
• Sentence tokenization is the process of splitting text into individual sentences.
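A minimal sketch of both steps with NLTK (assuming the punkt tokenizer data has been downloaded; the sample text is a placeholder):

from nltk.tokenize import sent_tokenize, word_tokenize  # needs nltk.download("punkt") once

text = "Text summarization shortens a document. It keeps the meaning intact."
sentences = sent_tokenize(text)  # ['Text summarization shortens a document.', 'It keeps the meaning intact.']
words = word_tokenize(text)      # ['Text', 'summarization', 'shortens', 'a', 'document', '.', 'It', ...]
print(sentences)
print(words)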
TEXTRANK ALGORITHM
• TextRank (2004) is an unsupervised graph-based ranking model for text processing, based on Google's PageRank algorithm.
• First the whole text is split into sentences; then the algorithm builds a graph in which sentences are the nodes and word overlap between sentences forms the edges (links).
• Finally, it identifies the most important nodes of this network and selects those sentences for the summary.
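A minimal sketch of applying TextRank through the sumy library, one of several libraries offering a TextRank implementation (the input text is a placeholder):

from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.text_rank import TextRankSummarizer

document = "..."  # placeholder: the raw article text goes here
parser = PlaintextParser.from_string(document, Tokenizer("english"))
summarizer = TextRankSummarizer()
for sentence in summarizer(parser.document, 3):  # keep the 3 highest-ranked sentences
    print(sentence)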
LEXRANK ALGORITHM
• LexRank is an unsupervised graph based approach for text summarization in which the scoring of
sentences is done using the graph method.
• The main idea is that sentences "suggest" other similar sentences to the reader.
• Ex: This is an example of the article. This is the second example sentence. This is the third sentence, which is the most important because it says the other sentences are just examples.
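A matching sketch for LexRank, again via sumy (the example text above is reused as a placeholder):

from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.lex_rank import LexRankSummarizer

document = ("This is an example of the article. This is the second example sentence. "
            "This is the third sentence, which is the most important.")
parser = PlaintextParser.from_string(document, Tokenizer("english"))
summarizer = LexRankSummarizer()
for sentence in summarizer(parser.document, 1):  # keep the single most central sentence
    print(sentence)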
LEXRANK SCORES
• Cosine Similarity
• Adjacency Matrix
• Connectivity Matrix
• Eigenvector Centrality
These scores define the similarity matrix of classical and continuous LexRank.
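A rough sketch of how these pieces combine, under the assumption of TF-IDF sentence vectors: cosine similarity fills the similarity matrix, thresholding it gives the connectivity (adjacency) matrix of classical LexRank, and power iteration approximates the eigenvector centrality scores (the 0.1 threshold and the three toy sentences are illustrative):

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

sentences = ["This is an example of the article.",
             "This is the second example sentence.",
             "This is the third sentence, which is the most important."]

tfidf = TfidfVectorizer().fit_transform(sentences)
similarity = cosine_similarity(tfidf)          # cosine similarity matrix

adjacency = (similarity > 0.1).astype(float)   # connectivity/adjacency matrix (classical LexRank)
transition = adjacency / adjacency.sum(axis=1, keepdims=True)  # row-normalize to a Markov matrix

scores = np.ones(len(sentences)) / len(sentences)
for _ in range(50):                            # power iteration approximates eigenvector centrality
    scores = transition.T @ scores
print(scores)                                  # higher score = more central sentence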
[Output comparison: LexRank vs. TextRank summaries]
ROUGE
• The ROUGE metric is used to measure the performance of automatic summarization and machine translation tasks.
• ROUGE-N measures the number of matching ‘n-grams’ between our model-generated text and a
‘reference’.
• A unigram (1-gram) would consist of a single word. A bigram (2-gram) consists of two consecutive
words and so on.
RECALL, PRECISION & F1 SCORE
Recall:-
• Recall ensures our model captures all of the information contained in the reference.
• Recall counts the number of overlapping n-grams found in both the model output and the reference, then divides this number by the total number of n-grams in the reference.
Precision:-
• Precision is calculated in almost the exact same way, but rather than dividing by the reference n-gram
count, we divide by the model n-gram count.
F1 Score:-
• The F1 score is (2 × Recall × Precision) divided by (Recall + Precision).
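A minimal sketch computing ROUGE-1 recall, precision, and F1 by hand (clipped unigram counting; a real evaluation would typically use a library such as rouge-score, and the two sentences are placeholders):

def rouge_n(candidate, reference, n=1):
    def ngrams(text, n):
        tokens = text.lower().split()
        return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

    cand, ref = ngrams(candidate, n), ngrams(reference, n)
    overlap = sum(min(cand.count(g), ref.count(g)) for g in set(cand))  # clipped overlap count
    recall = overlap / len(ref)         # overlap / n-grams in reference
    precision = overlap / len(cand)     # overlap / n-grams in model output
    f1 = 2 * recall * precision / (recall + precision) if recall + precision else 0.0
    return recall, precision, f1

print(rouge_n("the cat sat on the mat", "the cat is on the mat"))  # (0.833..., 0.833..., 0.833...)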
ROUGE FOR TEXTRANK
• R – Recall
• P – Precision
• F – F1 Score
ROUGE FOR LEXRANK
• R – Recall
• P – Precision
• F – F1 Score
LSA (LATENT SEMANTIC ANALYSIS)
• LSA (Latent Semantic Analysis), also known as LSI (Latent Semantic Indexing), uses the bag-of-words (BoW) model, which results in a term-document matrix (the occurrence of terms in each document).
• Rows represent terms and columns represent documents. LSA learns latent topics by performing a matrix decomposition on this matrix using singular value decomposition (SVD).
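A minimal sketch of LSA with scikit-learn, using TF-IDF weights as a common variant of the BoW matrix described above (the toy documents and the choice of 2 components are illustrative):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

docs = ["stock markets fell sharply today",
        "the new phone has a faster processor",
        "investors worry about rising interest rates"]

vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)            # document-term matrix
svd = TruncatedSVD(n_components=2)            # SVD uncovers 2 latent topics
doc_topics = svd.fit_transform(X)             # document-topic weights

terms = vectorizer.get_feature_names_out()
for i, component in enumerate(svd.components_):         # top terms per latent topic
    top = [terms[j] for j in component.argsort()[-3:]]
    print(f"topic {i}: {top}")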
TEXT CLASSIFICATION VS TOPIC MODELING
https://www.datacamp.com/tutorial/discovering-hidden-topics-python
• Text classification is a supervised machine learning problem, where a text document or article is classified into a pre-defined set of classes. Topic modeling is the process of discovering groups of co-occurring words in text documents.
• Topic modeling can be used to help solve the text classification problem. Topic modeling identifies the topics present in a document, while text classification classifies the text into a single class.
IMPLEMENTING LSA
Here, only the tech-related news articles appear to have a wider spread, whereas the other news articles are nicely clustered.
• LDA is a generative probabilistic model that assumes each topic is a mixture over an underlying set of words, and each document is a mixture over a set of topic probabilities.
IMPLEMENTING LDA
• Loading data
• Data cleaning
• EDA
• Preparing data for LDA analysis
• LDA model training
• Analyzing LDA model results
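A condensed sketch of the "preparing data" and "model training" steps with gensim (the toy token lists stand in for the cleaned, tokenized dataset; num_topics and passes are illustrative choices):

from gensim import corpora
from gensim.models import LdaModel

docs = [["stock", "market", "investor", "rates"],
        ["phone", "processor", "battery", "screen"],
        ["market", "rates", "economy", "bank"]]   # stand-in for the cleaned, tokenized dataset

dictionary = corpora.Dictionary(docs)               # term <-> id mapping
corpus = [dictionary.doc2bow(doc) for doc in docs]  # bag-of-words vector per document
lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2, passes=10, random_state=0)

for topic_id, terms in lda.print_topics():          # inspect the learned topics
    print(topic_id, terms)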
LSA VS LDA
• LSA and LDA take the same input: a bag-of-words matrix. LSA focuses on reducing the matrix dimensionality, while LDA solves the topic modeling problem.
• Both LDA and LSA are unsupervised.
CONCLUSION
• We performed text cleaning and preprocessing steps, then applied word tokenization and sentence tokenization to produce the summary, although the resulting summary was too large. We then implemented LexRank and TextRank on the dataset and obtained F1 measures. Comparing the results, we conclude that LexRank gives an F1 score of 98%. We also implemented LSA and LDA on different datasets (BBC News and NIPS papers).
• We performed EDA and document clustering for LSA, and generated the summarization output by query.
• For LDA, we created a word cloud and LDA visualizations (intertopic distance map via multidimensional scaling, marginal topic distribution, overall term frequency, and estimated term frequency within the selected topic).
• Both LDA and LSA are used for topic summarization, depending on the dataset.
REFERENCES
• [1] Jason Chuang, Christopher D. Manning, Jeffrey Heer. Termite: Visualization Techniques for Assessing Textual Topic Models. Advanced Visual Interfaces, 2012.
• [2] Carson Sievert and Kenneth Shirley. 2014. LDAvis: A method for visualizing and interpreting topics. In Proceedings of the
Workshop on Interactive Language Learning, Visualization, and Interfaces, pages 63–70, Baltimore, Maryland, USA. Association
for Computational Linguistics.
• [3] Vishnu Preethi K, Vijaya MS. Text Summarizers for Education News Articles. 16 April 2018, Invention Journals. https://www.ijesi.org/v7i4(version-2).html
• [4] Alomari, A., Idris, N., Sabri, A. Q. M., & Alsmadi, I. (2022). Deep reinforcement and transfer learning for abstractive text
summarization: A review. Computer Speech & Language, 71, 101276.
• [5] Wazery, Y. M., Saleh, M. E., Alharbi, A., & Ali, A. A. (2022). Abstractive Arabic Text Summarization Based on Deep Learning.
Computational Intelligence and Neuroscience, 2022.
• [6] Laskar, M. T. R., Hoque, E., & Huang, J. X. (2022). Domain Adaptation with Pre-trained Transformers for Query-Focused
Abstractive Text Summarization. Computational Linguistics, 48(2), 279-320.
• [7] Suleiman, D., & Awajan, A. (2022). Multilayer encoder and single-layer decoder for abstractive Arabic text summarization.
Knowledge-Based Systems, 237, 107791.
• [8] Ertam, F., & Aydin, G. (2022). Abstractive text summarization using deep learning with a new Turkish summarization
benchmark dataset. Concurrency and Computation: Practice and Experience, 34(9), e6482.
• [9] Aliakbarpour, H., Manzuri, M. T., & Rahmani, A. M. (2022). Improving the readability and saliency of abstractive text
summarization using combination of deep neural networks equipped with auxiliary attention mechanism. The Journal of
Supercomputing, 78(2), 2528-2555.
• [10] Aggarwal, C. C. (2022). Text summarization. In Machine Learning for Text (pp. 393-418). Springer, Cham.
• [11] Gupta, A., Chugh, D., & Katarya, R. (2022). Automated news summarization using transformers. In Sustainable
Advanced Computing (pp. 249-259). Springer, Singapore.
• [12] Khurana, A., & Bhatnagar, V. (2022). Investigating entropy for extractive document summarization. Expert Systems
with Applications, 187, 115820.
• [13] Zhong, M., Liu, Y., Xu, Y., Zhu, C., & Zeng, M. (2022, June). Dialoglm: Pre-trained model for long dialogue
understanding and summarization. In Proceedings of the AAAI Conference on Artificial Intelligence (Vol. 36, No. 10, pp.
11765-11773).
• [14] Ma, C., Zhang, W. E., Guo, M., Wang, H., & Sheng, Q. Z. (2022). Multi-document summarization via deep learning
techniques: A survey. ACM Computing Surveys, 55(5), 1-37.
• [15] Moro, G., & Ragazzi, L. (2022, February). Semantic Self-segmentation for Abstractive Summarization of Long Legal Documents in Low-resource Regimes. In Proceedings of the Thirty-Sixth AAAI Conference on Artificial Intelligence, Virtual (Vol. 22).
• [16] Patil, P., Rao, C., Reddy, G., Ram, R., & Meena, S. M. (2022). Extractive Text Summarization Using BERT. In
Proceedings of the 2nd International Conference on Recent Trends in Machine Learning, IoT, Smart Cities and
Applications (pp. 741-747). Springer, Singapore.
• [17] Mohan, G. B., & Kumar, R. P. (2022). A Comprehensive Survey on Topic Modeling in Text Summarization. Micro-
Electronics and Telecommunication Engineering, 231-240.
FUTURE WORK
• In the future, we can explore other techniques to improve the summaries, such as Natural Language Understanding, Natural Language Generation, Multi-Document Summarization, Personalized Summarization, Cross-Lingual Summarization, and Visual Summarization. These are just a few of the many potential directions for future research and development in text summarization.
THANK YOU