
TEXT SUMMARIZATION USING NLP

BY: SOUNDARAJULU R – 21MDT1069
GUIDED BY: DR. SOMNATH BERA
CONTENTS
• Background
• Project objective
• About dataset
• About algorithms
• Algorithms implemented
• Comparisons
• Conclusion
• Future work
• References

INTRODUCTION TO NATURAL LANGUAGE PROCESSING

• NLP stands for Natural Language Processing, which enables machines to understand and interpret human language.

• Essentially, it is the automated manipulation of natural language, such as speech and text, by software, in order to extract the required information for further analysis.

• NLP combines computational linguistics (rule-based modeling of human language) with statistical, machine learning, and deep learning models.

• Together these enable machines to process human language in the form of text or voice data.

• These techniques cannot be applied to a single paragraph alone, because they require larger amounts of text data.
LITERATURE REVIEW

• Vishnu Preethi K, Vijaya MS, 16 April 2018, "Text Summarizers for Education News Articles": In this paper, a performance analysis found that the well-known TextRank algorithm, which had not been used much in text summarization research, produces improved results on both datasets.

• Jason Chuang, Christopher D. Manning, Jeffrey Heer, "Termite: Visualization Techniques for Assessing Textual Topic Models", Advanced Visual Interfaces, 2012: In this paper, they examine document-topic probabilities, focusing on understanding terms and term-topic distributions.

• Carson Sievert and Kenneth Shirley, 2014, "LDAvis: A method for visualizing and interpreting topics", Proceedings of the Workshop on Interactive Language Learning, Visualization, and Interfaces, pages 63-70, Baltimore, Maryland, USA, Association for Computational Linguistics: In this paper, they present LDAvis, a web-based interactive visual representation for exploring topic-term relations, implemented as the R package "LDAvis".
PROJECT OBJECTIVE

• To summarize text documents using NLP and to compare the results of the algorithms implemented.

DATASET

• Article Mixture (TextRank & LexRank)

• BBC News (LSA): raw texts in 5 different categories (business, entertainment, politics, sports, and tech)
• NIPS 1987-2016 papers (LDA)

ALGORITHMS IMPLEMENTED

• TextRank Algorithm

• LexRank Algorithm

• ROUGE (Recall-Oriented Understudy for Gisting Evaluation), used for evaluation

• LSA (Latent Semantic Analysis)

• LDA (Latent Dirichlet Allocation)

TEXT SUMMARIZATION

• Text summarization is the process of shortening a document to fewer sentences and words without changing its meaning.

• There are various methods to extract information from raw text data and use it for a summarization model; generally they can be categorized as Extractive and Abstractive.

TYPES OF TEXT SUMMARIZATION

• Extractive methods select the main sentences within a text (without necessarily understanding the meaning), so the result is simply a subset of the full text.

• On the contrary, Abstractive models utilize advanced NLP (for example, word embeddings) to grasp the semantics of the text and generate a meaningful summary.

• Consequently, Abstractive methods are much harder to train (from scratch), as they need a large number of parameters and a lot of data.
HOW TO DO TEXT SUMMARIZATION

• Text cleaning

• Sentence tokenization

• Word tokenization

• Summarization

TEXT CLEANING

• Removing Punctuation
• Removing Numbers & Extra Whitespace
• Removing HTML Tags
• Finding & Removing URLs
• Finding & Removing Email IDs
• Removing Stop Words
• Spell Checking
• Removing Less Frequent Words

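As a rough sketch of how these cleaning steps might look in Python (the specific regular expressions and step order are assumptions, not the project's exact code):

import re
from nltk.corpus import stopwords  # assumes nltk.download('stopwords') has been run

STOP_WORDS = set(stopwords.words('english'))

def clean_text(text: str) -> str:
    text = re.sub(r'<[^>]+>', ' ', text)                # remove HTML tags
    text = re.sub(r'https?://\S+|www\.\S+', ' ', text)  # find & remove URLs
    text = re.sub(r'\S+@\S+', ' ', text)                # find & remove email IDs
    text = re.sub(r'[^a-zA-Z\s]', ' ', text)            # drop punctuation and numbers
    words = [w for w in text.lower().split() if w not in STOP_WORDS]
    return ' '.join(words)

print(clean_text("Visit https://example.com or mail me@example.com! 42 <b>tags</b>"))
# -> "visit mail tags"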
SENTENCE AND WORD TOKENIZATION

• SENTENCE TOKENIZATION: Sentence tokenization is the process of splitting text into individual sentences.

• WORD TOKENIZATION: Word tokenization is the process of splitting a large sample of text into words.

• Each word is captured and subjected to further analysis, such as classifying and counting words for a particular sentiment.

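Both steps are, for example, one-liners with NLTK (assuming nltk.download('punkt') has been run):

from nltk.tokenize import sent_tokenize, word_tokenize

text = "Text summarization shortens documents. It keeps the key sentences."
sentences = sent_tokenize(text)
words = word_tokenize(sentences[0])
print(sentences)  # ['Text summarization shortens documents.', 'It keeps the key sentences.']
print(words)      # ['Text', 'summarization', 'shortens', 'documents', '.']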
TEXTRANK ALGORITHM

• TextRank (2004) is an unsupervised graph-based ranking model for text processing, based on Google's PageRank algorithm.

• First the whole text is split into sentences; then the algorithm builds a graph where sentences are the nodes and word overlaps are the links.

• Finally, it identifies the most important nodes of the network, and those sentences form the summary.

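One way to run TextRank end to end is via the sumy library (whether the project used sumy or its own graph implementation is an assumption; sumy's English tokenizer also needs NLTK's punkt data):

from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.text_rank import TextRankSummarizer

text = ("Automatic summarization shortens long documents. "
        "Graph-based methods rank sentences by importance. "
        "TextRank builds a sentence graph from word overlap. "
        "The top-ranked sentences form the extractive summary.")

parser = PlaintextParser.from_string(text, Tokenizer("english"))
summarizer = TextRankSummarizer()
for sentence in summarizer(parser.document, 2):  # extract the top 2 sentences
    print(sentence)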
LEXRANK ALGORITHM

• LexRank is an unsupervised graph-based approach for text summarization in which sentences are scored using the graph method.

• The main idea is that sentences "recommend" other similar sentences to the reader.

• Example: This is an example of the article. This is the second example sentence. This is the third sentence, which is the most important because it says the other sentences are just examples.

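A minimal sketch with sumy's LexRank implementation, reusing the example above (using sumy here is an assumption, as with TextRank):

from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.lex_rank import LexRankSummarizer

text = ("This is an example of the article. This is the second example sentence. "
        "This is the third sentence, that is the most important because it says "
        "other sentences are just examples.")

parser = PlaintextParser.from_string(text, Tokenizer("english"))
for sentence in LexRankSummarizer()(parser.document, 1):  # the single most central sentence
    print(sentence)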
LEXRANK SCORES

• Cosine Similarity

• Adjacency Matrix

• Connectivity Matrix

• Eigenvector Centrality

These scores define the similarity matrix used by Classical and Continuous LexRank.

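A small sketch of how these four pieces combine (the TF-IDF sentence vectors and the 0.1 threshold are illustrative assumptions; classical LexRank thresholds the similarities into a connectivity matrix, while continuous LexRank keeps the raw weights):

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

sentences = ["the cat sat on the mat",
             "the cat lay on the rug",
             "stocks fell sharply today"]

tfidf = TfidfVectorizer().fit_transform(sentences)
sim = cosine_similarity(tfidf)             # cosine similarity matrix
adj = (sim > 0.1).astype(float)            # thresholded adjacency/connectivity matrix
transition = adj / adj.sum(axis=1, keepdims=True)  # row-stochastic matrix

# Power iteration for the principal eigenvector (eigenvector centrality).
scores = np.ones(len(sentences)) / len(sentences)
for _ in range(100):
    scores = transition.T @ scores
print(scores)  # higher score = more central, more "recommended" sentence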
LEXRANK VS TEXTRANK

LexRank:
• In addition to the PageRank approach, it uses similarity metrics.
• Considers the position and length of sentences.
• Used for multi-document summarization.

TextRank:
• Uses the typical PageRank approach.
• Does not consider any such parameters.
• Used for single-document summarization.
ROUGE

• ROUGE stands for Recall-Oriented Understudy for Gisting Evaluation.

• The ROUGE metric is used for measuring the performance of automatic summarization and machine translation tasks.

• ROUGE-N measures the number of matching n-grams between our model-generated text and a 'reference'.

• An n-gram is simply a grouping of tokens/words.

• A unigram (1-gram) consists of a single word; a bigram (2-gram) consists of two consecutive words, and so on.
RECALL, PRECISION & F1 SCORE

Recall:
• Ensures our model is capturing all of the information contained in the reference.
• Recall counts the number of overlapping n-grams found in both the model output and the reference, then divides this number by the total number of n-grams in the reference.

Precision:
• Precision is calculated in almost exactly the same way, but rather than dividing by the reference n-gram count, we divide by the model n-gram count.

F1 Score:
• The F1 Score is (2 * Recall * Precision) divided by (Recall + Precision).
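As a worked sketch, ROUGE-N can be computed by hand from n-gram counts (real evaluations would typically use a package such as rouge-score; this bare-bones version is only for illustration):

from collections import Counter

def rouge_n(candidate: str, reference: str, n: int = 1):
    def ngrams(text):
        tokens = text.lower().split()
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    cand, ref = ngrams(candidate), ngrams(reference)
    overlap = sum((cand & ref).values())              # matching n-grams
    recall = overlap / max(sum(ref.values()), 1)      # divide by reference n-gram count
    precision = overlap / max(sum(cand.values()), 1)  # divide by model n-gram count
    f1 = 2 * recall * precision / max(recall + precision, 1e-12)
    return recall, precision, f1

print(rouge_n("the cat sat on the mat", "the cat was on the mat"))
# -> (0.833..., 0.833..., 0.833...): 5 of 6 unigrams overlap on each side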
ROUGE FOR TEXTRANK

• R – Recall

• P – Precision

• F – F1 Score

ROUGE FOR LEXRANK

• R – Recall

• P – Precision

• F – F1 Score

LSA (LATENT SEMANTIC ANALYSIS)

• LSA (Latent Semantic Analysis), also known as LSI (Latent Semantic Indexing), uses the Bag-of-Words (BoW) model, which results in a term-document matrix (the occurrence of terms in each document).

• Rows represent terms and columns represent documents. LSA learns latent topics by performing a matrix decomposition on the term-document matrix using Singular Value Decomposition (SVD).

• LSA is typically used as a dimension-reduction or noise-reduction technique.

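To make the decomposition concrete, here is a toy example on a made-up 4-term by 4-document count matrix (the values are purely illustrative):

import numpy as np

# Rows = terms, columns = documents (Bag-of-Words counts).
A = np.array([[2, 0, 1, 0],   # "market"
              [1, 0, 2, 0],   # "stocks"
              [0, 3, 0, 1],   # "match"
              [0, 1, 0, 2]])  # "goal"

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2                                     # keep the 2 strongest latent topics
doc_topics = (np.diag(s[:k]) @ Vt[:k]).T  # each document as a point in 2-D topic space
print(doc_topics)
# Documents 0 and 2 (finance terms) land close together, as do 1 and 3 (sports).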
TEXT CLASSIFICATION VS TOPIC MODELING

• Text classification is a supervised machine learning problem, where a text document or article is classified into a pre-defined set of classes. Topic modeling is the process of discovering groups of co-occurring words in text documents.

• Topic modeling can be used to solve the text classification problem: topic modeling identifies the topics present in a document, while text classification assigns the text to a single class.

Source: https://www.datacamp.com/tutorial/discovering-hidden-topics-python
IMPLEMENTING LSA

• Loading the original data into a data frame
• Pre-processing
• Document clustering

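A plausible sketch of this pipeline with scikit-learn (load_bbc_articles is a hypothetical helper standing in for the project's data loading; the parameter values are assumptions, chosen to match the 200 dimensions discussed below):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.manifold import TSNE

docs = load_bbc_articles()  # hypothetical helper returning a list of raw article texts

X = TfidfVectorizer(stop_words='english', max_features=10000).fit_transform(docs)
X_lsa = TruncatedSVD(n_components=200, random_state=42).fit_transform(X)  # LSA step
X_2d = TSNE(n_components=2, random_state=42).fit_transform(X_lsa)         # 2-D for plotting
# X_2d can now be scattered and colored by category (business, sport, tech, ...).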
Here, only the tech-related news articles appear to have a wider spread, whereas the other news articles are nicely clustered.

This also suggests that LSA (or Truncated SVD) has done a good job of extracting 200 important dimensions from the textual data to segregate news articles by topic. Note that t-SNE is non-deterministic, so multiple runs will produce different representations, although the overall structure is likely to remain similar, if not the same.
LDA (LATENT DIRICHLET ALLOCATION)

• LDA is a generative probabilistic model that assumes each topic is a mixture over an underlying set of words, and each document is a mixture over a set of topics.

IMPLEMENTING LDA

• Loading data
• Data cleaning
• EDA
• Preparing data for LDA analysis
• LDA model training
• Analyzing LDA model results

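A minimal sketch of these steps with gensim (the toy corpus, topic count, and training passes are illustrative assumptions, not the project's settings):

from gensim import corpora
from gensim.models import LdaModel

# Pre-cleaned, tokenized documents (stand-ins for the NIPS papers).
docs = [["neural", "network", "training"],
        ["bayesian", "inference", "model"],
        ["neural", "model", "inference"]]

dictionary = corpora.Dictionary(docs)               # word <-> id mapping
corpus = [dictionary.doc2bow(doc) for doc in docs]  # bag-of-words vectors

lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2,
               passes=10, random_state=42)
for topic_id, words in lda.print_topics():
    print(topic_id, words)

The visualizations listed in the conclusion (intertopic distance map, marginal topic distribution, term frequencies) can then be generated with pyLDAvis, e.g. pyLDAvis.gensim_models.prepare(lda, corpus, dictionary) in recent pyLDAvis versions.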
LSA VS LDA

• LSA and LDA take the same input: a Bag-of-Words matrix. LSA focuses on reducing the matrix dimension, while LDA solves the topic modeling problem.

• Both LDA and LSA are unsupervised.

CONCLUSION

• We performed text cleaning and preprocessing, then applied word and sentence tokenization to produce summaries, although the initial summaries were too long. We then implemented LexRank and TextRank on the dataset and obtained F1 measures. Comparing the results, we conclude that LexRank gives an F1 score of 98%. We also implemented LSA and LDA on different datasets (BBC News and NIPS papers).
• For LSA, we generated EDA and document clustering, and produced query-driven summarization output.
• For LDA, we created a word cloud and LDA visualizations: an intertopic distance map (via multidimensional scaling), the marginal topic distribution, overall term frequency, and the estimated term frequency within the selected topic.
• Both LDA and LSA were used for topic summarization, depending on the dataset.
REFERENCES

• [1] Chuang, J., Manning, C. D., & Heer, J. (2012). Termite: Visualization Techniques for Assessing Textual Topic Models. Advanced Visual Interfaces.
• [2] Sievert, C., & Shirley, K. (2014). LDAvis: A method for visualizing and interpreting topics. In Proceedings of the Workshop on Interactive Language Learning, Visualization, and Interfaces (pp. 63-70). Baltimore, Maryland, USA: Association for Computational Linguistics.
• [3] Vishnu Preethi, K., & Vijaya, M. S. (2018). Text Summarizers for Education News Articles. Invention Journals, 16 April 2018. https://www.ijesi.org/v7i4(version-2).html
• [4] Alomari, A., Idris, N., Sabri, A. Q. M., & Alsmadi, I. (2022). Deep reinforcement and transfer learning for abstractive text
summarization: A review. Computer Speech & Language, 71, 101276.
• [5] Wazery, Y. M., Saleh, M. E., Alharbi, A., & Ali, A. A. (2022). Abstractive Arabic Text Summarization Based on Deep Learning.
Computational Intelligence and Neuroscience, 2022.
• [6] Laskar, M. T. R., Hoque, E., & Huang, J. X. (2022). Domain Adaptation with Pre-trained Transformers for Query-Focused
Abstractive Text Summarization. Computational Linguistics, 48(2), 279-320.
• [7] Suleiman, D., & Awajan, A. (2022). Multilayer encoder and single-layer decoder for abstractive Arabic text summarization.
Knowledge-Based Systems, 237, 107791.
• [8] Ertam, F., & Aydin, G. (2022). Abstractive text summarization using deep learning with a new Turkish summarization
benchmark dataset. Concurrency and Computation: Practice and Experience, 34(9), e6482.

• [9] Aliakbarpour, H., Manzuri, M. T., & Rahmani, A. M. (2022). Improving the readability and saliency of abstractive text
summarization using combination of deep neural networks equipped with auxiliary attention mechanism. The Journal of
Supercomputing, 78(2), 2528-2555.
• [10] Aggarwal, C. C. (2022). Text summarization. In Machine Learning for Text (pp. 393-418). Springer, Cham.
• [11] Gupta, A., Chugh, D., & Katarya, R. (2022). Automated news summarization using transformers. In Sustainable
Advanced Computing (pp. 249-259). Springer, Singapore.
• [12] Khurana, A., & Bhatnagar, V. (2022). Investigating entropy for extractive document summarization. Expert Systems
with Applications, 187, 115820.
• [13] Zhong, M., Liu, Y., Xu, Y., Zhu, C., & Zeng, M. (2022, June). Dialoglm: Pre-trained model for long dialogue
understanding and summarization. In Proceedings of the AAAI Conference on Artificial Intelligence (Vol. 36, No. 10, pp.
11765-11773).
• [14] Ma, C., Zhang, W. E., Guo, M., Wang, H., & Sheng, Q. Z. (2022). Multi-document summarization via deep learning
techniques: A survey. ACM Computing Surveys, 55(5), 1-37.
• [15] Moro, G., & Ragazzi, L. (2022, February). Semantic Self-segmentation for Abstractive Summarization of Long Legal
Documents in Low-resource Regimes. In Proceedings of the Thirty-Sixth AAAI Conference on Artificial Intelligence,
Virtual (Vol. 22).
• [16] Patil, P., Rao, C., Reddy, G., Ram, R., & Meena, S. M. (2022). Extractive Text Summarization Using BERT. In
Proceedings of the 2nd International Conference on Recent Trends in Machine Learning, IoT, Smart Cities and
Applications (pp. 741-747). Springer, Singapore.
• [17] Mohan, G. B., & Kumar, R. P. (2022). A Comprehensive Survey on Topic Modeling in Text Summarization. Micro-Electronics and Telecommunication Engineering, 231-240.
FUTURE WORK

• In the future, we can explore other techniques to improve the summaries, such as Natural Language Understanding, Natural Language Generation, Multi-Document Summarization, Personalized Summarization, Cross-Lingual Summarization, and Visual Summarization. These are just a few of the many potential directions for future research and development in text summarization.

THANK YOU

