Data Redundancy Using LSTM
Mini-Project Report
Sixth Semester
2018-2019
RAMAIAH INSTITUTE OF TECHNOLOGY
(Autonomous Institute Affiliated to VTU)
VIDYA SOUDHA
M. S. Ramaiah Nagar, M. S. R. I. T. Post, Bangalore – 560054
CERTIFICATE
This is to certify that the project work entitled “Data Redundancy using MaLSTM” is a bonafide work carried out by Rishu Verma, Sachin, and Piyush Kumar, bearing USNs 1MS16IS074, 1MS16IS080, and 1MS16IS054, in partial fulfillment of the requirements of the Mini-Project course of Sixth Semester B.E. It is certified that all corrections/suggestions indicated for internal assessment have been incorporated in the report. The project has been approved as it satisfies the academic requirements in respect of the project work prescribed by the above said course.
Name of the Examiners:                         Signature:
1. _________________________                   _________________________
2. _________________________                   _________________________
Acknowledgements
All sentences or passages quoted in this report from other people's work
have been specifically acknowledged by clear cross-referencing to author,
work and page(s). Any illustrations which are not the work of the author
of this report have been used (where possible) with the explicit
permission of the originator and are specifically acknowledged. I
understand that failure to do this amounts to plagiarism and will be
considered grounds for failure in this project and the degree examination
as a whole.
Abstract
Chapter-1
INTRODUCTION
An LSTM cell receives three inputs at each time step:
1. The previous cell state (i.e. the memory carried over from the previous cell)
2. The previous hidden state (i.e. this is the same as the output of the previous cell)
3. The input at the current time step (i.e. the new information that is being fed in at that moment)
fig 1: Expanded RNN
Fig 1 shows the RNN expanded (unrolled) through time for ease of understanding. A Recurrent Neural Network can be represented as a simple neural network repeated over a time sequence, where the output of the RNN at any time step serves as an input at the next time step.
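This recurrence can be sketched in a few lines of NumPy (an illustrative toy, not the project's code; the dimensions and random weights below are assumptions):

import numpy as np

np.random.seed(0)
input_dim, hidden_dim = 4, 3                    # assumed toy dimensions
W_xh = np.random.randn(hidden_dim, input_dim)   # input-to-hidden weights
W_hh = np.random.randn(hidden_dim, hidden_dim)  # hidden-to-hidden weights
b_h = np.zeros(hidden_dim)

def rnn_step(x_t, h_prev):
    # The hidden output at time t is fed back in as input at time t+1
    return np.tanh(np.dot(W_xh, x_t) + np.dot(W_hh, h_prev) + b_h)

h = np.zeros(hidden_dim)
for x_t in np.random.randn(5, input_dim):       # a 5-step input sequence
    h = rnn_step(x_t, h)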
Structure of WordNet:
1.1 Motivation
Most study material is available on the Internet. One can learn about any topic using MOOCs (massive open online courses), but most of the time the information about a topic is repeated across sources. This makes it difficult for a person to search further about that topic. We wanted to build a model which can detect similarity between sentences and remove the redundant data from a document, thus giving the reader concise information.
1.2 Scope
1.3 Objectives
• Reduce multiple related documents to a single non-trivial document.
• Improve accuracy to reduce data loss.
• Perform accurately over different genres of documents.
fig 3: Embedding matrix
Word Embeddings:
Consider the following similar sentences: ‘Have a good day’ and ‘Have a great day’. They hardly differ in meaning. If we construct an exhaustive vocabulary (let's call it V), it would be V = {Have, a, good, great, day}.
Now, let us create a one-hot encoded vector for each of these words in V. The length of each one-hot encoded vector would be equal to the size of V (= 5). Each vector would be all zeros except for a one at the index representing the corresponding word in the vocabulary. The encodings below illustrate this (` represents transpose):
Have = [1,0,0,0,0]`; a = [0,1,0,0,0]`; good = [0,0,1,0,0]`; great = [0,0,0,1,0]`; day = [0,0,0,0,1]`
If we try to visualize these encodings, we can think of a 5-dimensional space, where each word occupies one of the dimensions and has nothing to do with the rest (no projection along the other dimensions). This means ‘good’ and ‘great’ are as different as ‘day’ and ‘have’, which is not true.
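A minimal sketch of this one-hot construction (illustrative only, not the project's code):

import numpy as np

vocab = ["Have", "a", "good", "great", "day"]
word_to_index = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    # All zeros except a single one at the word's index in the vocabulary
    vec = np.zeros(len(vocab))
    vec[word_to_index[word]] = 1.0
    return vec

print(one_hot("good"))   # [ 0.  0.  1.  0.  0.]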
Our objective is to have words with similar context occupy close spatial positions. Mathematically, the cosine of the angle between such vectors should be close to 1, i.e. the angle should be close to 0.
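This cosine measure is just the normalized dot product of the two vectors; a short illustrative helper (the example vectors below are assumptions, not real embeddings):

import numpy as np

def cosine_similarity(u, v):
    # cos(theta) = (u . v) / (||u|| * ||v||); near 1 means similar context
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

good = np.array([0.8, 0.1, 0.2])    # hypothetical embedding for 'good'
great = np.array([0.7, 0.2, 0.2])   # hypothetical embedding for 'great'
print(cosine_similarity(good, great))  # close to 1 for similar words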
Chapter-2
Literature Survey
[1] It discusses a unified MOOC model to provide information. The model facilitates the exploitation of the experiences produced by the interactions of the pedagogical actors. The aim is to make a unified analysis of the massive data generated by learning actors.
[4] A method for encoding sentences into embedding vectors that specifically targets transfer learning to other NLP tasks. The models are efficient and achieve accurate performance on diverse transfer tasks. Two variants of the encoding models allow for trade-offs between accuracy and compute resources.
Vanishing and Exploding Gradients:
fig 5a, fig 5b: Vanishing and exploding gradients
Exploding gradients occur when large error gradients accumulate during training. These in turn result in large updates to the network weights and, in turn, an unstable network. At an extreme, the values of the weights can become so large as to overflow and result in NaN values. The explosion occurs through exponential growth, by repeatedly multiplying gradients through network layers whose values are larger than 1.0.
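A toy illustration of this exponential growth (the per-layer factor is an assumed value, for intuition only):

# A gradient repeatedly multiplied by a factor > 1.0 grows exponentially
grad = 1.0
for layer in range(50):
    grad *= 1.5
print(grad)   # roughly 6.4e8 after 50 layers; with full weight matrices
              # this quickly overflows to inf/NaN values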
Specifications:
OS: Ubuntu
Language: Python 2.7.6
Dataset: Quora duplicate question dataset
Coding interface: Jupyter Notebook
Libraries used:
• keras
• pandas
• numpy
• nltk
• matplotlib
• seaborn
• scikit-learn
• gensim
Chapter-4
Release 1: the inputs were sentences and the output was a similarity score between them.
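A minimal sketch of such a Siamese MaLSTM similarity scorer in Keras (an illustrative reconstruction, not the project's exact code; the vocabulary size, embedding dimension, sequence length and number of LSTM units are assumptions):

import keras.backend as K
from keras.layers import Input, Embedding, LSTM, Lambda
from keras.models import Model

max_len, vocab_size, embed_dim = 20, 20000, 300   # assumed hyperparameters

left_in = Input(shape=(max_len,))
right_in = Input(shape=(max_len,))

# Both sentences share one embedding layer and one LSTM encoder
embed = Embedding(vocab_size, embed_dim, input_length=max_len)
encoder = LSTM(50)

left_vec = encoder(embed(left_in))
right_vec = encoder(embed(right_in))

# MaLSTM similarity: exp(-L1 distance) between the two sentence vectors,
# which always lies between 0 and 1
similarity = Lambda(
    lambda t: K.exp(-K.sum(K.abs(t[0] - t[1]), axis=1, keepdims=True))
)([left_vec, right_vec])

model = Model(inputs=[left_in, right_in], outputs=similarity)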
4.2 Sequence Diagram
Chapter-5
The model was tested on a sample from the Quora dataset. The optimizer used was the Adadelta optimizer. Gradient clipping was also used to avoid the exploding gradient problem.
Our model predicts values between 0 and 1. We can choose a threshold near 0.5 to decide whether two sentences are similar or not.
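In Keras this setup could look as follows (a hedged sketch; the clipping norm, the loss, and the test arrays left_test/right_test are assumptions, since the report does not state them):

from keras.optimizers import Adadelta

# Adadelta with gradient clipping, as described above (clipnorm value assumed)
model.compile(loss="mean_squared_error",
              optimizer=Adadelta(clipnorm=1.25),
              metrics=["accuracy"])

# Scores lie in [0, 1]; a threshold near 0.5 separates similar from dissimilar
scores = model.predict([left_test, right_test])
labels = (scores > 0.5).astype(int)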
fig 8: training phase
Fig 8 shows how the model was trained on the dataset.
Results and Discussion:
After training our model on the training set and validating it on the validation set, we got an accuracy greater than 80%. The convergence of our model is shown by the graph below.
Fig 8: Observations
The model was trained using an LSTM, but there are various other units for measuring sentence similarity, e.g. the GRU. These can be used in the future to compare accuracy.
A bigger dataset can help in increasing the accuracy of the model. A better optimizer can also help in giving better results on the data.
fig 9: Output
Fig 9 shows a sample output of the trained model. The interface takes a question number from the testing set as input and gives the predicted and actual scores as output.
The predicted score lies between 0 and 1; based on that, a threshold can be decided to predict whether the sentences are similar or not.
Chapter-6
References:
[1] Abdelladim Hadioui, Nour-eddine El Faddouli, Yassine Benjelloun Touimi, and Samir Bennani, "Machine Learning Based On Big Data Extraction of Massive Educational Knowledge".
[2] Palakorn Achananuparp, Xiaohua Hu, and Shen Xiajiong, "The Evaluation of Sentence Similarity Measures".
[5] https://en.wikipedia.org/wiki/Long_short-term_memory
[6] https://keras.io/