Data Redundancy Using LSTM
Mini-Project Report
Sixth Semester
2018-2019
RAMAIAH INSTITUTE OF TECHNOLOGY
(Autonomous Institute Affiliated to VTU)
VIDYA SOUDHA
M. S. Ramaiah Nagar, M. S. R. I. T. Post, Bangalore – 560054
CERTIFICATE
This is to certify that the project work entitled “Data Redundancy using MaLSTM” is a bonafide work carried out by Rishu Verma, Sachin, and Piyush Kumar, bearing USNs 1MS16IS074, 1MS16IS080, and 1MS16IS054, in partial fulfillment of the requirements of the Mini-Project course of Sixth Semester B.E. It is certified that all corrections/suggestions indicated for internal assessment have been incorporated in the report. The project has been approved as it satisfies the academic requirements in respect of the project work prescribed by the above said course.
Name of the Examiners:                         Signature:
1. _________________________                   _________________________
2. _________________________                   _________________________
Acknowledgements
All sentences or passages quoted in this report from other people's work
have been specifically acknowledged by clear cross-referencing to author,
work and page(s). Any illustrations which are not the work of the author
of this report have been used (where possible) with the explicit
permission of the originator and are specifically acknowledged. I
understand that failure to do this amounts to plagiarism and will be
considered grounds for failure in this project and the degree examination
as a whole.
Abstract
Chapter-1
INTRODUCTION
An LSTM cell receives three inputs at each time step:
1. The previous cell state (i.e. the memory carried over from the previous cell)
2. The previous hidden state (i.e. this is the same as the output of the previous cell)
3. The input at the current time step (i.e. the new information that is being fed in at that moment)
fig 1: Expanded RNN
Fig 1 shows the RNN expanded (unrolled) through time for ease of understanding. A Recurrent Neural Network can be represented as a simple neural network repeated over a time sequence, where the output of the RNN at any time step serves as an input at the next time step.
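This recurrence can be sketched in a few lines of NumPy (an illustrative toy, not the project's code; the dimensions and random weights below are assumptions):

import numpy as np

np.random.seed(0)
input_dim, hidden_dim = 4, 3                    # assumed toy dimensions
W_xh = np.random.randn(hidden_dim, input_dim)   # input-to-hidden weights
W_hh = np.random.randn(hidden_dim, hidden_dim)  # hidden-to-hidden weights
b_h = np.zeros(hidden_dim)

def rnn_step(x_t, h_prev):
    # The hidden output at time t is fed back in as input at time t+1
    return np.tanh(np.dot(W_xh, x_t) + np.dot(W_hh, h_prev) + b_h)

h = np.zeros(hidden_dim)
for x_t in np.random.randn(5, input_dim):       # a 5-step input sequence
    h = rnn_step(x_t, h)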
Structure of WordNet:
1.1 Motivation
Most study material is available on the Internet. One can learn about any topic using MOOCs (massive open online courses), but most of the time the information about a topic is repeated across sources. This makes it difficult for a person to search further about that topic. We wanted to build a model which can detect similarity between sentences and remove the redundant data from a document, thus giving the reader concise information.
1.2 Scope
1.3 Objectives
• Reduce multiple related documents to a single non-trivial document.
• Improve accuracy to reduce data loss.
• Perform accurately over different genres of documents.
fig 3: Embedding matrix
Word Embeddings:
Consider the following similar sentences: ‘Have a good day’ and ‘Have a great day’. They hardly differ in meaning. If we construct an exhaustive vocabulary (let's call it V), it would be V = {Have, a, good, great, day}.
Now, let us create a one-hot encoded vector for each of these words in V. The length of each one-hot encoded vector would be equal to the size of V (= 5). Each vector would be all zeros except for a one at the index representing the corresponding word in the vocabulary. The encodings below illustrate this (` represents transpose):
Have = [1,0,0,0,0]`; a = [0,1,0,0,0]`; good = [0,0,1,0,0]`; great = [0,0,0,1,0]`; day = [0,0,0,0,1]`
If we try to visualize these encodings, we can think of a 5-dimensional space, where each word occupies one of the dimensions and has nothing to do with the rest (no projection along the other dimensions). This means ‘good’ and ‘great’ are as different as ‘day’ and ‘have’, which is not true.
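A minimal sketch of this one-hot construction (illustrative only, not the project's code):

import numpy as np

vocab = ["Have", "a", "good", "great", "day"]
word_to_index = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    # All zeros except a single one at the word's index in the vocabulary
    vec = np.zeros(len(vocab))
    vec[word_to_index[word]] = 1.0
    return vec

print(one_hot("good"))   # [ 0.  0.  1.  0.  0.]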
Our objective is to have words with similar context occupy close spatial positions. Mathematically, the cosine of the angle between such vectors should be close to 1, i.e. the angle should be close to 0.
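This cosine measure is just the normalized dot product of the two vectors; a short illustrative helper (the example vectors below are assumptions, not real embeddings):

import numpy as np

def cosine_similarity(u, v):
    # cos(theta) = (u . v) / (||u|| * ||v||); near 1 means similar context
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

good = np.array([0.8, 0.1, 0.2])    # hypothetical embedding for 'good'
great = np.array([0.7, 0.2, 0.2])   # hypothetical embedding for 'great'
print(cosine_similarity(good, great))  # close to 1 for similar words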
Chapter-2
Literature Survey
[1] It discusses a unified MOOC model to provide information. The model facilitates the exploitation of the experiences produced by the interactions of the pedagogical actors. The aim is to make a unified analysis of the massive data generated by learning actors.
[4] A method for encoding sentences into embedding vectors that specifically targets transfer learning to other NLP tasks. The models are efficient and achieve accurate performance on diverse transfer tasks. Two variants of the encoding models allow for trade-offs between accuracy and compute resources.
Vanishing and Exploding Gradients:
fig 5a, fig 5b: Vanishing and exploding gradients
Exploding gradients occur when large error gradients accumulate during training. These in turn result in large updates to the network weights and, in turn, an unstable network. At an extreme, the values of the weights can become so large as to overflow and result in NaN values. The explosion occurs through exponential growth, by repeatedly multiplying gradients through network layers whose values are larger than 1.0.
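A toy illustration of this exponential growth (the per-layer factor is an assumed value, for intuition only):

# A gradient repeatedly multiplied by a factor > 1.0 grows exponentially
grad = 1.0
for layer in range(50):
    grad *= 1.5
print(grad)   # roughly 6.4e8 after 50 layers; with full weight matrices
              # this quickly overflows to inf/NaN values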
Specifications:
OS: Ubuntu
Language: Python 2.7.6
Dataset: Quora duplicate question dataset
Coding interface: Jupyter Notebook
Libraries used:
• keras
• pandas
• numpy
• nltk
• matplotlib
• seaborn
• scikit-learn
• gensim
Chapter-4
Release 1: the inputs were sentences and the output was a similarity score between them.
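A minimal sketch of such a Siamese MaLSTM similarity scorer in Keras (an illustrative reconstruction, not the project's exact code; the vocabulary size, embedding dimension, sequence length and number of LSTM units are assumptions):

import keras.backend as K
from keras.layers import Input, Embedding, LSTM, Lambda
from keras.models import Model

max_len, vocab_size, embed_dim = 20, 20000, 300   # assumed hyperparameters

left_in = Input(shape=(max_len,))
right_in = Input(shape=(max_len,))

# Both sentences share one embedding layer and one LSTM encoder
embed = Embedding(vocab_size, embed_dim, input_length=max_len)
encoder = LSTM(50)

left_vec = encoder(embed(left_in))
right_vec = encoder(embed(right_in))

# MaLSTM similarity: exp(-L1 distance) between the two sentence vectors,
# which always lies between 0 and 1
similarity = Lambda(
    lambda t: K.exp(-K.sum(K.abs(t[0] - t[1]), axis=1, keepdims=True))
)([left_vec, right_vec])

model = Model(inputs=[left_in, right_in], outputs=similarity)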
4.2 Sequence Diagram
Chapter-5
The model was tested on a sample from the Quora dataset. The optimizer used was the Adadelta optimizer. Gradient clipping was also used to avoid the exploding gradient problem.
Our model predicts values between 0 and 1. We can choose a threshold near 0.5 to decide whether two sentences are similar or not.
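In Keras this setup could look as follows (a hedged sketch; the clipping norm, the loss, and the test arrays left_test/right_test are assumptions, since the report does not state them):

from keras.optimizers import Adadelta

# Adadelta with gradient clipping, as described above (clipnorm value assumed)
model.compile(loss="mean_squared_error",
              optimizer=Adadelta(clipnorm=1.25),
              metrics=["accuracy"])

# Scores lie in [0, 1]; a threshold near 0.5 separates similar from dissimilar
scores = model.predict([left_test, right_test])
labels = (scores > 0.5).astype(int)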
fig 8: training phase
Fig 8 shows how the model was trained on the dataset.
Results and Discussion:
After training our model on the training set and validating it on the validation set, we got an accuracy greater than 80%. The convergence of our model is shown by the graph below.
Fig 8: Observations
The model was trained using an LSTM, but there are various other units for measuring sentence similarity, e.g. the GRU. These can be used in the future to compare accuracy.
A bigger dataset can help in increasing the accuracy of the model. A better optimizer can also help in giving better results on the data.
fig 9: Output
Fig 9 shows a sample output of the trained model. The interface takes a question number from the testing set as input and gives the predicted and actual scores as output.
The predicted score lies between 0 and 1; based on that, a threshold can be decided to predict whether the sentences are similar or not.
Chapter-6
References:
[1] Abdelladim Hadioui, Nour-eddine El Faddouli, Yassine Benjelloun Touimi, and Samir Bennani, "Machine Learning Based On Big Data Extraction of Massive Educational Knowledge".
[2] Palakorn Achananuparp, Xiaohua Hu, and Shen Xiajiong, "The Evaluation of Sentence Similarity Measures".
[5] https://en.wikipedia.org/wiki/Long_short-term_memory
[6] https://keras.io/