Sentiment Analysis of Movie Reviews Using LSTM

In previous chapters, we looked at neural network architectures such as the basic MLP and feedforward neural networks for classification and regression tasks. We then looked at CNNs, and we saw how they are used for image recognition tasks. In this chapter, we will turn our attention to recurrent neural networks (RNNs), in particular to long short-term memory (LSTM) networks, and how they can be used in sequential problems, such as Natural Language Processing (NLP). We will develop and train an LSTM network to predict the sentiment of movie reviews on IMDb.

In this chapter, we'll cover the following topics:

+ Sequential problems in machine learning
+ NLP and sentiment analysis
+ Introduction to RNNs and LSTM networks
+ Analysis of the IMDb movie reviews dataset
+ Word embeddings
+ A step-by-step guide to building and training an LSTM network in Keras
+ Analysis of our results

Technical requirements

The Python libraries required for this chapter are as follows:

+ matplotlib 3.0.2
+ Keras 2.2.4
+ seaborn 0.9.0
+ scikit-learn 0.20.2

The code for this chapter can be found in the GitHub repository for the book. To download the code onto your computer, you may run the following git clone command:

$ git clone https://github.com/PacktPublishing/Neural-Network-Projects-with-Python.git

After the process is complete, there will be a folder entitled Neural-Network-Projects-with-Python. Enter the folder by running the following:

$ cd Neural-Network-Projects-with-Python

To install the required Python libraries in a virtual environment, run the following command:

$ conda env create -f environment.yml

Note that you should have installed Anaconda on your computer first, before running this command. To enter the virtual environment, run the following command:

$ conda activate neural-network-projects-python

Navigate to the Chapter06 folder by running the following command:

$ cd Chapter06

The following file is located in the folder:

+ lstm.py: This is the main code for this chapter

To run the code, simply execute the lstm.py file:

$ python lstm.py

Sequential problems in machine learning

Sequential problems are a class of problem in machine learning in which the order of the features presented to the model is important for making predictions. Sequential problems are commonly encountered in the following scenarios:

+ NLP, including sentiment analysis, language translation, and text prediction
+ Time series predictions

For example, let's consider the text prediction problem, which falls under NLP, as shown in the following sentence:

"I WAS BORN IN PARIS BUT I GREW UP IN TOKYO. THEREFORE, I SPEAK FLUENT ____"

Human beings have an innate ability for this, and it is trivial for us to know that the word in the blank is probably the word Japanese. The reason for this is that as we read the sentence, we process the words as a sequence. The sequence of the words captures the information required to make the prediction. By contrast, if we discard the sequential information and only consider the words individually, we get a bag of words, as shown in the following diagram:

(Diagram: the same words scattered in no particular order: grew, in, was, Tokyo, born, up, but, I, Paris, in.)

We can see that our ability to predict the word in the blank is now severely impacted. Without knowing the sequence of words, it is impossible to predict the word in the blank.
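As a tiny illustration (this snippet is ours, not part of the chapter's code), two sentences with opposite meanings can produce exactly the same bag of words once the order is discarded:

from collections import Counter

sentence_1 = "the movie was not good it was bad"
sentence_2 = "the movie was not bad it was good"

# With word order thrown away, the two bags of words are identical,
# even though the sentences mean opposite things.
print(Counter(sentence_1.split()) == Counter(sentence_2.split()))   # True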
Besides text prediction, sentiment analysis and language translation are also sequential problems. In fact, many NLP problems are sequential problems, because the languages that we speak are sequential in nature, and the sequence conveys context and other subtle nuances.

Sequential problems also occur naturally in time series problems. Time series problems are common in stock markets. Often, we wish to know whether a particular stock will rise or fall on a certain day. This problem is accurately defined as a time series problem, because knowing the movement of the stock in the preceding hours or minutes is often crucial to predicting whether it will rise or fall. Today, machine learning methods are being heavily applied in this domain, with algorithmic trading strategies driving the buying and selling of stocks.

In this chapter, we will focus on NLP problems. In particular, we will create a neural network for sentiment analysis.

NLP and sentiment analysis

NLP is a subfield of artificial intelligence (AI) that is concerned with the interaction of computers and human languages. As early as the 1950s, scientists were interested in designing intelligent machines that could understand human languages. Early efforts to create a language translator focused on the rule-based approach, where a group of linguistic experts handcrafted a set of rules to be encoded in machines. However, this rule-based approach produced results that were sub-optimal, and, often, it was impossible to convert these rules from one language to another, which meant that scaling up was difficult. For many decades, not much progress was made in NLP, and human language was a goal that AI couldn't reach, until the resurgence of deep learning.

With the proliferation of deep learning and neural networks in the image classification domain, scientists began to wonder whether the powers of neural networks could be applied to NLP. In the late 2000s, tech giants, including Apple, Amazon, and Google, applied LSTM networks to NLP problems, and the results were astonishing. The ability of AI assistants such as Siri and Alexa to understand multiple languages spoken in different accents is the result of deep learning and LSTM networks. In recent years, we have also seen a massive improvement in the abilities of text translation software, such as Google Translate, which is capable of producing translations as good as those of human language experts.

Sentiment analysis is also an area of NLP that benefited from the resurgence of deep learning. Sentiment analysis is defined as the prediction of the positivity of a text. Most sentiment analysis problems are classification problems (positive/neutral/negative) and not regression problems.

There are many practical applications of sentiment analysis. For example, modern customer service centers use sentiment analysis to predict the satisfaction of customers through the reviews they provide on platforms such as Yelp or Facebook. This allows businesses to step in immediately whenever customers are dissatisfied, allowing the problem to be addressed as soon as possible, and preventing customer churn. Sentiment analysis has also been applied in the domain of stock trading. In 2010, scientists showed that by sampling the sentiment on Twitter (positive versus negative tweets), we can predict whether the stock market will rise.
Similarly, high-frequency trading firms use sentiment analysis to sample the sentiment of news related to certain companies, and execute trades automatically, based on the positivity of the news.

Why sentiment analysis is difficult

Early efforts in sentiment analysis faced many hurdles, due to the presence of subtle nuances in human languages. The same word can often convey a different meaning, depending on the context. Take, for example, the following two sentences:

"THE BUILDING IS ON FIRE"

"I AM ON FIRE TODAY!"

We know that the sentiment of the first sentence is negative, as it probably means that the building is literally on fire. On the other hand, we know that the sentiment of the second sentence is positive, since it is unlikely that the person is literally on fire. Instead, it probably means that the person is on a hot streak, and this is positive. The rule-based approach toward sentiment analysis suffers because of these subtle nuances, and it is incredibly complex to encode this knowledge in a rule-based manner.

Another reason sentiment analysis is difficult is because of sarcasm. Sarcasm is commonly used in many cultures, especially in an online medium. Sarcasm is difficult for computers to understand. In fact, even humans fail to detect sarcasm at times. Take, for example, the following sentence:

"THANKS FOR LOSING MY LUGGAGE! WHAT A WAY TO TREAT A LOYAL CUSTOMER"

You can probably detect the sarcasm in the preceding sentence, and come to the conclusion that the sentiment is negative. However, it is not easy for a program to understand that. In the next section, we will look at RNNs and LSTM networks, and how they have been used to tackle sentiment analysis.

RNNs

Up until now, we have used neural networks such as the MLP, the feedforward neural network, and the CNN in our projects. The constraint faced by these neural networks is that they only accept a fixed input vector, such as an image, and output another vector. The high-level architecture of these neural networks can be summarized by the following diagram:

(Diagram: a fixed-size input vector fed into the neural network, which produces a fixed-size output vector.)

This restrictive architecture makes it difficult for such networks to work with sequential data. To work with sequential data, the neural network needs to take in specific bits of the data at each time step, in the sequence in which they appear. This provides the idea for an RNN. An RNN has a high-level architecture, as shown in the following diagram:

(Diagram: the raw input split into time steps, with one RNN layer per time step; each layer passes its hidden state to the next layer, and the final layer produces the output.)

From the previous diagram, we can see that an RNN is a multi-layered neural network. We can break up the raw input, splitting it into time steps. For example, if the raw input is a sentence, we can break up the sentence into individual words (in this case, every word represents a time step). Each word will then be provided to the corresponding layer in the RNN as input. More importantly, each layer in an RNN passes its output to the next layer. The intermediate output passed from layer to layer is known as the hidden state. Essentially, the hidden state allows an RNN to maintain a memory of the intermediate states of the sequential data.

What's inside an RNN?

Let's now take a closer look at what goes on inside each layer of an RNN. The following diagram depicts the mathematical function inside each layer of an RNN:

(Diagram: a single RNN layer at time step t, taking the input from time step t and the hidden state from time step t-1, and producing a hidden state that is passed to the next layer.)
The mathematical function of an RNN is simple. Each layer t within an RNN has two inputs:

+ The input from time step t
+ The hidden state passed from the previous layer, t-1

Each layer in an RNN simply sums up the two inputs and applies a tanh function to the sum. It then outputs the result, to be passed as a hidden state to the next layer. It's that simple! More formally, the output hidden state of layer t is this:

h_t = tanh(x_t + h_{t-1})

But what exactly is the tanh function? The tanh function is the hyperbolic tangent function, and it simply squashes a value to between 1 and -1. The following graph illustrates this:

(Graph: the tanh(x) curve, which rises smoothly from -1 to 1 as x goes from negative to positive values.)

The tanh function is a good choice as a non-linear transformation of the combination of the current input and the previous hidden state, because it ensures that the weights don't diverge too rapidly. It also has other nice mathematical properties, such as being easily differentiable.

Finally, to get the final output from the last layer in the RNN, we simply apply a sigmoid function to it:

y = sigmoid(h_n)

In the previous equation, n is the index of the last layer in the RNN. Recall from previous chapters that the sigmoid function produces an output between 0 and 1, hence providing the probabilities for each class as a prediction. We can see that if we stack these layers together, the final output from an RNN depends on the non-linear combination of the inputs at different time steps.
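To make this concrete, here is a minimal NumPy sketch of the forward pass described above (this snippet is ours, not from the chapter). It follows the simplified equations literally; note that a practical RNN also multiplies each term by learned weight matrices and adds a bias before applying tanh, which Keras handles for us:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def simple_rnn_forward(inputs):
    # inputs: a list of vectors x_1, ..., x_n, one per time step
    h = np.zeros_like(inputs[0])       # initial hidden state
    for x_t in inputs:
        # The simplified update: sum the current input and the previous
        # hidden state, then squash the result with tanh.
        h = np.tanh(x_t + h)
    return sigmoid(h)                  # final output from the last layer

# Hypothetical usage: a sequence of three time steps, each a 4-dimensional vector
xs = [np.random.randn(4) for _ in range(3)]
print(simple_rnn_forward(xs))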
Long- and short-term dependencies in RNNs

The architecture of an RNN makes it ideal for handling sequential data. Let's take a look at some concrete examples, to understand how an RNN handles different lengths of sequential data.

Let's first take a look at a short piece of text as our sequential data:

"THE WEATHER IS HOT TODAY"

We can treat this short sentence as sequential data by breaking it down into five different inputs, with one word at each time step. This is illustrated in the following diagram:

(Diagram: the sentence "THE WEATHER IS HOT TODAY" split into five time steps, one word per step.)

Now, suppose that we are building a simple RNN to predict whether it is snowing, based on this sequential data. The RNN would look something like this:

(Diagram: the five words fed into successive RNN layers, with the final layer producing the output "Not Snowing"; the word HOT at time step 4 is circled as the critical input.)

The critical piece of information in the sequence is the word HOT, at time step 4. With this piece of information, the RNN is able to easily predict that it is not snowing today. Notice that the critical piece of information came just shortly before the final output. In other words, we would say that there is a short-term dependency in this sequence.

Clearly, RNNs have no problem with short-term dependencies. But what about long-term dependencies? Let's take a look now at a longer sequence of text. Let's use the following paragraph as an example:

"I really liked the movie but I was disappointed in the service and cleanliness of the cinema. The cinema should be better maintained in order to provide a better experience for customers."

Our goal is to predict whether the customer liked the movie. Clearly, the customer liked the movie but not the cinema, which was the main complaint in the paragraph. Let's break up the paragraph into a sequence of inputs, with one word at each time step (32 time steps for 32 words in the paragraph). The RNN would look like this:

(Diagram: the 32 words of the paragraph fed into successive RNN layers, with the critical words "liked the movie" appearing near the start of the sequence.)

The critical words, liked the movie, appear between time steps 3 and 5. Notice that there is a significant gap between the critical time steps and the output time step, as the rest of the text was largely irrelevant to the prediction problem (whether the customer liked the movie). In other words, we say that there is a long-term dependency in this sequence.

Unfortunately, RNNs do not work well with long-term dependency sequences. RNNs have a good short-term memory, but a bad long-term memory. To understand why this is so, we need to understand the vanishing gradient problem when training neural networks.

The vanishing gradient problem

The vanishing gradient problem is a problem encountered when training deep neural networks using gradient-based methods such as backpropagation. Recall that in previous chapters, we discussed the backpropagation algorithm for training neural networks. In particular, the loss function provides information on the accuracy of our predictions, and allows us to adjust the weights in each layer, to reduce the loss.

So far, we have assumed that backpropagation works perfectly. Unfortunately, that is not true. When the loss is propagated backward, it tends to decrease with each successive layer:

(Diagram: the loss propagated backward through the layers, with its magnitude decreasing at each step.)

As a result, by the time the loss is propagated back toward the first few layers, the loss has already diminished so much that the weights do not change much at all. With such a small loss being propagated backward, it is impossible to adjust and train the weights of the first few layers. This phenomenon is known as the vanishing gradient problem in machine learning.

Interestingly, the vanishing gradient problem does not affect CNNs as severely in computer vision problems. However, when it comes to sequential data and RNNs, the vanishing gradient can have a significant impact. The vanishing gradient problem means that RNNs are unable to learn from early layers (early time steps), which causes them to have poor long-term memory. To address this problem, Hochreiter and others proposed a clever variation of the RNN, known as the long short-term memory (LSTM) network.

The LSTM network

LSTMs are a variation of RNNs, and they solve the long-term dependency problem faced by conventional RNNs. Before we dive into the technicalities of LSTMs, it's useful to understand the intuition behind them.

LSTMs - the intuition

As we explained in the previous section, LSTMs were designed to overcome the problem with long-term dependencies. Let's assume we have this movie review:

"I loved this movie! The action sequences were on point and the acting was terrific. Highly recommended!"

Our task is to predict whether the reviewer liked the movie. As we read this review, we immediately understand that it is positive. In particular, the following words are the most important: loved, on point, terrific, and highly recommended.

If we think about it, only those highlighted words are important, and we can ignore the rest of the words. This is an important strategy. By selectively remembering certain words, we can ensure that our neural network does not get bogged down by too many unnecessary words that do not provide much predictive power. This is an important distinction of LSTMs over conventional RNNs. Conventional RNNs have a tendency to remember everything (even unnecessary inputs), which results in the inability to learn from long sequences.
By contrast, LSTMs selectively remember important inputs (such as the highlighted words in the preceding example), and this allows them to handle both short- and long-term dependencies. The ability of LSTMs to learn from both short- and long-term dependencies gives them their name: long short-term memory (LSTM).

What's inside an LSTM network?

LSTMs have the same repeating structure as the RNNs that we saw previously. However, LSTMs differ in their internal structure. The following diagram shows a high-level overview of the repeating unit of an LSTM:

(Diagram: the repeating unit of an LSTM, with the previous cell state C_{t-1} and previous hidden state h_{t-1} flowing in, the input x_t, and the output cell state C_t and hidden state h_t flowing out; the unit combines sigmoid and tanh functions through multiplication, addition, and concatenation.)

The preceding diagram might look complicated to you now, but don't worry, as we'll go through everything step by step.

As we mentioned in the previous section, LSTMs have the ability to selectively remember important inputs and to forget the rest. The internal structure of an LSTM allows it to do that. An LSTM differs from a conventional RNN in that it has a cell state, in addition to the hidden state. You can think of the cell state as the current memory of the LSTM. It flows from one repeating structure to the next, conveying important information that has to be retained at the moment. In contrast, the hidden state is the overall memory of the entire LSTM. It contains everything that we have seen so far, both important and unimportant information.

How does the LSTM manage the flow of information between the hidden state and the cell state? It does so via three important gates:

+ Forget gate
+ Input gate
+ Output gate

Just like physical gates, the three gates restrict the flow of information from the hidden state to the cell state.

Forget gate

The Forget gate (f) of an LSTM is highlighted in the following diagram:

(Diagram: the forget gate within the repeating unit, taking the previous hidden state h_{t-1} and the current input x_t.)

The Forget gate (f) forms the first part of the LSTM repeating unit, and its role is to decide how much data we should forget or remember from the previous cell state. It does so by first concatenating the previous hidden state (h_{t-1}) and the current input (x_t), and then passing the concatenated vector through a sigmoid function. Recall that the sigmoid function outputs a vector with values between 0 and 1. A value of 0 means to stop the information from passing through (forget), and a value of 1 means to pass the information through (remember).

The output of the forget gate, f, is as follows:

f = sigmoid(concatenate(h_{t-1}, x_t))

Input gate

The next gate is the Input gate (i). The Input gate (i) controls how much information is passed to the current cell state. The input gate of an LSTM is highlighted in the following diagram:

(Diagram: the input gate within the repeating unit, taking the previous hidden state h_{t-1} and the current input x_t.)

Just like the forget gate, the Input gate (i) takes as input the concatenation of the previous hidden state (h_{t-1}) and the current input (x_t). It then passes two copies of the concatenated vector through a sigmoid function and a tanh function respectively, before multiplying them together. The output of the input gate, i, is as follows:

i = sigmoid(concatenate(h_{t-1}, x_t)) * tanh(concatenate(h_{t-1}, x_t))

At this point, we have everything required to compute the current cell state (C_t) to be output.
This is illustrated in the following diagram:

(Diagram: the path from the previous cell state C_{t-1} to the output cell state C_t, combining the forget gate and the input gate.)

The current cell state, C_t, is as follows:

C_t = (C_{t-1} * f) + i

Output gate

Finally, the output gate controls how much information is to be retained in the hidden state. The output gate is highlighted in the following diagram:

(Diagram: the output gate within the repeating unit, taking the previous hidden state h_{t-1}, the current input x_t, and the current cell state C_t.)

First, we concatenate the previous hidden state (h_{t-1}) and the current input (x_t), and pass the result through a sigmoid function. Then, we take the current cell state (C_t) and pass it through a tanh function. Finally, we multiply the two together, and the result is passed to the next repeating unit as the hidden state (h_t). This process is summarized by the following equation:

h_t = sigmoid(concatenate(h_{t-1}, x_t)) * tanh(C_t)
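For readers who prefer to see the gates in code, here is a minimal NumPy sketch of a single LSTM step following the equations above (our own addition, not from the chapter). As with the RNN equations, the chapter omits the learned weight matrices and biases for readability, so this sketch introduces placeholder weight matrices (W_f, W_i, W_g, and W_o) purely so that the shapes line up; Keras learns and applies these weights for us:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W_f, W_i, W_g, W_o):
    z = np.concatenate([h_prev, x_t])          # concatenate hidden state and input
    f = sigmoid(W_f @ z)                       # forget gate: how much of c_prev to keep
    i = sigmoid(W_i @ z) * np.tanh(W_g @ z)    # input gate: new information to add
    c = f * c_prev + i                         # updated cell state
    h = sigmoid(W_o @ z) * np.tanh(c)          # output gate produces the new hidden state
    return h, c

# Hypothetical shapes: 3-dimensional inputs, 2-dimensional hidden and cell states
np.random.seed(0)
input_dim, hidden_dim = 3, 2
W_f, W_i, W_g, W_o = [np.random.randn(hidden_dim, hidden_dim + input_dim) for _ in range(4)]
h, c = np.zeros(hidden_dim), np.zeros(hidden_dim)
for x_t in np.random.randn(5, input_dim):      # a sequence of five time steps
    h, c = lstm_step(x_t, h, c, W_f, W_i, W_g, W_o)
print(h)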
Making sense of this

Many beginners to LSTMs often get intimidated by the mathematical formulas involved. Although it is useful to understand the mathematical functions behind LSTMs, it is often difficult (and not very useful) to try to relate the intuition behind LSTMs to the mathematical formulas. Instead, it is more useful to understand LSTMs at a high level, and then to apply them as a black box algorithm, as we shall see in the later sections.

The IMDb movie reviews dataset

At this point, let's take a quick look at the IMDb movie reviews dataset before we start building our model. It is always a good practice to understand our data before we build our model.

The IMDb movie reviews dataset is a corpus of movie reviews posted on the popular movie reviews website https://www.imdb.com/. Each movie review has a label indicating whether the review is positive (1) or negative (0).

The IMDb movie reviews dataset is provided in Keras, and we can import it by simply calling the following code:

from keras.datasets import imdb
training_set, testing_set = imdb.load_data(index_from = 3)
X_train, y_train = training_set
X_test, y_test = testing_set

We can print out the first movie review as follows:

print(X_train[0])

We'll see the following output (truncated):

[1, 14, 22, 16, 43, 530, 973, 1622, 1385, 65, 458, ...]

We see a sequence of numbers, because Keras has already encoded the words as numbers as part of the preprocessing. We can convert the review back to words, using the built-in word-to-index dictionary provided by Keras as part of the dataset:

word_to_id = imdb.get_word_index()
word_to_id = {key: (value + 3) for key, value in word_to_id.items()}
word_to_id["<PAD>"] = 0
word_to_id["<START>"] = 1
word_to_id["<UNK>"] = 2
id_to_word = {value: key for key, value in word_to_id.items()}

(We shift every index by 3 and add the three special tokens because load_data reserves the indexes 0, 1, and 2 for padding, the start-of-review marker, and out-of-vocabulary words respectively.)

Now, we can show the original review in words:

print(' '.join(id_to_word[id] for id in X_train[159]))

We'll see the following output (truncated):

<START> a rating of 1 does not begin to express how dull depressing and relentlessly bad ...

Clearly, the sentiment of this review is negative! Let's make sure by printing the y value:

print(y_train[159])

We'll see the following output:

0

A y value of 0 refers to a negative review, and a y value of 1 refers to a positive review. Let's take a look at an example of a positive review:

print(' '.join(id_to_word[id] for id in X_train[6]))

We'll get the following output:

<START> lavish production values and solid performances in this straightforward adaption of jane austen's satirical classic about the marriage game within and between the classes in provincial 18th century england northam and paltrow are a salutory mixture as friends who must pass through jealousies and lies to discover that they love each other good humor is a sustaining virtue which goes a long way towards explaining the accessibility of the aged source material which has been toned down a bit in its harsh scepticism i liked the look of the film and how shots were set up and i thought it didn't rely too much on successions of head shots like most other films of the 80s and 90s do very good results

To check the sentiment of the review, try this:

print(y_train[6])

We get the following output:

1
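As an optional extra check (our own addition, not part of the chapter's code), we can confirm how the two classes are distributed in the training set, and get a feel for how much review lengths vary, which will matter when we prepare the input later:

import numpy as np

# Count the number of negative (0) and positive (1) reviews in the training set
print(np.bincount(np.asarray(y_train)))

# Review lengths vary widely, which is why we will need zero padding later on
review_lengths = [len(review) for review in X_train]
print('Shortest review:', min(review_lengths), 'words')
print('Longest review:', max(review_lengths), 'words')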
Representing words as vectors

So far, we have looked at what RNNs and LSTM networks are. There remains an important question that we need to address: how do we represent words as input data for our neural network? In the case of CNNs, we saw how images are essentially three-dimensional vectors/matrices, with dimensions represented by the image width, height, and the number of channels (three channels for color images). The values in the vectors represent the intensity of each individual pixel.

One-hot encoding

How do we create a similar vector/matrix for words so that they can be used as input to our neural network? In earlier chapters, we saw how categorical variables, such as the day of the week, can be one-hot encoded into numerical variables by creating a new feature for each value. It may be tempting to think that we can also one-hot encode our sentences in this manner, but such a method has significant disadvantages. Let's consider phrases such as the following:

+ Happy, excited
+ Happy
+ Excited

The following diagram shows a one-hot encoded two-dimensional representation of these phrases:

(Diagram: a two-dimensional plot with a Happy axis and an Excited axis, with each phrase plotted according to which words it contains.)

In this vector representation, the phrase "Happy, excited" has a value of 1 for both axes, because both the words Happy and Excited are present in the phrase. Similarly, the phrase Happy has a value of 1 for the Happy axis and a value of 0 for the Excited axis, because it only contains the word Happy. The full two-dimensional vector representation is shown in the following table:

Phrase         | Happy | Excited
Happy, excited | 1     | 1
Happy          | 1     | 0
Excited        | 0     | 1

There are several problems with this one-hot encoded representation. Firstly, the number of axes depends on the number of unique words in our dataset. As we can imagine, there are tens of thousands of unique words in the English dictionary. If we were to create an axis for each word, then the size of our vector would quickly grow out of hand. Secondly, such a vector representation would be extremely sparse (full of zeros). This is because most words appear only once in each sentence or paragraph. It is difficult to train a neural network on such a sparse vector.

Finally, and perhaps most importantly, such a vector representation does not take into consideration the similarity of words. In our preceding example, Happy and Excited are both words that convey positive emotions. However, this one-hot encoded representation does not take this similarity into consideration. Thus, important information is lost when words are represented in this form.

As we can see, there are significant disadvantages associated with one-hot encoded vectors. In the next section, we'll look at word embeddings, which overcome these disadvantages.

Word embeddings

Word embeddings are a learned form of vector representation for words. The main advantage of word embeddings is that they have fewer dimensions than the one-hot encoded representation, and they place similar words close to one another. The following diagram shows an example of a word embedding:

(Diagram: an embedding space in which words that convey similar emotions are placed close together, while words such as Disappointed, Angry, and Furious sit at the opposite end of the spectrum, far away from the positive words.)

We won't go into detail regarding the creation of word embeddings, but essentially they are trained using supervised learning algorithms. Keras also provides a convenient API for training our own word embeddings. In this project, we will train our word embeddings on the IMDb movie reviews dataset.
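As a toy illustration (ours, not part of the chapter's code), the Keras Embedding layer that we will use later simply maps each integer word index to a dense vector. Before training, these vectors are random; they only become meaningful as the model is trained:

import numpy as np
from keras.models import Sequential
from keras.layers import Embedding

# A hypothetical vocabulary of 50 words, embedded into 8 dimensions
toy_model = Sequential()
toy_model.add(Embedding(input_dim=50, output_dim=8))
toy_model.compile(optimizer='rmsprop', loss='mse')   # not trained; compiled only for completeness

# Two short "sentences", already encoded as word indices (made-up values)
sentences = np.array([[4, 7, 12, 0],
                      [9, 4, 3, 0]])
vectors = toy_model.predict(sentences)
print(vectors.shape)   # (2, 4, 8): one 8-dimensional vector per word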
Model architecture

Let's take a look at the model architecture of our IMDb movie review sentiment analyzer, shown in the following diagram:

(Diagram: the input movie reviews flowing through a word embedding layer, an LSTM layer, and a dense layer, to produce the output sentiment.)

This should be fairly familiar to you by now! Let's go through each component briefly.

Input

The input to our neural network is IMDb movie reviews. The reviews are in the form of English sentences. As we've seen, the dataset provided in Keras has already encoded the English words as numbers, as neural networks require numerical inputs. However, there remains a problem we need to address. As we know, movie reviews have different lengths. If we were to represent the reviews as vectors, then different reviews would have different vector lengths, which is not acceptable for a neural network. Let's keep this in mind for now, and we'll see how we can address this issue as we build our neural network.

Word embedding layer

The first layer in our neural network is the word embedding layer. As we've seen earlier, word embeddings are a learned form of vector representation for words. The word embedding layer takes words as input, and then outputs a vector representation of those words. The vector representation should place similar words close to one another, and dissimilar words far apart. The word embedding layer learns this vector representation during training.

LSTM layer

The LSTM layer takes as input the vector representation of the words from the word embedding layer, and learns how to classify the vector representation as positive or negative. As we've seen earlier, LSTMs are a variation of RNNs, which we can think of as multiple neural networks stacked on top of one another.

Dense layer

The next layer is the dense layer (fully connected layer). The dense layer takes as input the output from the LSTM layer, and transforms it in a fully connected manner. Then, we apply a sigmoid activation on the dense layer, so that the final output is between 0 and 1.

Output

The output is a probability between 0 and 1, representing the probability that the movie review is positive or negative. A probability near 1 means that the movie review is positive, while a probability near 0 means that the movie review is negative.

Model building in Keras

We're finally ready to start building our model in Keras. As a reminder, the model architecture that we're going to use is shown in the previous section.

Importing data

First, let's import the dataset. The IMDb movie reviews dataset is already provided in Keras, so we can import it directly:

from keras.datasets import imdb

The imdb class has a load_data main function, which takes in the following important argument:

+ num_words: This is defined as the maximum number of unique words to be loaded. Only the n most common unique words (as they appear in the dataset) will be loaded. If n is small, the training time will be faster, at the expense of accuracy. Let's set num_words = 10000.

The load_data function returns two tuples as the output. The first tuple holds the training set, while the second tuple holds the testing set. Note that the load_data function splits the data equally and randomly into training and testing sets.

The following code imports the data, with the previously mentioned parameters:

training_set, testing_set = imdb.load_data(num_words = 10000)
X_train, y_train = training_set
X_test, y_test = testing_set

Let's do a quick check to see the amount of data we have:

print("Number of training samples = {}".format(X_train.shape[0]))
print("Number of testing samples = {}".format(X_test.shape[0]))

We'll see the following output:

Number of training samples = 25000
Number of testing samples = 25000

We can see that we have 25000 training samples and 25000 testing samples.

Zero padding

Before we can use the data as input to our neural network, we need to address an issue. Recall that in the previous section, we mentioned that movie reviews have different lengths, and therefore the input vectors have different sizes. This is an issue, as neural networks only accept fixed-size vectors. To address this issue, we are going to define a maxlen parameter. The maxlen parameter shall be the maximum length of each movie review. Reviews that are longer than maxlen will be truncated, and reviews that are shorter than maxlen will be padded with zeros. The following diagram illustrates the zero padding process:

(Diagram: the input movie review "I LOVE THIS MOVIE!", of length 4, is word-encoded as [1, 2, 3, 4] and then zero padded with maxlen = 10, giving the output vector [1, 2, 3, 4, 0, 0, 0, 0, 0, 0].)

Using zero padding, we ensure that the input will have a fixed vector length. As always, Keras provides a handy function to perform zero padding. Under the Keras preprocessing module, there's a sequence class that allows us to perform preprocessing for sequential data. Let's import the sequence class:

from keras.preprocessing import sequence

The sequence class has a pad_sequences function that allows us to perform zero padding on our sequential data. Let's truncate and pad our training and testing data using a maxlen of 100. The following code shows how we can do this:

X_train_padded = sequence.pad_sequences(X_train, maxlen=100)
X_test_padded = sequence.pad_sequences(X_test, maxlen=100)

Now, let's verify the vector length after zero padding:

print("X_train vector shape = {}".format(X_train_padded.shape))
print("X_test vector shape = {}".format(X_test_padded.shape))

We'll see the following output:

X_train vector shape = (25000, 100)
X_test vector shape = (25000, 100)
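As a quick aside (our own toy example, not from the chapter), this is what pad_sequences does to sequences of uneven length. One detail worth knowing: by default, Keras pads and truncates at the front of each sequence (padding='pre' and truncating='pre'), so pass padding='post' if you want the zeros appended at the end, as in the diagram above:

from keras.preprocessing import sequence

# Two hypothetical reviews, encoded as word indices, of lengths 3 and 7
toy_reviews = [[11, 5, 9],
               [3, 8, 2, 7, 6, 4, 12]]

# The short review is padded with zeros and the long one is truncated,
# so both come out with exactly maxlen entries.
print(sequence.pad_sequences(toy_reviews, maxlen=5))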
Word embedding and LSTM layers

With our input preprocessed, we can now turn our attention to model building. As always, we will use the Sequential class in Keras to build our model. Recall that the Sequential class allows us to stack layers on top of one another, making it really easy to build complex models layer by layer.

As always, let's define a new Sequential class:

from keras.models import Sequential
model = Sequential()

We can now add the word embedding layer to our model. The word embedding layer can be imported directly from keras.layers as follows:

from keras.layers import Embedding

The Embedding class takes the following important arguments:

+ input_dim: The input dimension of the word embedding layer. This should be the same as the num_words parameter that we used when we loaded our data. Essentially, this is the maximum number of unique words in our dataset.
+ output_dim: The output dimension of the word embedding layer. This is a hyperparameter to be fine-tuned. For now, let's use a value of 128.

We can add an embedding layer with the previously mentioned parameters to our Sequential model as follows:

model.add(Embedding(input_dim = 10000, output_dim = 128))

Similarly, we can add an LSTM layer directly from keras.layers as follows:

from keras.layers import LSTM

The LSTM class takes the following important arguments:

+ units: This refers to the number of recurrent units in the LSTM layer. A larger number of units results in a more complex model, at the expense of training time and overfitting. For now, let's use a typical value of 128 for the number of units.
+ activation: This refers to the type of activation function applied to the cell state and the hidden state. The default value is the tanh function.
+ recurrent_activation: This refers to the type of activation function applied to the forget, input, and output gates. The default value is the sigmoid function.

You might notice that the choice of activation functions is rather limited in Keras. Instead of selecting individual activations for the forget, input, and output gates, we are limited to choosing a single activation function for all three gates. This is, unfortunately, a limitation that we need to work with. However, the good news is that this deviation from theory does not significantly affect our results. The LSTM that we build in Keras is perfectly able to learn from the sequential data.

We can add an LSTM layer with the previously mentioned parameters to our Sequential model as follows:

model.add(LSTM(units=128))

Finally, we add a Dense layer with sigmoid as the activation function. Recall that the purpose of this layer is to ensure that the output of our model has a value between 0 and 1, representing the probability that the movie review is positive. We can add a Dense layer as follows:

from keras.layers import Dense
model.add(Dense(units=1, activation='sigmoid'))

The dense layer is the final layer in our neural network. Let's verify the structure of our model by calling the summary() function:

model.summary()

We get the following output:

(Model summary table listing the Embedding, LSTM, and Dense layers, along with their output shapes and parameter counts.)

Nice! We can see that the structure of our Keras model matches the model architecture in the diagram that we introduced at the start of the previous section.
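As a rough sanity check (this arithmetic is ours, not from the book), you can work out the parameter counts that the summary should report, assuming the default Keras layer settings with biases enabled:

# Embedding: one 128-dimensional vector per word in the 10,000-word vocabulary
embedding_params = 10000 * 128                    # 1,280,000
# LSTM: 4 gates, each with an input kernel, a recurrent kernel, and a bias
lstm_params = 4 * (128 * 128 + 128 * 128 + 128)   # 131,584
# Dense: one weight per LSTM unit, plus a bias
dense_params = 128 * 1 + 1                        # 129
print(embedding_params + lstm_params + dense_params)   # 1,411,713 parameters in total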
Compiling and training models

With the model building complete, we're ready to compile and train our model. By now, you should be familiar with model compilation in Keras. As always, there are certain parameters we need to decide on when we compile our model. They are as follows:

+ Loss function: We use a binary_crossentropy loss function when the target output is binary, and a categorical_crossentropy loss function when the target output is multi-class. Since the sentiment of movie reviews in this project is binary (that is, positive or negative), we will use a binary_crossentropy loss function.
+ Optimizer: The choice of optimizer is an interesting problem in LSTMs. Without getting into the technicalities, certain optimizers may not work for certain datasets, due to the vanishing gradient and the exploding gradient problem (the opposite of the vanishing gradient problem). It is often impossible to know beforehand which optimizer works better for a dataset. Therefore, the best way to know is to train different models using different optimizers, and to use the optimizer that gives the best results. Let's try the SGD, RMSprop, and Adam optimizers.

We can compile our model as follows:

# try the SGD optimizer first
optimizer = 'sgd'
model.compile(loss='binary_crossentropy', optimizer = optimizer)

Now, let's train our model for 10 epochs, using the testing set as the validation data. We can do so as follows:

scores = model.fit(x=X_train_padded, y=y_train,
                   batch_size = 128, epochs=10,
                   validation_data=(X_test_padded, y_test))

The scores object returned is a Keras History object, whose history attribute is a Python dictionary providing the training and validation accuracy and the loss per epoch.

Before we go on to analyze our results, let's put all our code into a single function. This allows us to easily test and compare the performance of different optimizers. We define a train_model() function that takes in an optimizer as an argument:

def train_model(optimizer, X_train, y_train, X_val, y_val):
    model = Sequential()
    model.add(Embedding(input_dim = 10000, output_dim = 128))
    model.add(LSTM(units=128))
    model.add(Dense(units=1, activation='sigmoid'))

    model.compile(loss='binary_crossentropy', optimizer = optimizer,
                  metrics=['accuracy'])

    scores = model.fit(X_train, y_train, batch_size=128, epochs=10,
                       validation_data=(X_val, y_val), verbose=0)

    return scores, model

Using this function, let's train three different models using three different optimizers: the SGD, RMSprop, and Adam optimizers:

SGD_score, SGD_model = train_model(optimizer='sgd',
                                   X_train=X_train_padded,
                                   y_train=y_train,
                                   X_val=X_test_padded,
                                   y_val=y_test)

RMSprop_score, RMSprop_model = train_model(optimizer='RMSprop',
                                           X_train=X_train_padded,
                                           y_train=y_train,
                                           X_val=X_test_padded,
                                           y_val=y_test)

Adam_score, Adam_model = train_model(optimizer='adam',
                                     X_train=X_train_padded,
                                     y_train=y_train,
                                     X_val=X_test_padded,
                                     y_val=y_test)

Analyzing the results

Let's plot the validation accuracy per epoch for the three different models. First, we plot for the model trained using the SGD optimizer:

from matplotlib import pyplot as plt

plt.plot(range(1,11), SGD_score.history['acc'], label='Training Accuracy')
plt.plot(range(1,11), SGD_score.history['val_acc'], label='Validation Accuracy')
plt.axis([1, 10, 0, 1])
plt.xlabel('Epoch')
plt.ylabel('Accuracy')
plt.title('Train and Validation Accuracy using SGD Optimizer')
plt.legend()
plt.show()

We get the following output:

(Plot: training and validation accuracy per epoch for the SGD optimizer; both curves stay flat at around 0.5 for all 10 epochs.)

Did you notice anything wrong? The training and validation accuracy is stuck at 50%! Essentially, this shows that the training has failed, and our neural network performs no better than a random coin toss for this binary classification task. Clearly, the SGD optimizer is not suitable for this dataset and this LSTM network. Can we do better if we use another optimizer? Let's try the RMSprop optimizer.
We plot the training and validation accuracy for the model trained using the RMSprop optimizer, as shown in the following code:

plt.plot(range(1,11), RMSprop_score.history['acc'], label='Training Accuracy')
plt.plot(range(1,11), RMSprop_score.history['val_acc'], label='Validation Accuracy')
plt.axis([1, 10, 0, 1])
plt.xlabel('Epoch')
plt.ylabel('Accuracy')
plt.title('Train and Validation Accuracy using RMSprop Optimizer')
plt.legend()
plt.show()

We get the following output:

(Plot: training and validation accuracy per epoch for the RMSprop optimizer; the training accuracy climbs above 95%, while the validation accuracy levels off at around 85%.)

That's much better! Within 10 epochs, our model is able to achieve a training accuracy of more than 95% and a validation accuracy of around 85%. That's not bad at all. Clearly, the RMSprop optimizer performs better than the SGD optimizer for this task.

Finally, let's try the Adam optimizer and see how it performs. We plot the training and validation accuracy for the model trained using the Adam optimizer, as shown in the following code:

plt.plot(range(1,11), Adam_score.history['acc'], label='Training Accuracy')
plt.plot(range(1,11), Adam_score.history['val_acc'], label='Validation Accuracy')
plt.axis([1, 10, 0, 1])
plt.xlabel('Epoch')
plt.ylabel('Accuracy')
plt.title('Train and Validation Accuracy using Adam Optimizer')
plt.legend()
plt.show()

We get the following output:

(Plot: training and validation accuracy per epoch for the Adam optimizer; the training accuracy approaches 100%, while the validation accuracy stays at around 80%.)

The Adam optimizer does pretty well. From the preceding graph, we can see that the training accuracy is almost 100% after 10 epochs, while the validation accuracy is around 80%. This gap of 20% suggests that overfitting is happening when the Adam optimizer is used.

By contrast, the gap between training and validation accuracy is smaller for the RMSprop optimizer. Hence, we conclude that the RMSprop optimizer is the most suitable for this dataset and this LSTM network, and we shall use the model built with the RMSprop optimizer from this point onward.

Confusion matrix

In Chapter 2, Diabetes Prediction with Multilayer Perceptrons, we saw how the confusion matrix is a useful visualization tool for evaluating the performance of our model. Let's also use the confusion matrix to evaluate the performance of our model in this project. To recap, these are the definitions of the terms in the confusion matrix:

+ True negative: The actual class is negative (negative sentiment), and the model also predicted negative
+ False positive: The actual class is negative (negative sentiment), but the model predicted positive
+ False negative: The actual class is positive (positive sentiment), but the model predicted negative
+ True positive: The actual class is positive (positive sentiment), and the model predicted positive

We want our false positive and false negative numbers to be as low as possible, and our true negative and true positive numbers to be as high as possible. We can construct a confusion matrix using the confusion_matrix class from sklearn, using seaborn for visualization:

from sklearn.metrics import confusion_matrix
import seaborn as sns

plt.figure(figsize=(10, 7))
y_test_pred = RMSprop_model.predict_classes(X_test_padded)
c_matrix = confusion_matrix(y_test, y_test_pred)
ax = sns.heatmap(c_matrix, annot=True,
                 xticklabels=['Negative Sentiment', 'Positive Sentiment'],
                 yticklabels=['Negative Sentiment', 'Positive Sentiment'],
                 cbar=False, cmap='Blues', fmt='g')
ax.set_xlabel("Prediction")
ax.set_ylabel("Actual")
plt.show()

We get the following output:

(Heatmap: the confusion matrix, with Prediction on the x axis and Actual on the y axis; most samples fall into the true negative and true positive cells.)

From the preceding confusion matrix, we can see that most of the testing data was classified correctly, with the rate of true negatives and true positives at around 85%. In other words, our model is 85% accurate at predicting sentiment for movie reviews.
That's pretty impressive! Let's take a look at some of the wrongly classified samples, and see where the model got it wrong. The following code captures the indexes of the wrongly classified samples:

false_negatives = []
false_positives = []

for i in range(len(y_test_pred)):
    if y_test_pred[i][0] != y_test[i]:
        if y_test[i] == 0: # False Positive
            false_positives.append(i)
        else:
            false_negatives.append(i)

Let's first take a look at the false positives. As a reminder, false positives refer to movie reviews that were negative, but that our model wrongly classified as positive.
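To read any of these reviews in plain English, we can reuse the id_to_word dictionary that we built earlier in the chapter. The snippet below is one possible way to do this (our own addition, not part of the chapter's code):

# Print one misclassified review, its true label, and the model's prediction
idx = false_positives[0]    # pick any index from the list
print(' '.join(id_to_word.get(word_id, '<UNK>') for word_id in X_test[idx]))
print('Actual label:', y_test[idx])    # 0 = negative
print('Predicted probability of a positive review:',
      RMSprop_model.predict(X_test_padded[idx:idx + 1])[0][0])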
We have selected an interesting false positive; it is shown as follows:

"The sweet is never as sweet without the sour". This quote was essentially the theme for the movie in my opinion. It is a movie that really makes you step back and look at your life and how you live it. You cannot really appreciate the better things in life (the sweet) like love until you have experienced the bad (the sour). Only complaint is that the movie gets very twisted at points and is hard to really understand ... I recommend you watch it and see for yourself.

Even as a human, it is hard to predict the sentiment of this movie review! The first sentence of the review probably sets the tone of the reviewer. However, it is written in a really subtle manner, and it is difficult for our model to pick out the intention of the sentence. Furthermore, the middle of the review praises the movie, before ending with the conclusion that the movie gets very twisted at points and is hard to really understand.

Now, let's take a look at some false negatives:

I hate reading reviews that say something like 'don't waste your time, this film stinks on ice'. It does to that reviewer, yet for me it may have some sort of naive charm ... This film is not as good in my opinion as any of the earlier series entries ... But the acting is good and so is the lighting and the dialog. It's just lacking in energy and you'll likely figure out exactly what's going on and how it's all going to come out in the end not more than a quarter of the way through ... But still I'll recommend this one for at least a single viewing. I've watched it at least twice myself and got a reasonable amount of enjoyment out of it both times.

This review is definitely on the fence, and it looks pretty neutral, with the reviewer presenting both the good and the bad of the movie. Another point to note is that, at the start of the review, the reviewer quoted another reviewer (I hate reading reviews that say something like 'don't waste your time, this film stinks on ice'). Our model probably didn't understand that this quote is not the opinion of this reviewer. Quoted text is definitely a challenge for most NLP models.

Let's take a look at another false negative:

I just don't understand why this movie is getting beat up in here, jeez. It is mindless, it isn't polished ... I just don't get it. The jokes work on more than one level. If you didn't get it, I know what level you're at.

This movie review can be considered a rant against other movie reviews, similar to the previous review that we showed. The presence of multiple negative words in the review probably misled our model, and our model did not understand that the review was ranting against all the other negative reviews. Statistically speaking, such reviews are relatively rare, and it is difficult for our model to learn the true sentiment of such reviews.

Putting it all together

We have covered a lot in this chapter. Let's consolidate all our code here:

from keras.datasets import imdb
from keras.preprocessing import sequence
from keras.models import Sequential
from keras.layers import Dense, Embedding
from keras.layers import LSTM
from matplotlib import pyplot as plt
from sklearn.metrics import confusion_matrix
import seaborn as sns

# Import IMDB dataset
training_set, testing_set = imdb.load_data(num_words = 10000)
X_train, y_train = training_set
X_test, y_test = testing_set

print("Number of training samples = {}".format(X_train.shape[0]))
print("Number of testing samples = {}".format(X_test.shape[0]))

# Zero-Padding
X_train_padded = sequence.pad_sequences(X_train, maxlen=100)
X_test_padded = sequence.pad_sequences(X_test, maxlen=100)

print("X_train vector shape = {}".format(X_train_padded.shape))
print("X_test vector shape = {}".format(X_test_padded.shape))

# Model Building
def train_model(optimizer, X_train, y_train, X_val, y_val):
    model = Sequential()
    model.add(Embedding(input_dim = 10000, output_dim = 128))
    model.add(LSTM(units=128))
    model.add(Dense(units=1, activation='sigmoid'))

    model.compile(loss='binary_crossentropy', optimizer = optimizer,
                  metrics=['accuracy'])

    scores = model.fit(X_train, y_train, batch_size=128, epochs=10,
                       validation_data=(X_val, y_val))

    return scores, model

# Train Model
RMSprop_score, RMSprop_model = train_model(optimizer='RMSprop',
                                           X_train=X_train_padded,
                                           y_train=y_train,
                                           X_val=X_test_padded,
                                           y_val=y_test)

# Plot accuracy per epoch
plt.plot(range(1,11), RMSprop_score.history['acc'], label='Training Accuracy')
plt.plot(range(1,11), RMSprop_score.history['val_acc'], label='Validation Accuracy')
plt.axis([1, 10, 0, 1])
plt.xlabel('Epoch')
plt.ylabel('Accuracy')
plt.title('Train and Validation Accuracy using RMSprop Optimizer')
plt.legend()
plt.show()

# Plot confusion matrix
y_test_pred = RMSprop_model.predict_classes(X_test_padded)
c_matrix = confusion_matrix(y_test, y_test_pred)
ax = sns.heatmap(c_matrix, annot=True,
                 xticklabels=['Negative Sentiment', 'Positive Sentiment'],
                 yticklabels=['Negative Sentiment', 'Positive Sentiment'],
                 cbar=False, cmap='Blues', fmt='g')
ax.set_xlabel("Prediction")
ax.set_ylabel("Actual")
plt.show()
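As a final illustration (our own addition, not part of the chapter's code), here is one way we might run the trained model on a brand-new review. The helper below is a hypothetical sketch: it reuses imdb.get_word_index() to encode the raw text the same way the IMDb dataset was encoded, maps rare or unknown words to the out-of-vocabulary index, and applies the same zero padding before calling predict:

import numpy as np
from keras.datasets import imdb
from keras.preprocessing import sequence

def predict_sentiment(review, model, num_words=10000, maxlen=100):
    word_to_id = imdb.get_word_index()
    encoded = [1]                              # index 1 marks the start of a review
    for word in review.lower().split():
        index = word_to_id.get(word)
        if index is None or index + 3 >= num_words:
            encoded.append(2)                  # index 2 is the out-of-vocabulary token
        else:
            encoded.append(index + 3)          # indexes are offset by 3 in the dataset
    padded = sequence.pad_sequences([encoded], maxlen=maxlen)
    return model.predict(padded)[0][0]         # probability that the review is positive

# Hypothetical usage with the RMSprop model trained above
print(predict_sentiment("this movie was absolutely wonderful", RMSprop_model))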
Summary

In this chapter, we created an LSTM-based neural network that can predict the sentiment of movie reviews with 85% accuracy. We first looked at the theory behind recurrent neural networks and LSTMs, and we understood that they are a special class of neural network designed to handle sequential data, where the order of the data matters.

We also looked at how we can convert sequential data, such as a paragraph of text, into a numerical vector as input for neural networks. We saw how word embeddings can reduce the dimensionality of such a numerical vector into something more manageable for training neural networks, without necessarily losing information. A word embedding layer does this by learning which words are similar to one another, and placing such words in a cluster in the transformed vector space.

We also looked at how we can easily construct an LSTM neural network in Keras, using the Sequential model. We also investigated the effect of different optimizers on the LSTM, and we saw how the LSTM is unable to learn from the data when certain optimizers are used. More importantly, we saw that tuning and experimenting is an essential part of the machine learning process, in order to maximize our results.

Lastly, we analyzed our results, and we saw how LSTM-based neural networks can fail to detect sarcasm and other subtleties in our language. NLP is an extremely challenging subfield of machine learning that researchers are still working on today.

In the next chapter, Chapter 7, Implementing a Facial Recognition System with Neural Networks, we'll look at Siamese neural networks, and how they can be used to create a face recognition system.

Questions

1. What are sequential problems in machine learning?

Sequential problems are a class of problem in machine learning in which the order of the features presented to the model is important for making predictions. Examples of sequential problems include NLP problems (for example, speech and text) and time series problems.

2. What are some reasons that make it challenging for AI to solve sentiment analysis problems?

Human languages often contain words that have different meanings, depending on the context. It is therefore important for a machine learning model to fully understand the context before making a prediction. Furthermore, sarcasm is common in human languages, and is difficult for an AI-based model to comprehend.

3. How is an RNN different than a CNN?

RNNs can be thought of as multiple, recursive copies of a single neural network. Each layer in an RNN passes its output to the next layer as input. This allows an RNN to use sequential data as input.

4. What is the hidden state of an RNN?

The intermediate output passed from layer to layer in an RNN is known as the hidden state. The hidden state allows an RNN to maintain a memory of the intermediate states of the sequential data.

5. What are the disadvantages of using an RNN for sequential problems?

RNNs suffer from the vanishing gradient problem, which results in features early on in the sequence being "forgotten", due to the small weight updates assigned to them. Therefore, we say that RNNs have a long-term dependency problem.

6. How is an LSTM network different than a conventional RNN?

LSTM networks are designed to overcome the long-term dependency problem in conventional RNNs. An LSTM network contains three gates (the input, output, and forget gates), which allow it to place emphasis on certain features (that is, words), regardless of when the feature appears in the sequence.

7. What is the disadvantage of one-hot encoding words to transform them into numerical inputs?

The dimensionality of a one-hot encoded word vector tends to be huge (due to the number of different words in a language), which makes it difficult for the neural network to learn from the vector. Furthermore, a one-hot encoded vector does not take into consideration the relationships between similar words in a language.

8. What are word embeddings?

Word embeddings are a learned form of vector representation for words. The main advantage of word embeddings is that they have smaller dimensions than the one-hot encoded representation, and they place similar words close to one another. Word embeddings are usually the first layer in an LSTM-based neural network.

9. What important preprocessing step is required when working with textual data?
Textual data often has uneven lengths, which results in vectors of different sizes. Neural networks are unable to accept vectors of different sizes as input. Therefore, we apply zero padding as a preprocessing step, to truncate and pad the vectors evenly.

10. Tuning and experimenting is often an essential part of the machine learning process. What experimenting have we done in this project?

In this project, we experimented with different optimizers (the SGD, RMSprop, and Adam optimizers) for training our neural network. We found that the SGD optimizer was unable to train the LSTM network, while the RMSprop optimizer had the best accuracy.
