
Amina Rahman K_DL_Lab5

December 12, 2021

1 SENTIMENT ANALYSIS IN PYTHON 3 USING NLTK

2 1. INSTALL AND DOWNLOAD THE DATA
[1]: import nltk
nltk.download('twitter_samples')

3 2. TOKENIZING THE DATA


[ ]: from nltk.corpus import twitter_samples

[21]: pos_tweets = twitter_samples.strings('positive_tweets.json')
neg_tweets = twitter_samples.strings('negative_tweets.json')
text = twitter_samples.strings('tweets.20150430-223406.json')

[22]: print(pos_tweets[:10])
print()
print(neg_tweets[:10])

['#FollowFriday @France_Inte @PKuchly57 @Milipol_Paris for being top engaged members


in my community this week :)', '@Lamb2ja Hey James! How odd :/ Please call our Contact
Centre on 02392441234 and we will be able to assist you :) Many thanks!', '@DespiteOfficial
we had a listen last night :) As You Bleed is an amazing track. When are you in Scotland?!',
'@97sides CONGRATS :)', 'yeaaaah yippppy!!! my accnt verified rqst has succeed got a blue
tick mark on my fb profile :) in 15 days', '@BhaktisBanter @PallaviRuhail This one is irresistible
:)\n#FlipkartFashionFriday http://t.co/EbZ0L2VENM', "We don't like to keep our lovely
customers waiting for long! We hope you enjoy! Happy Friday! - LWWF :)
https://t.co/smyYriipxI", '@Impatientraider On second thought, there’s just not enough time for
a DD :) But new shorts entering system. Sheep must be buying.', 'Jgh , but we have to go to
Bayan :D bye', 'As an act of mischievousness, am calling the ETL layer of our in-house
warehousing app Katamari.\n\nWell… as the name implies :p.']

['hopeless for tmr :(', "Everything in the kids section of IKEA is so cute. Shame I'm nearly 19 in
2 months :(", '@Hegelbon That heart sliding into the waste basket. :(', '“@ketchBurning: I hate
Japanese call him "bani" :( :(”\n\nMe

too', 'Dang starting next week I have "work" :(', "oh god, my babies' faces :(
https://t.co/9fcwGvaki0", '@RileyMcDonough make me smile :((', '@f0ggstar @stuartthull
work neighbour on motors. Asked why and he said hates the updates on search :(
http://t.co/XvmTUikWln', 'why?:("@tahuodyy: sialan:(
https://t.co/Hv1i0xcrL2"', 'Athabasca glacier was there in #1948 :-( #athabasca #glacier
#jasper #jaspernationalpark #alberta #explorealberta #…
http://t.co/dZZdqmf7Cz']

[4]: nltk.download('punkt')

[26]: pos_tweets = twitter_samples.tokenized('positive_tweets.json')
neg_tweets = twitter_samples.tokenized('negative_tweets.json')

[6]: print(tweet_tokens[0])
print()
print(tweet_tokens[0][0])

['#FollowFriday', '@France_Inte', '@PKuchly57', '@Milipol_Paris', 'for', 'being', 'top',


'engaged', 'members', 'in', 'my', 'community', 'this', 'week', ':)']

#FollowFriday

4 3. NORMALIZING THE DATA


[36]: nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')

[8]: from nltk.tag import pos_tag


from nltk.corpus import twitter_samples

[9]: tweet_tokens = twitter_samples.tokenized('positive_tweets.json')


print(pos_tag(tweet_tokens[0]))

[('#FollowFriday', 'JJ'), ('@France_Inte', 'NNP'), ('@PKuchly57', 'NNP'),


('@Milipol_Paris', 'NNP'), ('for', 'IN'), ('being', 'VBG'), ('top', 'JJ'), ('engaged', 'VBN'),
('members', 'NNS'), ('in', 'IN'), ('my', 'PRP$'), ('community', 'NN'), ('this', 'DT'), ('week',
'NN'), (':)', 'NN')]
The pos_tag function returns a list of (token, tag) pairs; here is a list of the most common tags and their meanings:
• NNP: Noun, proper, singular
• NN: Noun, common, singular or mass
• IN: Preposition or conjunction, subordinating
• VBG: Verb, gerund or present participle
• VBN: Verb, past participle
In general, if a tag starts with NN, the word is a noun, and if it starts with VB, the word is a verb.

[10]: from nltk.stem.wordnet import WordNetLemmatizer

def lemmatize_sentence(tokens):
    lemmatizer = WordNetLemmatizer()
    lemmatized_sentence = []
    for word, tag in pos_tag(tokens):
        if tag.startswith('NN'):
            pos = 'n'
        elif tag.startswith('VB'):
            pos = 'v'
        else:
            pos = 'a'
        lemmatized_sentence.append(lemmatizer.lemmatize(word, pos))
    return lemmatized_sentence

print(lemmatize_sentence(tweet_tokens[0]))

['#FollowFriday', '@France_Inte', '@PKuchly57', '@Milipol_Paris', 'for', 'be', 'top', 'engage',


'member', 'in', 'my', 'community', 'this', 'week', ':)']
This code imports the WordNetLemmatizer class and initializes it to a variable, lemmatizer.
The function lemmatize_sentence first gets the part-of-speech tag of each token of a tweet. Within the if
statement, if the tag starts with NN, the token is lemmatized as a noun. Similarly, if the tag starts
with VB, the token is lemmatized as a verb; otherwise, it is lemmatized as an adjective.
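To see why the part-of-speech argument matters, here is a minimal sketch, not part of the original lab, using an illustrative word of our own choosing:

[ ]: from nltk.stem.wordnet import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
# With no POS hint, lemmatize() treats the word as a noun and leaves it unchanged.
print(lemmatizer.lemmatize('running'))       # running
# With pos='v', the verb is reduced to its base form.
print(lemmatizer.lemmatize('running', 'v'))  # run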

5 4. REMOVING NOISE FROM DATA


Noise is any part of the text that does not add meaning or information to data.
Noise is specific to each project, so what constitutes noise in one project may not be in a
different project. For instance, the most common words in a language are called stop words.
Some examples of stop words are “is”, “the”, and “a”. They are generally irrelevant when
processing language, unless a specific use case warrants their inclusion.
In this project, you will use regular expressions in Python to search for and remove these items:
• Hyperlinks - All hyperlinks in Twitter are converted to the URL shortener t.co. Therefore,
keeping them in the text processing would not add any value to the analysis.
• Twitter handles in replies - These Twitter usernames are preceded by a @ symbol, which
does not convey any meaning.
• Punctuation and special characters - While these often provide context to textual data, this
context is often difficult to process. For simplicity, you will remove all punctuation and
special characters from tweets.
To remove hyperlinks, you need to first search for a substring that matches a URL starting with
http:// or https://, followed by letters, numbers, or special characters. Once a pattern is matched,
the .sub() method replaces it with an empty string.
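As a quick illustration, and not part of the original notebook, the same URL pattern can be tried on a single made-up string with re.sub():

[ ]: import re

sample = 'check this out https://t.co/abc123'
# The matched URL is replaced with an empty string.
print(re.sub('http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+#]|[!*\(\),]|'
             '(?:%[0-9a-fA-F][0-9a-fA-F]))+', '', sample))
# prints: check this out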

[11]: import re, string

def remove_noise(tweet_tokens, stop_words = ()):

    cleaned_tokens = []

    for token, tag in pos_tag(tweet_tokens):
        token = re.sub('http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+#]|[!*\(\),]|'\
                       '(?:%[0-9a-fA-F][0-9a-fA-F]))+', '', token)
        token = re.sub("(@[A-Za-z0-9_]+)", "", token)

        if tag.startswith("NN"):
            pos = 'n'
        elif tag.startswith('VB'):
            pos = 'v'
        else:
            pos = 'a'

        lemmatizer = WordNetLemmatizer()
        token = lemmatizer.lemmatize(token, pos)

        if len(token) > 0 and token not in string.punctuation and token.lower() not in stop_words:
            cleaned_tokens.append(token.lower())
    return cleaned_tokens

This code creates a remove_noise() function that removes noise and incorporates the
normalization and lemmatization mentioned in the previous section. The code takes two
arguments: the tweet tokens and the tuple of stop words.
The code then uses a loop to remove the noise from the dataset. To remove hyperlinks, the code
first searches for a substring that matches a URL starting with http:// or https://, followed by
letters, numbers, or special characters. Once a pattern is matched, the .sub() method replaces it
with an empty string, ''.
Similarly, to remove @ mentions, the code substitutes the relevant part of the text using regular
expressions. The code uses the re library to search for @ symbols, followed by letters, numbers, or
underscores, and replaces them with an empty string.
Finally, we can remove punctuation using the string library.
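One detail worth noting, shown in a small sketch that is not part of the original lab: because membership on a string is a substring test, the check token not in string.punctuation drops single punctuation marks but keeps multi-character emoticons such as :).

[ ]: import string

print('!' in string.punctuation)   # True  -> '!' is dropped
print(':)' in string.punctuation)  # False -> ':)' is kept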

[12]: from nltk.corpus import stopwords


stop_words = stopwords.words('english')
print(remove_noise(tweet_tokens[0],stop_words))

['#followfriday', 'top', 'engage', 'member', 'community', 'week', ':)']

[27]: positive_cleaned_tokens_list = []
negative_cleaned_tokens_list = []

for tokens in pos_tweets:
    positive_cleaned_tokens_list.append(remove_noise(tokens, stop_words))

for tokens in neg_tweets:
    negative_cleaned_tokens_list.append(remove_noise(tokens, stop_words))

Next, compare the original tokens to the cleaned tokens for a sample tweet. If you'd like to test this,
add the following code to compare both versions of the 500th tweet in the list:

[28]: print(pos_tweets[500])
print(positive_cleaned_tokens_list[500])

['Dang', 'that', 'is', 'some', 'rad', '@AbzuGame', '#fanart', '!', ':D',


'https://t.co/bI8k8tb9ht']
['dang', 'rad', '#fanart', ':d']

6 5. DETERMINING WORD DENSITY


[29]: def get_all_words(cleaned_tokens_list):
    for tokens in cleaned_tokens_list:
        for token in tokens:
            yield token

all_pos_words = get_all_words(positive_cleaned_tokens_list)

[30]: from nltk import FreqDist

freq_dist_pos = FreqDist(all_pos_words)
print(freq_dist_pos.most_common(10))

[(':)', 3691), (':-)', 701), (':d', 658), ('thanks', 388), ('follow', 357), ('love', 333), ('…', 290),
('good', 283), ('get', 263), ('thank', 253)]
The .most_common() method lists the words which occur most frequently in the data. Save and
close the file after making these changes.
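For comparison, the same frequency distribution can be built over the negative tokens. This is an extra check that is not in the original notebook; the names all_neg_words and freq_dist_neg are ours.

[ ]: all_neg_words = get_all_words(negative_cleaned_tokens_list)

freq_dist_neg = FreqDist(all_neg_words)
# Expect emoticons such as :( near the top of the negative list.
print(freq_dist_neg.most_common(10))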

7 6. PREPARING DATA FOR THE MODEL


Sentiment analysis is the process of identifying the author's attitude toward the topic being
written about. You will create a training dataset to train a model. Sentiment analysis is a supervised
machine learning process, which requires you to associate each item in the dataset with a "sentiment"
label for training. In this project, the model will use the "positive" and "negative" sentiments.
Sentiment analysis can be used to categorize text into a variety of sentiments. For simplicity and
availability of the training dataset, this project trains the model in only two categories, positive
and negative.
A model is a description of a system using rules and equations. It may be as simple as an
equation which predicts the weight of a person given their height. The sentiment analysis model
that you will build will associate tweets with a positive or a negative sentiment. You will need
to split your dataset into two parts: the first part is used to build the model, and the second
part tests its performance.
In the data preparation step, you will prepare the data for sentiment analysis by converting
tokens to dictionary form and then splitting the data for training and testing purposes.
Converting Tokens to a Dictionary
First, you will prepare the data to be fed into the model. You will use the Naive Bayes classifier
in NLTK to perform the modeling exercise. Notice that the model requires not just a list of words
in a tweet, but a Python dictionary with words as keys and True as values. The following generator
function changes the format of the cleaned data accordingly.
Add the following code to convert the tweets from a list of cleaned tokens to dictionaries with
the tokens as keys and True as values. The corresponding generators are stored in
positive_tokens_for_model and negative_tokens_for_model.

[31]: def get_tweets_for_model(cleaned_tokens_list):
    for tweet_tokens in cleaned_tokens_list:
        yield dict([token, True] for token in tweet_tokens)

positive_tokens_for_model = get_tweets_for_model(positive_cleaned_tokens_list)
negative_tokens_for_model = get_tweets_for_model(negative_cleaned_tokens_list)
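To see the exact format the classifier expects, you can peek at the first converted tweet. This check is not part of the original lab; it builds a fresh generator so that positive_tokens_for_model itself is not consumed.

[ ]: sample_dict = next(get_tweets_for_model(positive_cleaned_tokens_list))
# Each tweet becomes a dict mapping every cleaned token to True,
# e.g. {'#followfriday': True, 'top': True, 'engage': True, ...}
print(sample_dict)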

7.0.1 Splitting the Dataset for Training and Testing the Model
Next, prepare the data for training the NaiveBayesClassifier class.

[32]: import random

positive_dataset = [(tweet_dict, "Positive")
                    for tweet_dict in positive_tokens_for_model]

negative_dataset = [(tweet_dict, "Negative")
                    for tweet_dict in negative_tokens_for_model]

dataset = positive_dataset + negative_dataset

random.shuffle(dataset)

train_data = dataset[:7000]
test_data = dataset[7000:]

This code attaches a Positive or Negative label to each tweet. It then creates a dataset by joining
the positive and negative tweets.
By default, the data contains all positive tweets followed by all negative tweets in sequence.
When training the model, you should provide a sample of your data that does not contain any
bias. To avoid bias, we've added code to randomly arrange the data using the .shuffle() method of random.
Finally, the code splits the shuffled data into a ratio of 70:30 for training and testing, respectively.
Since the number of tweets is 10000, you can use the first 7000 tweets from the shuffled dataset
for training the model and the final 3000 for testing the model.
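As a quick sanity check, not in the original notebook, you can confirm the split sizes and that shuffling left both labels well represented in the training slice; collections.Counter is used here only for this check.

[ ]: from collections import Counter

print(len(train_data), len(test_data))           # 7000 3000
# Count the Positive/Negative labels in the training split.
print(Counter(label for _, label in train_data))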

8 7. BUILDING AND TESTING THE MODEL


[33]: from nltk import classify
from nltk import NaiveBayesClassifier
classifier = NaiveBayesClassifier.train(train_data)

print("Accuracy is:", classify.accuracy(classifier, test_data))

print(classifier.show_most_informative_features(10))

Accuracy is: 0.9953333333333333


Most Informative Features
:( = True Negati : Positi = 2083.8 : 1.0
:) = True Positi : Negati = 1654.4 : 1.0
follower = True Positi : Negati = 21.5 : 1.0
welcome = True Positi : Negati = 20.7 : 1.0
bam = True Positi : Negati = 19.4 : 1.0
sick = True Negati : Positi = 18.6 : 1.0
sad = True Negati : Positi = 16.3 : 1.0
followed = True Negati : Positi = 15.2 : 1.0
arrive = True Positi : Negati = 13.1 : 1.0
community = True Positi : Negati = 12.8 : 1.0
None
Accuracy is defined as the percentage of tweets in the testing dataset for which the model was
correctly able to predict the sentiment. A 99.5% accuracy on the test set is pretty good.
In the table that shows the most informative features, every row in the output shows the ratio of
occurrence of a token in positive and negative tagged tweets in the training dataset. The first row
in the data signifies that in all tweets containing the token :(, the ratio of negative to positive
tweets was 2083.8 to 1. Interestingly, it seems that there was at least one tweet containing :( in the
positive dataset. You can see that the top two discriminating items in the text are the emoticons.
Further, words such as sad lead to negative sentiments, whereas welcome and arrive are associated
with positive sentiments.
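Beyond overall accuracy, it can help to see where the classifier errs. The sketch below is not part of the original lab; it builds a confusion matrix over the test set with nltk's ConfusionMatrix, and the names ref_labels and pred_labels are ours.

[ ]: from nltk.metrics import ConfusionMatrix

ref_labels = [label for _, label in test_data]
pred_labels = [classifier.classify(feats) for feats, _ in test_data]

# Rows are reference (true) labels, columns are predicted labels.
print(ConfusionMatrix(ref_labels, pred_labels))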
[34]: from nltk.tokenize import word_tokenize

custom_tweet = "I ordered just once from TerribleCo, they screwed up, never used the app again."

custom_tokens = remove_noise(word_tokenize(custom_tweet))

print(classifier.classify(dict([token, True] for token in custom_tokens)))

Negative

[35]: custom_tweet = 'Congrats #SportStar on your 7th best goal from last season winning goal of the year :) #Baller #Topbin #oneofmanyworldies'
custom_tokens = remove_noise(word_tokenize(custom_tweet))

print(classifier.classify(dict([token, True] for token in custom_tokens)))

Positive

9 8. CLEANING UP THE CODE


• All imports should be at the top of the file. Imports from the same library should be grouped
together in a single statement.
• All functions should be defined after the imports.
• All the statements in the file should be housed under an if __name__ == "__main__": condition. This
ensures that the statements are not executed if you import the functions of the file in
another file.
[ ]: from nltk.stem.wordnet import WordNetLemmatizer
from nltk.corpus import twitter_samples, stopwords
from nltk.tag import pos_tag
from nltk.tokenize import word_tokenize
from nltk import FreqDist, classify, NaiveBayesClassifier

import re, string, random

def remove_noise(tweet_tokens, stop_words = ()):

    cleaned_tokens = []

    for token, tag in pos_tag(tweet_tokens):
        token = re.sub('http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+#]|[!*\(\),]|'\
                       '(?:%[0-9a-fA-F][0-9a-fA-F]))+', '', token)
        token = re.sub("(@[A-Za-z0-9_]+)", "", token)

        if tag.startswith("NN"):
            pos = 'n'
        elif tag.startswith('VB'):
            pos = 'v'
        else:
            pos = 'a'

        lemmatizer = WordNetLemmatizer()
        token = lemmatizer.lemmatize(token, pos)

        if len(token) > 0 and token not in string.punctuation and token.lower() not in stop_words:
            cleaned_tokens.append(token.lower())
    return cleaned_tokens

def get_all_words(cleaned_tokens_list):
    for tokens in cleaned_tokens_list:
        for token in tokens:
            yield token

def get_tweets_for_model(cleaned_tokens_list):
    for tweet_tokens in cleaned_tokens_list:
        yield dict([token, True] for token in tweet_tokens)

if __name__ == "__main__":

    positive_tweets = twitter_samples.strings('positive_tweets.json')
    negative_tweets = twitter_samples.strings('negative_tweets.json')
    text = twitter_samples.strings('tweets.20150430-223406.json')
    tweet_tokens = twitter_samples.tokenized('positive_tweets.json')[0]

    stop_words = stopwords.words('english')

    positive_tweet_tokens = twitter_samples.tokenized('positive_tweets.json')
    negative_tweet_tokens = twitter_samples.tokenized('negative_tweets.json')

    positive_cleaned_tokens_list = []
    negative_cleaned_tokens_list = []

    for tokens in positive_tweet_tokens:
        positive_cleaned_tokens_list.append(remove_noise(tokens, stop_words))

    for tokens in negative_tweet_tokens:
        negative_cleaned_tokens_list.append(remove_noise(tokens, stop_words))

    all_pos_words = get_all_words(positive_cleaned_tokens_list)

    freq_dist_pos = FreqDist(all_pos_words)
    print(freq_dist_pos.most_common(10))

    positive_tokens_for_model = get_tweets_for_model(positive_cleaned_tokens_list)
    negative_tokens_for_model = get_tweets_for_model(negative_cleaned_tokens_list)

    positive_dataset = [(tweet_dict, "Positive")
                        for tweet_dict in positive_tokens_for_model]

    negative_dataset = [(tweet_dict, "Negative")
                        for tweet_dict in negative_tokens_for_model]

    dataset = positive_dataset + negative_dataset

    random.shuffle(dataset)

    train_data = dataset[:7000]
    test_data = dataset[7000:]

    classifier = NaiveBayesClassifier.train(train_data)

    print("Accuracy is:", classify.accuracy(classifier, test_data))

    print(classifier.show_most_informative_features(10))

    custom_tweet = "I ordered just once from TerribleCo, they screwed up, never used the app again."

    custom_tokens = remove_noise(word_tokenize(custom_tweet))

    print(custom_tweet, classifier.classify(dict([token, True] for token in custom_tokens)))

This project introduced us to a basic sentiment analysis model using the nltk library in Python 3.
First, we performed pre-processing on the tweets by tokenizing them, normalizing the words, and
removing noise. Next, we examined the most frequently occurring items in the data. Finally, we built a
model to associate tweets with a particular sentiment.
A supervised learning model is only as good as its training data. To further strengthen the model,
we could consider adding more categories like excitement and anger. In this tutorial, you have
only scratched the surface by building a rudimentary model.
