AminaRahmanK DL Lab5
[22]: print(pos_tweets[:10])
print()
print(neg_tweets[:10])
['hopeless for tmr :(', "Everything in the kids section of IKEA is so cute. Shame I'm nearly 19 in
2 months :(", '@Hegelbon That heart sliding into the waste basket. :(', '“@ketchBurning: I hate
Japanese call him "bani" :( :(”\n\nMe too', 'Dang starting next week I have "work" :(', "oh god,
my babies' faces :( https://t.co/9fcwGvaki0", '@RileyMcDonough make me smile :((',
'@f0ggstar @stuartthull work neighbour on motors. Asked why and he said hates the updates on
search :( http://t.co/XvmTUikWln', 'why?:("@tahuodyy: sialan:( https://t.co/Hv1i0xcrL2"',
'Athabasca glacier was there in #1948 :-( #athabasca #glacier #jasper #jaspernationalpark
#alberta #explorealberta #… http://t.co/dZZdqmf7Cz']
[4]: nltk.download('punkt')
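Besides punkt, the later cells rely on a few other NLTK resources: the POS tagger, WordNet for lemmatization, the English stop word list, and the twitter_samples corpus itself. If they are not already installed, a one-time download along these lines should cover everything used in this lab:

import nltk

# Downloads assumed by the rest of this lab (resource names as used by
# NLTK's downloader; any that are already present are simply skipped):
nltk.download('twitter_samples')              # positive/negative tweet corpus
nltk.download('punkt')                        # tokenizer models
nltk.download('wordnet')                      # needed by WordNetLemmatizer
nltk.download('averaged_perceptron_tagger')   # needed by pos_tag
nltk.download('stopwords')                    # English stop word list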
[6]: print(tweet_tokens[0])
print()
print(tweet_tokens[0][0])
#FollowFriday
from nltk.tag import pos_tag
from nltk.stem.wordnet import WordNetLemmatizer

def lemmatize_sentence(tokens):
    lemmatizer = WordNetLemmatizer()
    lemmatized_sentence = []
    for word, tag in pos_tag(tokens):
        # Map the Penn Treebank tag to the WordNet POS code the lemmatizer expects
        if tag.startswith('NN'):
            pos = 'n'
        elif tag.startswith('VB'):
            pos = 'v'
        else:
            pos = 'a'
        lemmatized_sentence.append(lemmatizer.lemmatize(word, pos))
    return lemmatized_sentence

print(lemmatize_sentence(tweet_tokens[0]))
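To see why only the tag prefixes are checked, it helps to look at what pos_tag returns; a quick sketch (the example sentence here is just an illustration, not part of the tweet data):

# pos_tag returns (token, Penn Treebank tag) pairs, e.g. roughly:
#   [('The', 'DT'), ('cats', 'NNS'), ('were', 'VBD'), ('running', 'VBG')]
print(pos_tag(['The', 'cats', 'were', 'running']))

# 'NNS' starts with 'NN' -> noun, 'VBG' starts with 'VB' -> verb, so the
# lemmatizer reduces 'cats' -> 'cat' and 'running' -> 'run'.
print(lemmatize_sentence(['The', 'cats', 'were', 'running']))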
[11]: import re, string

def remove_noise(tweet_tokens, stop_words = ()):
    cleaned_tokens = []
    for token, tag in pos_tag(tweet_tokens):
        # Strip hyperlinks and @-mentions (explained below)
        token = re.sub(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+#]|[!*\(\),]|'
                       r'(?:%[0-9a-fA-F][0-9a-fA-F]))+', '', token)
        token = re.sub(r'(@[A-Za-z0-9_]+)', '', token)

        if tag.startswith('NN'):
            pos = 'n'
        elif tag.startswith('VB'):
            pos = 'v'
        else:
            pos = 'a'

        lemmatizer = WordNetLemmatizer()
        token = lemmatizer.lemmatize(token, pos)

        if len(token) > 0 and token not in string.punctuation and token.lower() not in stop_words:
            cleaned_tokens.append(token.lower())
    return cleaned_tokens
This code creates a remove_noise() function that removes noise and incorporates the normalization and lemmatization mentioned in the previous section. The function takes two arguments: the tweet tokens and a tuple of stop words.

The code then uses a loop to remove the noise from the dataset. To remove hyperlinks, it first searches for a substring that matches a URL starting with http:// or https://, followed by letters, numbers, or special characters. Once a pattern is matched, the .sub() method replaces it with an empty string, ''.

Similarly, to remove @ mentions, the code substitutes the relevant part of the text using regular expressions: it uses the re library to search for @ symbols followed by numbers, letters, or _, and replaces them with an empty string.

Finally, punctuation is removed using the string library.
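As a quick check of the two substitutions, the following sketch runs the URL and mention patterns used in remove_noise above on tokens taken from the sample tweets printed earlier:

import re

url_pattern = (r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+#]|[!*\(\),]|'
               r'(?:%[0-9a-fA-F][0-9a-fA-F]))+')

# A bare link token disappears entirely
print(re.sub(url_pattern, '', 'http://t.co/dZZdqmf7Cz'))                # -> ''

# An @-mention is dropped, the rest of the text is kept
print(re.sub(r'(@[A-Za-z0-9_]+)', '', '@RileyMcDonough make me smile'))  # -> ' make me smile'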
[27]: positive_cleaned_tokens_list = []
negative_cleaned_tokens_list = []

for tokens in pos_tweets:
    positive_cleaned_tokens_list.append(remove_noise(tokens, stop_words))

for tokens in neg_tweets:
    negative_cleaned_tokens_list.append(remove_noise(tokens, stop_words))
To compare the original tweet with its cleaned tokens for a sample, add the following code to the file to compare both versions of the 500th tweet in the list:
[28]: print(pos_tweets[500])
print(positive_cleaned_tokens_list[500])
all_pos_words = get_all_words(positive_cleaned_tokens_list)
freq_dist_pos = FreqDist(all_pos_words)
print(freq_dist_pos.most_common(10))
[(':)', 3691), (':-)', 701), (':d', 658), ('thanks', 388), ('follow', 357), ('love', 333), ('…', 290),
('good', 283), ('get', 263), ('thank', 253)]
The .most_common() method lists the words which occur most frequently in the data. Save and
close the file after making these changes.
positive_tokens_for_model = get_tweets_for_model(positive_cleaned_tokens_list)
negative_tokens_for_model = get_tweets_for_model(negative_cleaned_tokens_list)
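get_tweets_for_model (defined in the script at the end of this lab) converts each list of cleaned tokens into the dictionary-of-features format that NaiveBayesClassifier expects; a small sketch of what one yielded item looks like:

# Each cleaned tweet becomes a dict mapping every token to True, which is
# the feature format NLTK's NaiveBayesClassifier consumes.
sample = next(get_tweets_for_model([['thanks', 'follow', ':)']]))
print(sample)   # {'thanks': True, 'follow': True, ':)': True}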
7.0.1 Splitting the Dataset for Training and Testing the Model
Next, prepare the data for training the NaiveBayesClassifier class.
random.shuffle(dataset)
train_data = dataset[:7000]
test_data = dataset[7000:]
This code attaches a Positive or Negative label to each tweet. It then creates a dataset by joining the positive and negative tweets.

By default, the data contains all positive tweets followed by all negative tweets in sequence. When training the model, you should provide a sample of your data that does not contain any bias. To avoid bias, we've added code to randomly arrange the data using the .shuffle() method of random.

Finally, the code splits the shuffled data into a ratio of 70:30 for training and testing, respectively. Since the number of tweets is 10000, you can use the first 7000 tweets from the shuffled dataset for training the model and the final 3000 for testing the model.
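The labelling and joining step described above is not spelled out in the cell, so here is a minimal sketch of the full preparation, consistent with that description and with the variable names used elsewhere in this lab:

import random

# Pair each feature dictionary with its label, then join the two lists
positive_dataset = [(tweet_dict, "Positive")
                    for tweet_dict in positive_tokens_for_model]
negative_dataset = [(tweet_dict, "Negative")
                    for tweet_dict in negative_tokens_for_model]
dataset = positive_dataset + negative_dataset

# Shuffle away the positive-then-negative ordering, then split 70:30
random.shuffle(dataset)
train_data = dataset[:7000]
test_data = dataset[7000:]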
print(classifier.show_most_informative_features(10))
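The held-out 30% can be used to sanity-check the trained classifier, for example with nltk.classify.accuracy:

from nltk import classify

# Fraction of test tweets whose predicted label matches the true label
print("Accuracy:", classify.accuracy(classifier, test_data))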
custom_tweet = "I ordered just once from TerribleCo, they screwed up, never␣ ,→used the
app again."
custom_tokens = remove_noise(word_tokenize(custom_tweet))
Negative
[35]: custom_tweet = 'Congrats #SportStar on your 7th best goal from last season winning goal of the year :) #Baller #Topbin #oneofmanyworldies'

custom_tokens = remove_noise(word_tokenize(custom_tweet))
For reference, the script assembled from the previous sections is reproduced below.

from nltk.stem.wordnet import WordNetLemmatizer
from nltk.corpus import twitter_samples, stopwords
from nltk.tag import pos_tag
from nltk.tokenize import word_tokenize
from nltk import FreqDist, NaiveBayesClassifier
import re, string, random

def remove_noise(tweet_tokens, stop_words = ()):
    cleaned_tokens = []
    for token, tag in pos_tag(tweet_tokens):
        token = re.sub(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+#]|[!*\(\),]|'
                       r'(?:%[0-9a-fA-F][0-9a-fA-F]))+', '', token)
        token = re.sub(r'(@[A-Za-z0-9_]+)', '', token)

        if tag.startswith("NN"):
            pos = 'n'
        elif tag.startswith('VB'):
            pos = 'v'
        else:
            pos = 'a'

        lemmatizer = WordNetLemmatizer()
        token = lemmatizer.lemmatize(token, pos)

        if len(token) > 0 and token not in string.punctuation and token.lower() not in stop_words:
            cleaned_tokens.append(token.lower())
    return cleaned_tokens
def get_all_words(cleaned_tokens_list):
    for tokens in cleaned_tokens_list:
        for token in tokens:
            yield token

def get_tweets_for_model(cleaned_tokens_list):
    for tweet_tokens in cleaned_tokens_list:
        yield dict([token, True] for token in tweet_tokens)
if __name__ == "__main__":

    positive_tweets = twitter_samples.strings('positive_tweets.json')
    negative_tweets = twitter_samples.strings('negative_tweets.json')
    text = twitter_samples.strings('tweets.20150430-223406.json')
    tweet_tokens = twitter_samples.tokenized('positive_tweets.json')[0]

    stop_words = stopwords.words('english')

    positive_tweet_tokens = twitter_samples.tokenized('positive_tweets.json')
    negative_tweet_tokens = twitter_samples.tokenized('negative_tweets.json')

    positive_cleaned_tokens_list = []
    negative_cleaned_tokens_list = []

    for tokens in positive_tweet_tokens:
        positive_cleaned_tokens_list.append(remove_noise(tokens, stop_words))

    for tokens in negative_tweet_tokens:
        negative_cleaned_tokens_list.append(remove_noise(tokens, stop_words))

    all_pos_words = get_all_words(positive_cleaned_tokens_list)

    freq_dist_pos = FreqDist(all_pos_words)
    print(freq_dist_pos.most_common(10))

    positive_tokens_for_model = get_tweets_for_model(positive_cleaned_tokens_list)
    negative_tokens_for_model = get_tweets_for_model(negative_cleaned_tokens_list)

    positive_dataset = [(tweet_dict, "Positive")
                        for tweet_dict in positive_tokens_for_model]

    negative_dataset = [(tweet_dict, "Negative")
                        for tweet_dict in negative_tokens_for_model]

    dataset = positive_dataset + negative_dataset

    random.shuffle(dataset)

    train_data = dataset[:7000]
    test_data = dataset[7000:]
    classifier = NaiveBayesClassifier.train(train_data)

    print(classifier.show_most_informative_features(10))

    custom_tweet = "I ordered just once from TerribleCo, they screwed up, never used the app again."

    custom_tokens = remove_noise(word_tokenize(custom_tweet))

    print(classifier.classify(dict([token, True] for token in custom_tokens)))
This project introduced us to a basic sentiment analysis model using the nltk library in Python 3. First, we performed pre-processing on tweets by tokenizing them, normalizing the words, and removing noise. Next, we visualized frequently occurring items in the data. Finally, we built a model to associate tweets with a particular sentiment.

A supervised learning model is only as good as its training data. To further strengthen the model, we could consider adding more categories, such as excitement and anger. This lab has only scratched the surface by building a rudimentary model.