AI VIET NAM – COURSE 2024

Sentiment Analysis - Project

September 13, 2024

Part I: Introduction

In this exercise, we practice analyzing customer sentiment in movie review comments, using typical
approaches to the text classification problem. An example of the classification problem is shown in
the figure below: each unit of text is assigned one specific label from a predefined set of labels.

Figure 1: Text Classification.

Each unit of text can belong to a label set such as 'Technology', 'Sports', 'Business', ...

AI VIETNAM aivietnam.edu.vn

Sentiment analysis is a family of subproblems within text classification. Its goal is to analyze
customer comments about products and classify each one as positive, negative, or neutral.

Figure 2: Sentiment Analysis.

This project focuses on sentiment analysis on the IMDB movie review dataset.


Part II: Exercises

A. Programming
In this part, we implement and train two models, Decision Tree and Random Forest, to solve the
sentiment analysis problem, following the steps shown in the figure below:

Figure 3: Steps for training a classification model.

1. Download the dataset: Download the IMDB-Dataset.csv file here.

2. Read the dataset: Using the pandas library, we read the .csv file as follows:

    # Load dataset
    import pandas as pd

    df = pd.read_csv('./IMDB-Dataset.csv')

    # Remove duplicate rows
    df = df.drop_duplicates()
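To make the effect of `drop_duplicates()` concrete, the following minimal sketch reproduces the same two steps (read, then drop fully identical rows while keeping the first occurrence) using only the standard library on a tiny in-memory CSV. The sample rows are made up for illustration; the column names `review` and `sentiment` are the ones this project uses.

```python
import csv
import io

# Made-up rows for illustration; the real file is IMDB-Dataset.csv.
raw_csv = """review,sentiment
"A touching story, well acted.",positive
"Dull and far too long.",negative
"A touching story, well acted.",positive
"""

rows = list(csv.DictReader(io.StringIO(raw_csv)))

# Keep the first occurrence of each fully identical row.
seen = set()
unique_rows = []
for row in rows:
    key = (row["review"], row["sentiment"])
    if key not in seen:
        seen.add(key)
        unique_rows.append(row)

print(len(rows), "rows read,", len(unique_rows), "rows after removing duplicates")
```

Duplicated reviews would otherwise be counted twice in the label statistics and could end up in both the train and the test split.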

Next, we clean the data through steps such as removing HTML tags, punctuation, digits, emoji, and
so on.
    import re
    import string

    import contractions
    import nltk
    from bs4 import BeautifulSoup
    from nltk.corpus import stopwords
    from nltk.stem import WordNetLemmatizer

    nltk.download('stopwords')
    nltk.download('wordnet')

    stop = set(stopwords.words('english'))

    # Expand contractions (e.g. "don't" -> "do not")
    def expand_contractions(text):
        return contractions.fix(text)

    # Clean a single review
    def preprocess_text(text):
        wl = WordNetLemmatizer()

        soup = BeautifulSoup(text, "html.parser")  # remove HTML tags
        text = soup.get_text()
        text = expand_contractions(text)  # expand contractions
        emoji_clean = re.compile(
            "["
            u"\U0001F600-\U0001F64F"  # emoticons
            u"\U0001F300-\U0001F5FF"  # symbols & pictographs
            u"\U0001F680-\U0001F6FF"  # transport & map symbols
            u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
            u"\U00002702-\U000027B0"
            u"\U000024C2-\U0001F251"
            "]+",
            flags=re.UNICODE,
        )
        text = emoji_clean.sub(r'', text)
        # Remove URLs first; adding spaces after full stops would otherwise split them
        text = re.sub(r'http\S+', '', text)
        text = re.sub(r'\.(?=\S)', '. ', text)  # add a space after a full stop
        # Remove punctuation character by character and lowercase the text
        text = "".join([
            char.lower() for char in text if char not in string.punctuation
        ])
        # Drop stop words and non-alphabetic tokens, then lemmatize
        text = " ".join([
            wl.lemmatize(word) for word in text.split()
            if word not in stop and word.isalpha()
        ])
        return text

    df['review'] = df['review'].apply(preprocess_text)
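The order of the cleaning steps matters, so it can help to trace one review through a reduced version of the pipeline. The sketch below is a simplification that uses only the standard library: it strips tags with a regular expression instead of BeautifulSoup, skips contraction expansion and lemmatization, and uses a tiny hypothetical stop-word set (`TOY_STOPWORDS`) in place of the NLTK list. The sample review is made up.

```python
import re
import string

# Hypothetical, tiny stop-word set for illustration only (NLTK's list is much longer).
TOY_STOPWORDS = {"the", "a", "is", "it", "this", "was", "and", "at"}

def simple_clean(text: str) -> str:
    text = re.sub(r"<[^>]+>", " ", text)    # drop HTML tags
    text = re.sub(r"http\S+", "", text)     # drop URLs
    text = re.sub(r"\.(?=\S)", ". ", text)  # add a space after a full stop
    # Remove punctuation character by character and lowercase.
    text = "".join(ch.lower() for ch in text if ch not in string.punctuation)
    # Keep alphabetic tokens that are not stop words.
    tokens = [t for t in text.split() if t.isalpha() and t not in TOY_STOPWORDS]
    return " ".join(tokens)

sample = "This movie was GREAT.<br />Loved it! See http://example.com 10/10"
cleaned = simple_clean(sample)
print(cleaned)
```

Note that the rating "10/10" disappears entirely: removing punctuation turns it into "1010", which the `isalpha()` filter then drops. Whether discarding such numeric cues is desirable for sentiment analysis is a design choice worth keeping in mind.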

3. Analyze the data: Count the number of samples per label in the dataset:

    import numpy as np
    import seaborn as sns
    import matplotlib.pyplot as plt

    # Build the autopct label: percentage plus absolute count
    def func(pct, allvalues):
        # Round rather than truncate so floating-point error cannot drop a count by one
        absolute = int(round(pct / 100. * np.sum(allvalues)))
        return "{:.1f}%\n({:d})".format(pct, absolute)

    freq_pos = len(df[df['sentiment'] == 'positive'])
    freq_neg = len(df[df['sentiment'] == 'negative'])

    data = [freq_pos, freq_neg]
    labels = ['Positive', 'Negative']

    # Create pie chart
    pie, ax = plt.subplots(figsize=[11, 7])
    plt.pie(
        x=data,
        autopct=lambda pct: func(pct, data),
        explode=[0.0025] * 2,
        pctdistance=0.5,
        colors=[sns.color_palette()[0], 'tab:red'],
        textprops={'fontsize': 16},
    )
    # plt.title('Frequencies of sentiment labels', fontsize=14, fontweight='bold')
    plt.legend(labels, loc="best", prop={'size': 14})
    pie.savefig("PieChart.png")
    plt.show()
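The `autopct` callback receives only the slice percentage, so the absolute count has to be reconstructed from it. A minimal sketch of that conversion, independent of matplotlib, is shown here. The counts are made up and are not the real IMDB statistics, and rounding (rather than truncating with `int()` alone) is an assumption made in this sketch for robustness against floating-point error.

```python
def autopct_label(pct, all_values):
    """Format a pie-slice label as a percentage plus the absolute count."""
    # Assumption: round instead of truncating, so float error cannot drop a count by one.
    absolute = int(round(pct / 100.0 * sum(all_values)))
    return "{:.1f}%\n({:d})".format(pct, absolute)

counts = [30, 20]  # made-up class counts, not the real IMDB statistics
total = sum(counts)
slice_labels = [autopct_label(100.0 * c / total, counts) for c in counts]
for label in slice_labels:
    print(label)
```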


Result:

Figure 4: Number of samples per label in the IMDB dataset.

Next, examine the length of the samples (in words) for each class.

    # Number of words in each cleaned review
    words_len = df['review'].str.split().map(lambda x: len(x))
    df_temp = df.copy()
    df_temp['words length'] = words_len

    hist_positive = sns.displot(
        data=df_temp[df_temp['sentiment'] == 'positive'],
        x="words length", hue="sentiment", kde=True, height=7, aspect=1.1, legend=False
    ).set(title='Words in positive reviews')
    plt.show()

    hist_negative = sns.displot(
        data=df_temp[df_temp['sentiment'] == 'negative'],
        x="words length", hue="sentiment", kde=True, height=7, aspect=1.1, legend=False,
        palette=['red']
    ).set(title='Words in negative reviews')
    plt.show()

    plt.figure(figsize=(7, 7.1))
    kernel_distribution_number_words_plot = sns.kdeplot(
        data=df_temp, x="words length", hue="sentiment", fill=True,
        palette=[sns.color_palette()[0], 'red']
    ).set(title='Words in reviews')
    plt.legend(title='Sentiment', labels=['negative', 'positive'])
    plt.show()
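Alongside the plots, a few summary numbers per class are often enough to tell whether one class has systematically longer reviews. The sketch below computes them with the standard library on made-up, already cleaned reviews; it is an illustration of the statistic being plotted, not output from the IMDB data.

```python
from statistics import mean, median

# Made-up cleaned reviews for illustration only.
samples = [
    ("wonderful film great cast", "positive"),
    ("loved every minute", "positive"),
    ("boring plot weak acting poor script", "negative"),
    ("waste time", "negative"),
]

# Group word counts by sentiment label.
lengths_by_label = {}
for review, sentiment in samples:
    lengths_by_label.setdefault(sentiment, []).append(len(review.split()))

for sentiment, lengths in lengths_by_label.items():
    print(sentiment, "min:", min(lengths), "mean:", mean(lengths),
          "median:", median(lengths), "max:", max(lengths))
```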


Result:

Figure 5: Distribution of review length (in words) per class in the IMDB dataset.

4. Split the data into train and test sets:

    from sklearn.model_selection import train_test_split
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.preprocessing import LabelEncoder

    # Encode the string labels as integers
    label_encode = LabelEncoder()
    y_data = label_encode.fit_transform(df['sentiment'])

    # Input features: the cleaned review text
    # (x_data is not defined in the original listing; df['review'] is assumed)
    x_data = df['review']

    x_train, x_test, y_train, y_test = train_test_split(
        x_data, y_data, test_size=0.2, random_state=42
    )
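What these two steps do can be sketched without scikit-learn. `LabelEncoder` assigns integer ids to the classes in sorted order, so 'negative' becomes 0 and 'positive' becomes 1. The hold-out split below is a simplified stand-in (shuffle indices with a fixed seed, then cut off 20%) and is not scikit-learn's exact algorithm; the sample data is made up.

```python
import random

sentiments = ["positive", "negative", "negative", "positive", "negative",
              "positive", "positive", "negative", "positive", "negative"]
reviews = ["review {}".format(i) for i in range(len(sentiments))]

# Label encoding: sorted unique classes are mapped to 0, 1, ...
classes = sorted(set(sentiments))
class_to_id = {name: idx for idx, name in enumerate(classes)}
y = [class_to_id[s] for s in sentiments]

# Shuffled hold-out split with a fixed seed (simplified, not sklearn's algorithm).
indices = list(range(len(reviews)))
random.Random(42).shuffle(indices)
n_test = int(round(0.2 * len(indices)))
test_idx, train_idx = indices[:n_test], indices[n_test:]

x_train = [reviews[i] for i in train_idx]
x_test = [reviews[i] for i in test_idx]
y_train = [y[i] for i in train_idx]
y_test = [y[i] for i in test_idx]

print(class_to_id)
print(len(x_train), "train samples,", len(x_test), "test samples")
```

Fixing the seed plays the same role as `random_state=42`: the split is reproducible, so accuracy figures from different runs are comparable.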

5. Represent the text as vectors:

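    # Learn the vocabulary and IDF weights from the training set only
    # (TF-IDF is unsupervised, so the labels are not needed here)
    tfidf_vectorizer = TfidfVectorizer(max_features=10000)
    tfidf_vectorizer.fit(x_train)

    # Apply the same fitted transformation to both splits
    x_train_encoded = tfidf_vectorizer.transform(x_train)
    x_test_encoded = tfidf_vectorizer.transform(x_test)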
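To see what the fitted vectorizer computes, here is a minimal pure-Python sketch of TF-IDF. It assumes the formulation scikit-learn uses by default (raw term counts, smoothed idf = ln((1 + n) / (1 + df)) + 1, then L2 normalization), but it tokenizes by whitespace only and does not model `max_features`, so it is an illustration rather than a re-implementation. The documents are made up.

```python
import math
from collections import Counter

train_docs = ["great movie great cast", "boring movie", "great story"]
test_docs = ["great boring sequel"]  # "sequel" never appears in training

# Fit: build the vocabulary and smoothed IDF from the training documents only.
n_docs = len(train_docs)
doc_freq = Counter(term for doc in train_docs for term in set(doc.split()))
vocabulary = sorted(doc_freq)
idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocabulary}

def transform(doc):
    """Return the L2-normalized TF-IDF vector of doc over the fitted vocabulary."""
    counts = Counter(t for t in doc.split() if t in idf)  # unseen terms are dropped
    weights = [counts[t] * idf[t] for t in vocabulary]
    norm = math.sqrt(sum(w * w for w in weights))
    return [w / norm if norm else 0.0 for w in weights]

train_vectors = [transform(d) for d in train_docs]
test_vector = transform(test_docs[0])
print(vocabulary)
print([round(w, 3) for w in test_vector])
```

Fitting on the training split alone means the test vector has no dimension for "sequel", which mirrors what a deployed model would face with unseen words. It also shows why rarer terms such as "boring" receive a larger weight than frequent ones such as "great".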

6. Train and evaluate the models:

We train the model on the training set. To train the Decision Tree model, use
DecisionTreeClassifier():
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import accuracy_score

    dt_classifier = DecisionTreeClassifier(
        criterion='entropy',
        random_state=42
    )
    dt_classifier.fit(x_train_encoded, y_train)
    y_pred = dt_classifier.predict(x_test_encoded)
    # accuracy_score expects the true labels first, then the predictions
    accuracy_score(y_test, y_pred)
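With `criterion='entropy'`, candidate splits are scored by how much they reduce label entropy (information gain). The following sketch computes both quantities for one hypothetical split on a made-up node. It illustrates the criterion only and is not how scikit-learn searches for splits internally.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(parent, left, right):
    """Entropy reduction obtained by splitting parent into left and right."""
    n = len(parent)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - weighted

# Made-up node: 1 = positive, 0 = negative.
parent = [1, 1, 1, 0, 0, 0]
# Hypothetical split, e.g. on "TF-IDF weight of some word > threshold".
left, right = [1, 1, 1, 0], [0, 0]

print(round(entropy(parent), 3))
print(round(information_gain(parent, left, right), 3))
```

A perfectly balanced node has the maximum entropy of 1 bit, and a pure node such as `right` has entropy 0, so a split that isolates pure groups yields a high gain.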


To train the Random Forest model, use RandomForestClassifier():

    rf_classifier = RandomForestClassifier(
        random_state=42
    )
    rf_classifier.fit(x_train_encoded, y_train)
    y_pred = rf_classifier.predict(x_test_encoded)
    # accuracy_score expects the true labels first, then the predictions
    accuracy_score(y_test, y_pred)
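A random forest combines the outputs of many trees, and accuracy is simply the fraction of test samples predicted correctly. The sketch below shows both ideas with hard majority voting over made-up predictions from three hypothetical trees. This is a simplification: scikit-learn's implementation averages the trees' predicted class probabilities instead of counting votes.

```python
from collections import Counter

# Made-up predictions of three hypothetical trees for five test reviews
# (1 = positive, 0 = negative).
tree_predictions = [
    [1, 0, 1, 1, 0],
    [1, 1, 1, 0, 0],
    [0, 0, 1, 1, 0],
]
y_true = [1, 0, 1, 0, 0]

def majority_vote(votes):
    """Return the most frequent label among the trees' votes for one sample."""
    return Counter(votes).most_common(1)[0][0]

# zip(*...) regroups the predictions per sample instead of per tree.
y_pred = [majority_vote(votes) for votes in zip(*tree_predictions)]

def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label equals the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

print(y_pred, accuracy(y_true, y_pred))
```

Each tree here is wrong on at least one sample, but the trees do not all err on the same sample, so the vote is wrong on only one. This is the intuition behind expecting a forest to outperform a single decision tree.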
