Topic Extraction From Online Reviews For Classification and Recommendation
CLARITY: Centre for Sensor Web Technologies
∗ This work is supported by Science Foundation Ireland under grant 07/CE/I1147.

Abstract

Automatically identifying informative reviews is increasingly important given the rapid growth of user generated reviews on sites like Amazon and TripAdvisor. In this paper, we describe and evaluate techniques for identifying and recommending helpful product reviews using a combination of review features, including topical and sentiment information, mined from a review corpus.

1 Introduction
The web is awash with user generated reviews, from the contemplative literary critiques of GoodReads to the flame wars that can sometimes erupt around hotels on TripAdvisor. User generated reviews are now an important part of how we inform opinions when we make decisions to travel and shop. The availability of reviews helps shoppers to choose [Hu et al., 2008] and increases the likelihood that they will make a buying decision [Zhu and Zhang, 2010]. But the helpfulness
of user reviews depends on their quality (detail, objectivity,
readability, etc.). As the volume of online reviews grows it
is becoming increasingly important to provide users with the
tools to filter hundreds of opinions about a given product. Sorting reviews by helpfulness is one approach, but it takes time to accumulate helpfulness feedback, and more recent reviews are often disadvantaged until they have accumulated a minimum amount of feedback. One way to address this is to develop techniques for automatically assessing the helpfulness of reviews. This has been attempted in the past with varying degrees of success [O'Mahony et al., 2009] by learning classifiers using review features based on the readability of the text, the reputation of the reviewer, the star rating of the review, and various content features based on the review terms.
In this paper we build on this research in a number of ways. We describe a technique to extract interesting topics from reviews and assign sentiment labels to these topics. Figure 1 provides a simple example based on a review of a camera. In this case the reviewer has mentioned certain topics, such as build quality and action shots, in a positive or negative context. Our aim is to automatically mine these types of topics from the raw review text and to automatically assign sentiment labels to the relevant topics and review elements. We describe and evaluate how such features can be used to predict review quality (helpfulness). Further, we show how this can be used as the basis of a review recommendation system to automatically recommend high quality reviews even in the absence of any explicit helpfulness feedback.

[Figure 1: A product review for a digital camera with topics marked as bold, underlined text and sentiment highlighted as either a green (positive) or red (negative) background. The example review reads: "The Fuji X100 is a great camera. It looks beautiful and takes great quality images. I have found the battery life to be superb during normal use. I only seem to charge after well over 1000 shots. The build quality is excellent and it is a joy to hold. The camera is not without its quirks however and it does take some getting used to. The auto focus can be slow to catch, for example. So it's not so good for action shots but it does take great portraits and its night shooting is excellent."]

2 Related Work

Recent research highlights how online product reviews can influence the purchasing behaviour of users; see [Hu et al., 2008; Zhu and Zhang, 2010]. The effect of consumer reviews on book sales on Amazon.com and Barnesandnoble.com [Chevalier and Mayzlin, 2006] shows that the relative sales of books on a site correlate closely with positive review sentiment; although, interestingly, there was insufficient evidence to conclude that retailers themselves benefit from making product reviews available to consumers; see also the work of [Dhar and Chang, 2009] and [Dellarocas et al., 2007] for music and movie sales, respectively. But as review volume has grown, retailers need to develop ways to help users find high quality reviews for products of interest and to avoid malicious or biased reviews. This has led to a body of research focused on classifying or predicting review helpfulness and also research on detecting so-called spam reviews.
A classical review classification approach, proposed by [Kim et al., 2006], considered features relating to the ratings, structural, syntactic, and semantic properties of reviews
to find ratings and review length among the most discriminating. Reviewer expertise was found to be a useful predictor of review helpfulness by [Liu et al., 2008], confirming, in this case, the intuition that people interested in a certain genre of movies are likely to pen high quality reviews for similar genre movies. Review timeliness was also found to be important since review helpfulness declined as time went by. Furthermore, opinion sentiment has been mined from user reviews to predict ratings and helpfulness in services such as TripAdvisor by the likes of [Baccianella et al., 2009; Hsu et al., 2009; O'Mahony et al., 2009; O'Mahony and Smyth, 2009].

Just as it is useful to automate the filtering of helpful reviews, it is also important to weed out malicious or biased reviews. These reviews can be well written and informative and so appear to be helpful. However, they often adopt a biased perspective that is designed to help or hinder sales of the target product [Lim et al., 2010]. [Li et al., 2011] describe a machine learning approach to spam detection that is enhanced by information about the spammer's identity as part of a two-tier co-learning approach. On a related topic, [O'Callaghan et al., 2012] use network analysis techniques to identify recurring spam in user generated comments associated with YouTube videos by identifying discriminating comment motifs that are indicative of spambots.

In this paper we extend the related work in this area by considering novel review classification features. We describe techniques for mining topical and sentiment features from user generated reviews and demonstrate their ability to boost classification accuracy.

3 Topic Extraction and Sentiment Analysis

For the purpose of this work our focus is on mining topics from user-generated product reviews and assigning sentiment to these topics on a per-review basis. Before we describe how this topical and sentiment information can be used as novel classification features, we will outline how we automatically extract topics and assign sentiment as per Figure 2.

[Figure 2: System architecture for extracting topics and assigning sentiment for user generated reviews. The pipeline applies topic extraction (part-of-speech tagging, shallow NLP, bi-gram and noun extraction, thresholding and ranking) followed by sentiment analysis (opinion pattern mining, sentiment matching and assignment), producing (Ri, Sj, Tk, +/-/=) tuples.]

3.1 Topic Extraction

We consider two basic types of topics — bi-grams and single nouns — which are extracted using a combination of shallow NLP and statistical methods, primarily by combining ideas from [Hu and Liu, 2004a] and [Justeson and Katz, 1995]. To produce a set of bi-gram topics we extract all bi-grams from the global review set which conform to one of two basic part-of-speech co-location patterns: (1) an adjective followed by a noun (AN), such as wide angle; and (2) a noun followed by a noun (NN), such as video mode. These are candidate topics that need to be filtered to avoid including ANs that are actually opinionated single-noun topics; for example, excellent lens is a single-noun topic (lens) and not a bi-gram topic. To do this we exclude bi-grams whose adjective is found to be a sentiment word (e.g. excellent, good, great, lovely, terrible, horrible, etc.) using the sentiment lexicon proposed in [Hu and Liu, 2004b].

To identify the single-noun topics we extract a candidate set of (non stop-word) nouns from the global review set. Often these single-noun candidates will not make for good topics; for example, they might include words such as family or day or vacation. [Qiu et al., 2009] proposed a solution for validating such topics by eliminating those that are rarely associated with opinionated words. The intuition is that nouns that frequently occur in reviews and that are frequently associated with sentiment-rich, opinion-laden words are likely to be product topics that the reviewer is writing about, and therefore good topics. Thus, for each candidate single noun, we calculate how frequently it appears with nearby words from a list of sentiment words (again, as above, we use Hu and Liu's sentiment lexicon), keeping the single noun only if this frequency is greater than some threshold (in this case 70%). The result is a set of bi-gram and single-noun topics which we further filter based on their frequency of occurrence in the review set, keeping only those topics (T1, ..., Tm) that occur in at least k reviews out of the total number of n reviews; in this case, for bi-gram topics we set k_bg = n/20 and for single-noun topics we set k_sn = 10 × k_bg.
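To make the extraction step concrete, the following Python sketch gathers candidate bi-gram and single-noun topics from POS-tagged review sentences. It assumes the sentences have already been tagged with Penn Treebank tags and that the Hu and Liu lexicon is available as a set of lowercase words; the function name, the 5-word window and the data layout are illustrative assumptions rather than the implementation used in this work.

    # Sketch of the candidate topic extraction in Section 3.1 (illustrative only).
    from collections import Counter

    def extract_candidate_topics(tagged_sentences, sentiment_words,
                                 stop_words, window=5, noun_ratio=0.7):
        """tagged_sentences: list of sentences, each a list of (word, POS-tag) pairs."""
        bigrams, noun_counts, noun_hits = Counter(), Counter(), Counter()
        for sent in tagged_sentences:
            for i, (word, tag) in enumerate(sent):
                # Bi-gram candidates: AN (adjective-noun) and NN (noun-noun) patterns,
                # excluding AN pairs whose adjective is itself a sentiment word.
                if i + 1 < len(sent):
                    nxt_word, nxt_tag = sent[i + 1]
                    if nxt_tag.startswith('NN') and (
                            tag.startswith('NN') or
                            (tag.startswith('JJ') and word.lower() not in sentiment_words)):
                        bigrams[(word.lower(), nxt_word.lower())] += 1
                # Single-noun candidates, validated by the presence of a nearby
                # sentiment word (the 70% threshold of Section 3.1).
                if tag.startswith('NN') and word.lower() not in stop_words:
                    noun_counts[word.lower()] += 1
                    nearby = sent[max(0, i - window):i + window + 1]
                    if any(w.lower() in sentiment_words for w, _ in nearby):
                        noun_hits[word.lower()] += 1
        single_nouns = {n for n, c in noun_counts.items()
                        if noun_hits[n] / c >= noun_ratio}
        # A further frequency filter (k_bg = n/20 reviews for bi-grams and
        # k_sn = 10 * k_bg for single nouns) would be applied in a second pass.
        return bigrams, single_nouns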
3.2 Sentiment Analysis

To determine the sentiment of the topics in the product topic set we use a method similar to the opinion pattern mining technique proposed by [Moghaddam and Ester, 2010] for extracting opinions from unstructured product reviews. Once again we use the sentiment lexicon from [Hu and Liu, 2004b] as the basis for this analysis. For a given topic Ti, and corresponding review sentence Sj from review Rk (that is, the sentence in Rk that includes Ti), we determine whether there are any sentiment words in Sj. If there are none then this topic is marked as neutral from a sentiment perspective. If there are sentiment words (w1, w2, ...) then we identify the word (wmin) which has the minimum word-distance to Ti.

Next we determine the part-of-speech tags for wmin, Ti, and any words that occur between wmin and Ti. This POS sequence corresponds to an opinion pattern. For example, in the case of the bi-gram topic noise reduction and the review sentence "...this camera has great noise reduction...", wmin is the word "great", which corresponds to the opinion pattern JJ-TOPIC as per [Moghaddam and Ester, 2010].
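As an illustration of this step, the sketch below locates wmin for a topic occurrence in a tagged sentence and builds the corresponding opinion pattern; the indexing convention and names are assumptions made for the example, not the authors' code.

    # Sketch of opinion-pattern construction for one topic mention (illustrative only).
    def opinion_pattern(tagged_sentence, topic_index, sentiment_words):
        """tagged_sentence: list of (word, POS-tag); topic_index: position of the topic.
        Returns (index of wmin, opinion pattern), or (None, None) when the sentence
        contains no sentiment word, in which case the topic is treated as neutral."""
        candidates = [i for i, (w, _) in enumerate(tagged_sentence)
                      if w.lower() in sentiment_words]
        if not candidates:
            return None, None
        wmin = min(candidates, key=lambda i: abs(i - topic_index))
        if wmin < topic_index:
            between = [tag for _, tag in tagged_sentence[wmin:topic_index]]
            pattern = '-'.join(between + ['TOPIC'])   # e.g. "JJ-TOPIC" for "great noise reduction"
        else:
            between = [tag for _, tag in tagged_sentence[topic_index + 1:wmin + 1]]
            pattern = '-'.join(['TOPIC'] + between)
        return wmin, pattern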
Once an entire pass of all topics has been completed we can compute the frequency of all opinion patterns that have been recorded. A pattern is deemed to be valid (from the perspective of our ability to assign sentiment) if it occurs more than the average number of occurrences over all patterns [Moghaddam and Ester, 2010]. For valid patterns we assign sentiment based on the sentiment of wmin and subject to whether Sj contains any negation terms within a 4-word-distance of wmin. If there are no such negation terms then the sentiment assigned to Ti in Sj is that of the sentiment word in the sentiment lexicon. If there is a negation word then this sentiment is reversed. If an opinion pattern is deemed not to be valid (based on its frequency) then we assign a neutral sentiment to each of its occurrences within the review set.
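The assignment rules above can be sketched as follows, where pattern_counts holds the corpus-wide frequency of each opinion pattern from the first pass; the negation list, data layout and names are illustrative assumptions rather than the authors' implementation.

    # Sketch of the sentiment assignment rules of Section 3.2 (illustrative only).
    NEGATION_TERMS = {'not', 'no', 'never', "n't", 'hardly'}   # assumed negation list

    def assign_sentiment(tokens, wmin, pattern, pattern_counts, pos_words, neg_words):
        """tokens: the words of sentence Sj; wmin: index of the nearest sentiment word;
        pattern: the opinion pattern recorded for this topic occurrence."""
        avg = sum(pattern_counts.values()) / len(pattern_counts)
        if pattern_counts.get(pattern, 0) <= avg:
            return '='                                 # invalid (rare) pattern -> neutral
        word = tokens[wmin].lower()
        sentiment = '+' if word in pos_words else '-' if word in neg_words else '='
        window = tokens[max(0, wmin - 4):wmin + 5]     # negation within a 4-word distance
        if any(t.lower() in NEGATION_TERMS for t in window):
            sentiment = {'+': '-', '-': '+'}.get(sentiment, '=')
        return sentiment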
4 Classifying Helpful Reviews

In the previous section we described our approach for automatically mining topics (T1, ..., Tm) from review texts and assigning sentiment values to them. Now we can associate each review Ri with sentiment tuples, (Ri, Sj, Tk, +/-/=), corresponding to a sentence Sj containing topic Tk with a sentiment value positive (+), negative (-), or neutral (=).

To build a classifier for predicting review helpfulness we adopt a supervised machine learning approach. In the data that is available to us each review has a helpfulness score that reflects the percentage of positive votes that it has received, if any. In this work we label a review as helpful if and only if it has a helpfulness score in excess of 0.75. All other reviews are labeled as unhelpful; thus we adopt a similar approach to that described by [O'Mahony and Smyth, 2009].

To represent review instances we rely on a standard feature-based encoding using a set of 7 different types of features including temporal information (AGE), rating information (RAT), simple sentence and word counts (SIZE), topical coverage (TOP), sentiment information (SENT), readability metrics (READ), and content made up of the top 50 most popular topics extracted from the reviews (CNT). These different types, and the corresponding individual features, are summarised in Table 1. Some of these features, such as rating, word and sentence length, date and readability, have been considered in previous work [Kim et al., 2006; Liu et al., 2008; O'Mahony and Smyth, 2010] and reflect best practice in the field of review classification. But the topical and sentiment features (explained in detail below) are novel, and in this paper our comparison of the performance of the different feature sets is intended to demonstrate the efficacy of our new features (in isolation and combination) in comparison to classical benchmarks across a common dataset and experimental setup.

[Table 1: Classification Feature Sets.]

4.1 From Topics and Sentiment to Classification Features

For each review Rk, we assign a collection of topics (topics(Rk) = T1, T2, ..., Tm) and corresponding sentiment scores (pos/neg/neutral) which can be considered in isolation and/or in aggregate as the basis for classification features. For example, we can encode information about a review's breadth (see Equation 1) and depth of topic coverage by simply counting the number of topics contained within the review and the average word count associated with the corresponding review sentences, as in Equation 2. Similarly, we can aggregate the popularity of review topics, relative to the topics across the product as a whole, as in Equation 3 (with rank(Ti) as a topic's popularity rank for the product and UniqueTopics(Rk) as the set of unique topics in a review); so if a review covers many popular topics then it receives a higher score than if it covers fewer rare topics.

Breadth(R_k) = |topics(R_k)|   (1)

Depth(R_k) = \frac{\sum_{T_i \in topics(R_k)} len(sentence(R_k, T_i))}{Breadth(R_k)}   (2)

TopicRank(R_k) = \sum_{T_i \in UniqueTopics(R_k)} \frac{1}{rank(T_i)}   (3)

When it comes to sentiment we can formulate a variety of classification features, from the number of positive (NumPos and NumUPos), negative (NumNeg and NumUNeg) and neutral (NumNeutral and NumUNeutral) topics (total and unique) in a review, to the rank-weighted number of positive (WPos), negative (WNeg), and neutral (WNeutral) topics, to the relative sentiment, positive (RelUPos), negative (RelUNeg), or neutral (RelUNeutral), of a review's topics; see Table 1.

We also include a measure of the relative density of opinionated (non-neutral sentiment) topics in a review (see Equation 4) and a relative measure of the difference between the overall review sentiment and the user's normalized product rating, i.e. SignedRatingDiff(R_k) = RelUPos(R_k) - NormUserRating(R_k); we also compute an unsigned version of this metric. The intuition behind the rating difference metrics is to note whether the user's overall rating is similar to or different from the positivity of their review content. Finally, as shown in Table 1, each review instance also encodes a vector of the top 50 most popular review topics (CNT), indicating whether each is present in the review or not.

Density(R_k) = \frac{|pos(topics(R_k))| + |neg(topics(R_k))|}{|topics(R_k)|}   (4)
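For illustration, the sketch below computes the Breadth, Depth, TopicRank and Density features of Equations 1-4 from the per-review sentiment tuples; the tuple layout and the topic_rank mapping (1 = most popular topic for the product) are assumptions made for the example.

    # Sketch of the topic/sentiment features of Equations 1-4 (illustrative only).
    def topic_sentiment_features(tuples, topic_rank):
        """tuples: list of (sentence_text, topic, sentiment) for one review,
        with sentiment one of '+', '-', '='; topic_rank maps topic -> product rank."""
        topics = [t for _, t, _ in tuples]
        breadth = len(topics)                                         # Equation 1
        depth = (sum(len(s.split()) for s, _, _ in tuples) / breadth
                 if breadth else 0.0)                                 # Equation 2
        rank_score = sum(1.0 / topic_rank[t] for t in set(topics))    # Equation 3
        opinionated = sum(1 for _, _, sent in tuples if sent != '=')
        density = opinionated / breadth if breadth else 0.0           # Equation 4
        return {'Breadth': breadth, 'Depth': depth,
                'TopicRank': rank_score, 'Density': density}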
4.2 Expanding Basic Features

Each of the basic features in Table 1 is calculated for a given single review. For example, we may calculate the breadth of review Ri to be 5, indicating that it covers 5 identified topics. Is this a high or low value for the product in question, which may have tens or even hundreds of reviews written about it? For this reason, in addition to this basic feature value, we include 4 other variations as follows to reflect the distribution of its values across a particular product:

• The mean value for this feature across the set of reviews for the target product.
• The standard deviation of the values for this feature across the target product reviews.
• The normalised value for the feature based on the number of standard deviations above (+) or below (-) the mean.
• The rank of the feature value, based on a descending ordering of the feature values for the target product.

Accordingly, most of the features outlined in Table 1 translate into 5 different actual features (the original plus the above 4 variations) for use during classification. This is the case for every feature (30 in all) in Table 1 except for the content features (CNT). Thus each review instance is represented as a total set of 200 features ((30 × 5) + 50).
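A minimal sketch of this expansion, assuming the values of one basic feature for all reviews of a single product are held in a dictionary; the names are illustrative.

    # Sketch of the per-product feature expansion of Section 4.2 (illustrative only).
    from statistics import mean, pstdev

    def expand_feature(values):
        """values: {review_id: feature_value} for all reviews of one product."""
        vals = list(values.values())
        mu, sigma = mean(vals), pstdev(vals)
        order = sorted(values, key=values.get, reverse=True)          # descending rank
        rank = {rid: pos + 1 for pos, rid in enumerate(order)}
        return {rid: {'raw': v,
                      'product_mean': mu,
                      'product_std': sigma,
                      'z_score': (v - mu) / sigma if sigma else 0.0,  # std devs above/below mean
                      'rank': rank[rid]}
                for rid, v in values.items()}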
5 Evaluation

We have described techniques for extracting topical features from product reviews and an approach for assigning sentiment to review sentences that cover these topics. Our hypothesis is that these topical and sentiment features will help when it comes to the automatic classification of user generated reviews, into helpful and unhelpful categories, by improving classification performance above and beyond more traditional features (e.g. terms, ratings, readability); see [Kim et al., 2006; O'Mahony et al., 2009]. In this section we test this hypothesis on real-world review data for a variety of product categories using a number of different classifiers.

5.1 Datasets & Setup

The review data for this experiment was extracted from Amazon.com during October 2012: 51,837 reviews from 1,384 unique products. We focused on 4 product categories — Digital Cameras (DC), GPS Devices, Laptops, Tablets — and labeled reviews as helpful or unhelpful, depending on whether their helpfulness score was above 0.75 or not, as described in Section 4. For the purpose of this experiment, all reviews included at least 5 helpfulness scores (to provide a reliable ground-truth) and the helpful and unhelpful sets were sampled so as to contain approximately the same number of reviews. Table 2 presents a summary of these data, per product type, including the average helpfulness scores across all reviews, and separately for helpful and unhelpful reviews.

Category       #Reviews   #Prod.   Avg. Helpfulness (Help. / Unhelp. / All)
DC             3180       113      0.93 / 0.40 / 0.66
GPS Devices    2058       151      0.93 / 0.46 / 0.69
Laptops        4172       592      0.93 / 0.40 / 0.67
Tablets        6652       241      0.92 / 0.39 / 0.65

Table 2: Filtered and Balanced Datasets.

Each review was processed to extract the classification features described in Section 4. Here we are particularly interested in understanding the classification performance of different categories of features. In this case we consider 8 different categories: AGE, RAT, SIZE, TOP, SENT-1, SENT-2, READ and CNT. Note that we have split the sentiment features (SENT) into two groups, SENT-1 and SENT-2. The latter contains all of the sentiment features from Table 1 whereas
the former excludes the ratings difference features (signed and unsigned) so that we can better gauge the influence of rating information (usually a powerful classification feature in its own right) within the sentiment feature-set. Accordingly, we prepared corresponding datasets for each category (Digital Cameras, GPS Devices, Laptops and Tablets) in which the reviews were represented by a single set of features; for example, the SENT-1 dataset consists of reviews (one set of reviews for each product category) represented according to the SENT-1 features only.

For the purpose of this evaluation we used three commonly used classifiers: RF (Random Forest), JRip and NB (Naive Bayes); see [Witten and Frank, 2005]. In each case we evaluated classification performance, in terms of the area under the ROC curve (AUC), using 10-fold cross validation.
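This evaluation protocol can be reproduced along the following lines with scikit-learn, assuming a feature matrix X and binary helpful/unhelpful labels y; note that JRip is a Weka rule learner with no direct scikit-learn counterpart, so only RF and NB are shown in this sketch.

    # Sketch of the 10-fold cross-validated AUC evaluation (illustrative only).
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.naive_bayes import GaussianNB
    from sklearn.model_selection import cross_val_score

    def evaluate(X, y):
        """Return the mean AUC over 10-fold cross validation for each classifier."""
        classifiers = {'RF': RandomForestClassifier(n_estimators=100, random_state=0),
                       'NB': GaussianNB()}
        return {name: cross_val_score(clf, X, y, cv=10, scoring='roc_auc').mean()
                for name, clf in classifiers.items()}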
5.2 Results

The results are presented in Figures 3(a-d). In Figures 3(a-c) we show the AUC performance for each classification algorithm (RF, JRip, NB) separately; each graph plots the AUC of one algorithm for the 8 different categories of classification features for each of the four different product categories (DC, GPS, Laptop, and Tablet). Figure 3(d) provides a direct comparison of all classification algorithms (RF, JRip, NB); here we use a classifier using all features combined. AUC values in excess of 0.7 can be considered as useful from a classification performance viewpoint [Streiner and Cairney, 2007]. Overall we can see that RF tends to produce better classification performance across the various feature groups and product categories. Classification performance tends to be poorer for the GPS dataset compared to Laptop, Tablet, and DC.

We know from previous research that ratings information proves to be particularly useful when it comes to evaluating review helpfulness [Kim et al., 2006]. It is perhaps no surprise therefore to see that our ratings-based features perform well, often achieving an AUC > 0.7 on their own; for example, in Figure 3(a) we see an AUC of approximately 0.75 for the Laptop and Tablet datasets, compared to between 0.65 and 0.69 for GPS and DC, respectively. Other 'traditional' feature groups (AGE, SIZE, READ, and CNT) rarely manage to achieve AUC scores > 0.7 across the product categories.

We can see strong performance from the new topic and sentiment feature-sets proposed in this work. The SENT-2 features consistently and significantly outperform all others, with AUC scores in excess of 0.7 for all three algorithms and across all four product categories; indeed, in some cases the SENT-2 features deliver AUC greater than 0.8 for DC, Laptop and Tablet products; see Figure 3(a). The SENT-2 feature group benefits from a combination of sentiment and ratings based features, but a similar observation can be made for the sentiment-only features of SENT-1, which also achieve AUC greater than 0.7 for almost all classification algorithms and product categories. Likewise, the topical features (TOP) also deliver a strong performance, with AUC > 0.7 for all product categories except for GPS.

These results bode well for a practical approach to review helpfulness prediction/classification, with or without ratings data. The additional information contained within the topical and sentiment features contributes to an uplift in classification performance, particularly with respect to more conventional features that have been traditionally used for review classification. In Figure 3(d) we present summary classification results according to product category when we build classifiers using the combination of all types of features. Once again we can see strong classification performance. We achieve an AUC of more than 0.7 for all conditions and the RF classifier delivers an AUC close to 0.8 or beyond for all categories.

6 Recommending Helpful Reviews

In many situations users are faced with a classical information overload problem: sifting through potentially hundreds or even thousands of product opinions. Sites like Amazon collect review helpfulness feedback so that they can rank reviews by their average helpfulness scores, but this is far from perfect. Many reviews (often a majority) have received very few or no helpfulness scores. This is especially true for more recent reviews, which arguably may be more reliable in the case of certain product categories (e.g. hotel rooms). Moreover, if reviews are sorted by helpfulness then it is unlikely that users will get to see those yet to be rated, making it even less likely that they will attract ratings. It quickly becomes a case of "the rich get richer" for those early rated helpful reviews. This is one of the strong motivations behind our own work on review classification, but can our classifier be used to recommend helpful reviews to the end user?

Amazon currently adopts a simple approach to review recommendation, by suggesting the most helpful positive and most helpful critical review from a review collection. To evaluate the ability of our classifier to make review recommendations we can use the classification confidence as one simple way to rank-order helpful reviews and select the top-ranked review for recommendation to the user. In this experiment we select the single most confident helpful review for each individual product across the four different product categories; we refer to this strategy as Pred. Remember we are making this recommendation without the presence of actual helpfulness scores and rely only on our ability to predict whether a review will be helpful. In this experiment we use an RF classifier using all features. As a baseline recommendation strategy we also select a review at random; we call this strategy Rand.

We can test the performance of these recommendation techniques in two ways. First, because we know the actual helpfulness scores of all reviews (the ground-truth), we can compare the recommended review to the review which has the actual highest helpfulness score for each product, and average across all products in a given product class. Thus the two line graphs in Figure 4 plot the actual helpfulness of the recommended reviews (for Pred and Rand) as a percentage of the actual helpfulness of the most helpful review for each product; we call this the helpfulness ratio (HR). We can see that Pred significantly outperforms Rand, delivering a helpfulness ratio of 0.9 and higher compared to approximately 0.7 for Rand. This means that Pred is capable of recommending a review that has an average helpfulness score that is 90% that of the actual most helpful review.
[Figure 3: Classification performance results: (a-c) for RF, JRip and NB classifiers and different feature groups; (d) comparison of RF, JRip and NB for all features.]