Paper-6 Data Mining and Natural Language Processing Methods For Extracting Opinions From Customer Reviews
Data Mining and Natural Language Processing Methods for Extracting Opinions from Customer Reviews
Shaikh Abdul Hannan1, Shaikh Jameel Ahmed2, Quadri Naveed3, Rizwan Alam Thakur4
1Vivekanand College, Aurangabad, India
2Lecturer, King Khalid University, Abha, Saudi Arabia
3Lecturer, King Khalid University, Abha, Saudi Arabia
4Lecturer, King Khalid University, Abha, Saudi Arabia
Abstract
Automatic opinion recognition involves a number of related tasks, such as identifying the boundaries of opinion expressions, determining their polarity, and determining their intensity. Although much progress has been made in this area, existing research typically treats each of these tasks in isolation. With the growing volume of opinions, there is a need for automatic extraction of information to support summaries of customer reviews. What other people think has always been an important input for most of us during the decision-making process. Many web sites and web forums offer a large number of opinions and experiences from people all over the world, but reading all of them is time-consuming and can be confusing. Opinion mining analyzes the opinions that people post on different web sites and provides a summary, so that a decision can be made quickly and without confusion. The aim of the proposed system is to generate feature-based summaries of customer reviews of products sold online.
Introduction
In recent years, there has been a great deal of interest in methods for automatically identifying opinions, emotions, and sentiments in text. The Web has dramatically changed the way that people express themselves and interact with others. They can now post reviews of products at merchant sites, and express their views and interact with others via blogs and forums. Such content contributed by Web users is collectively called user-generated content (as opposed to the content provided by Web site owners). It is now well recognized that user-generated content contains valuable information that can be exploited for many applications. In this paper, we focus on customer reviews of products. Reviews contain rich user opinions on products and services. They are used by potential customers to find the opinions of existing users before deciding to purchase a product. They are also used by product manufacturers to identify product problems and to gather marketing intelligence about their competitors [1].
With the rapid expansion of e-commerce, more and more products are sold on the Web, and more and more people are buying products online. To enhance customer satisfaction and the shopping experience, it has become common practice for online merchants to let their customers review or express opinions on the products they have purchased. As more ordinary users become comfortable with the Web, an increasing number of people are writing reviews. As a result, the number of reviews that a product receives grows rapidly.
There is a wide range of technologies and focus areas in Human Language Technology (HLT). These include Natural Language Processing (NLP), Speech Recognition, Machine Translation, Text Generation and Text Mining. NLP has been around for a number of decades. It has developed various techniques that are typically linguistically inspired, i.e., text is syntactically parsed using information from a formal grammar and a lexicon, and the resulting information is then interpreted semantically and used to extract information about what was said. A natural language (NL) is any language naturally used by humans, i.e., not an artificial or man-made language such as a programming language. Natural Language Processing (NLP) is a convenient description for all attempts to use computers to process natural language. NLP is often used in a way that excludes speech, in which case SNLP is used as the term that includes both speech and other aspects of natural language processing. In the broad sense, NLP includes Speech Synthesis, Speech Recognition, Natural Language Understanding (NLU), Natural Language Generation and Machine Translation (MT) [36].
International Journal of Computational Intelligence and Information Security, July 2012 Vol. 3, No. 6 ISSN: 1837-7823
Opinion mining is a recent discipline at the intersection of Information Retrieval and Computational Linguistics. It is concerned not with the topic a document is about, but with the opinion it expresses. Opinion mining gathers and combines many concepts, ideas and methods from these two disciplines.
Historical background
Textual information in the world can be broadly classified into two main categories: facts and opinions. Facts are objective statements about entities and events in the world. Opinions are subjective statements that reflect people's sentiments or perceptions about those entities and events. Much of the existing research on text information processing has focused (almost exclusively) on mining and retrieval of factual information, e.g., information retrieval, Web search, and many other text mining and natural language processing tasks. Little work was done on the processing of opinions until recently. Yet opinions are so important that whenever one needs to make a decision, one wants to hear the opinions of others. This is true not only for individuals but also for organizations. One reason for the lack of study on opinions is that little opinionated text was available before the World Wide Web. Before the Web, an individual had to spend considerable effort to gather opinions about even a single product, whereas an organization had to collect opinions from hundreds or thousands of people across different communities. Collection was only the first step, since the gathered information still had to be analyzed before a decision could be made. With the World Wide Web, a massive and rapidly growing amount of information is available, and it is up to the user to extract what he or she needs about a particular product. Content on the Web can be contributed and edited by individuals; collectively this is called user-generated content. Finding opinion sources on the Web and monitoring them is still a huge task for an organization, which must keep its information up to date for each and every product. For a single product many sources are available on the Web, and extracting the exact information required is left entirely to the user. To address these problems, an automated system is needed to support opinion mining. Opinion mining is also referred to as sentiment analysis [2].
Literature review
In the past few years, there has been growing interest in mining opinions in reviews from both academia and industry. Existing work has mainly focused on extracting and summarizing opinions from reviews using natural language processing and data mining techniques [3,4,5,6]. Grouping feature expressions that are domain synonyms is critical for effective opinion summarization [7]. Since there are typically hundreds of feature expressions that can be discovered from text for an opinion mining application, it is very time-consuming and tedious for human users to group them into feature categories, so some automated assistance is needed. Unsupervised learning, or clustering, is the natural technique for solving the problem. The similarity measures used in clustering are usually based on some form of distributional similarity [8,9,10,11,12,13,14,15]. Recent work has also used topic modeling [16,17]. Opinion mining aims to find people's opinions or sentiments about topics and about aspects or features of those topics [18]. Much of the current research has focused on extracting opinions from product reviews [19]. A key characteristic of reviews is that each review is dedicated to the evaluation of a specific product, and there is little interaction among reviewers or irrelevant content. This is not the case for online discussions or comments.
Research on opinion mining started with identifying opinion (or sentiment) bearing words, e.g., great, amazing, wonderful, bad, and poor. Many researchers have worked on mining such words and identifying their semantic orientations (i.e., positive or negative). In [23], the authors identified several linguistic rules that can be exploited to identify opinion words and their orientations from a large corpus. This method has been applied, extended and improved in [21, 26, 30]. In [24, 27], a bootstrapping approach is proposed, which uses a small set of given seed opinion words to find their synonyms and antonyms in WordNet (http://wordnet.princeton.edu/). The next major development is sentiment classification of product reviews at the document level [20, 29, 31]. The objective of this task is to classify each review document as expressing a positive or a negative sentiment about an object (e.g., a movie, a camera, or a car). Several researchers have also studied sentence-level sentiment classification [27, 32, 33], i.e., classifying each sentence as expressing a positive or a negative opinion. The model of feature-based opinion mining and summarization is proposed in [24,28].
This model gives a more complete formulation of the opinion mining problem. It identifies the key pieces of information that should be mined and describes how a structured opinion summary can be produced from unstructured texts. The problem of mining opinions from comparative sentences is introduced in [22, 25].
Reviews are generally posted at merchant sites, Internet forums, discussion groups, blogs, etc. Such postings are called user-generated content or user-generated media. Three mining tasks can be applied to this content; Bing Liu [34] explains them in his book Web Data Mining: Exploring Hyperlinks, Contents, and Usage Data.
Minqing Hu and Bing Liu [32] proposed a technique to study the problem of feature-based opinion summarization of customer reviews of products sold online. Given a set of customer reviews of a particular product, the task involves three subtasks: (1) identifying features of the product on which customers have expressed their opinions (called product features); (2) for each feature, identifying review sentences that give positive or negative opinions; and (3) producing a summary using the discovered information. They proposed a set of techniques for mining and summarizing product reviews based on data mining and natural language processing methods.
Xiaowen Ding, Bing Liu and Philip S. Yu [21] proposed an effective method for identifying the semantic orientations of opinions expressed by reviewers on product features. It is able to deal with two major problems of existing methods: (1) opinion words whose semantic orientations are context dependent, and (2) aggregating multiple opinion words in the same sentence. For (1), a holistic approach is proposed.
Helena Ahonen, Oskari Heinonen, Mika Klemettinen and A. Inkeri Verkamo [37] proposed a method for descriptive phrase extraction in digital document collections, together with a framework for text mining. The data mining method they apply is based on generalized episodes and episode rules. (1) General framework for text mining: the data considered in this process consist of sequential data or observations. The starting point is textual data, and the end product is information describing phenomena that are frequent in the data. Episodes and episode rules are modifications of the concepts of frequent sets and association rules, applied to sequential data. (2) Preprocessing of data: preprocessing accounts for nearly 80 percent of the total effort. The applicability of the approach was demonstrated with experiments on real-life data, showing that the episodes and episode rules produced discriminate between documents. Both pre- and postprocessing have essential roles in pruning and weighting the results.
Bing Liu, Minqing Hu and Junsheng Cheng [33] focused on one type of opinion source, customer reviews of products. They proposed a novel visual analysis system to compare consumer opinions of multiple products. To support visual analysis, they designed a supervised pattern discovery method to automatically identify product features from the Pros and Cons in reviews of format (2). A friendly interface is also provided to enable the analyst to interactively correct errors of the automatic system, if needed, which is much more efficient than manual tagging.
Review Database: The review database is created by gathering customer reviews of products from online review sites. Some sites from which data can be gathered are www.ebay.com, www.eopinion.com, www.imdb.com and www.amazon.com. Some datasets of customer reviews are available at http://www.cs.uic.edu/~liub.
POS Tagger: Part-of-speech (POS) tagging is the process of assigning a part of speech such as noun, verb, pronoun, adverb, adjective or another lexical class marker to each word in a sentence. Existing POS taggers such as the Stanford Tagger [35] can be used for this step. For example, consider the sentence "Battery life of this handset is good". After tagging this sentence with the Stanford Tagger we get the following output: Battery/NNP life/NN of/IN this/DT handset/NN is/VBZ good/JJ
Frequent Feature Identification: After POS tagging, frequent features are identified. Frequent features are product features on which many people have expressed an opinion; infrequent features are product features on which few people have expressed an opinion. The previous step, POS tagging, yields the candidate product features on which reviewers have commented, together with the opinion words in each review. In this step we separate frequent product features from infrequent ones. In this system, only frequent features are considered. Association mining can be used to find all frequent features. The main reason for using association mining is the following observation. A customer review commonly contains many things that are not directly related to product features, and different customers usually have different stories. However, when they comment on product features, the words that they use converge. Using association mining to find frequent itemsets is therefore appropriate, because those frequent itemsets are likely to be product features.
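The frequent feature step can be illustrated with a minimal sketch. It assumes that each sentence is already available in the "word/TAG" format shown for the tagger output, treats the nouns of a sentence as one transaction, and counts the support of noun itemsets of size one and two directly rather than running a full association mining algorithm such as Apriori. The function names, the min_support value and all example sentences except the first are hypothetical and hand-written for illustration; they are not taken from the system described here.

```python
from collections import Counter
from itertools import combinations

# Penn Treebank noun tags; candidate features are assumed to be nouns.
NOUN_TAGS = {"NN", "NNS", "NNP", "NNPS"}


def nouns_in(tagged_sentence):
    """Return the set of lowercased nouns in a 'word/TAG word/TAG ...' string."""
    nouns = set()
    for token in tagged_sentence.split():
        word, _, tag = token.rpartition("/")
        if tag in NOUN_TAGS:
            nouns.add(word.lower())
    return nouns


def frequent_features(tagged_sentences, min_support=0.3, max_size=2):
    """Count noun itemsets up to max_size and keep those meeting min_support.

    min_support is the fraction of sentences (transactions) that must
    contain the itemset. This is a brute-force simplification, not Apriori.
    """
    transactions = [nouns_in(s) for s in tagged_sentences]
    threshold = min_support * len(transactions)
    counts = Counter()
    for nouns in transactions:
        for size in range(1, max_size + 1):
            for itemset in combinations(sorted(nouns), size):
                counts[itemset] += 1
    return {itemset: n for itemset, n in counts.items() if n >= threshold}


# Hypothetical input: the first sentence is the tagger output quoted above,
# the others are hand-tagged examples.
sentences = [
    "Battery/NNP life/NN of/IN this/DT handset/NN is/VBZ good/JJ",
    "The/DT battery/NN life/NN is/VBZ short/JJ",
    "I/PRP love/VBP the/DT screen/NN",
    "The/DT screen/NN is/VBZ bright/JJ",
]
features = frequent_features(sentences, min_support=0.5)
```

With a support threshold of one half, "handset" is dropped as infrequent, while "battery", "life", "screen" and the pair ("battery", "life") are kept as candidate features.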
Feature Pruning: Not all candidate frequent features generated by association mining are genuine features. Two types of pruning are used to remove unlikely features.
Compactness pruning: This method checks features that contain at least two words, which we call feature phrases, and removes those that are likely to be meaningless. The association mining algorithm does not consider the position of an item (or word) in a sentence. However, words that appear together in a sentence in a specific order are more likely to form meaningful phrases. Therefore, some of the frequent feature phrases generated by association mining may not be genuine features. Compactness pruning aims to prune those candidate features whose words do not appear together in a specific order.
Redundancy pruning: In this step, we focus on removing redundant features that contain single words.
Opinion Word Extractor: After pruning, opinion words are extracted by the Opinion Word Extractor. If a sentence contains one or more product features and one or more opinion words, the sentence is called an opinion sentence. The extractor works as follows:
1. For each sentence in the database, if the sentence is an opinion sentence, extract all of its adjectives as opinion words.
2. For each feature, consider each nearby adjective as its opinion word.
Opinion Words Polarity Recognizer: It recognizes the polarity of each opinion word in a sentence, using the following steps:
1. If the opinion word is present in the database, assign the polarity stored in the database.
2. If the opinion word is not present in the database, look up its synonyms. If a synonym is present, assign the polarity of that synonym to the opinion word and store it in the database.
3. If neither the opinion word nor any of its synonyms is present in the database, look up its antonyms. If an antonym is present, assign the opposite polarity of that antonym to the opinion word and store it in the database.
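The three lookup steps of the polarity recognizer can be sketched as follows. This is only an illustration under stated assumptions: the polarity database is modelled as a plain dictionary, and the synonym and antonym tables are small hand-written dictionaries standing in for a lexical resource such as WordNet. All names and example words are hypothetical.

```python
# Map each polarity to its opposite, used when resolving through an antonym.
OPPOSITE = {"positive": "negative", "negative": "positive"}


def recognize_polarity(word, polarity_db, synonyms, antonyms):
    """Return 'positive', 'negative', or None if the polarity is unknown.

    polarity_db is updated in place, so a word resolved through a synonym
    or antonym can be looked up directly the next time.
    """
    # Step 1: the word is already in the database.
    if word in polarity_db:
        return polarity_db[word]
    # Step 2: a synonym is in the database, so the word takes its polarity.
    for syn in synonyms.get(word, ()):
        if syn in polarity_db:
            polarity_db[word] = polarity_db[syn]
            return polarity_db[word]
    # Step 3: an antonym is in the database, so the word takes the opposite.
    for ant in antonyms.get(word, ()):
        if ant in polarity_db:
            polarity_db[word] = OPPOSITE[polarity_db[ant]]
            return polarity_db[word]
    return None


# Hypothetical seed database and lexical tables.
polarity_db = {"good": "positive", "bad": "negative"}
synonyms = {"great": ["good"], "superb": ["great"]}
antonyms = {"poor": ["good"]}

great = recognize_polarity("great", polarity_db, synonyms, antonyms)
superb = recognize_polarity("superb", polarity_db, synonyms, antonyms)
poor = recognize_polarity("poor", polarity_db, synonyms, antonyms)
unknown = recognize_polarity("blue", polarity_db, synonyms, antonyms)
```

Because "great" is stored after it is resolved through "good", the later lookup of "superb" succeeds through its synonym "great". A word with no entry and no known synonym or antonym is left unresolved in this sketch.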
Opinion Sentence Polarity Recognizer: It recognizes the polarity of each opinion sentence in each review, using the following steps:
1. If an odd number of negation words is present in the opinion sentence, assign the opposite of the opinion word's polarity to the sentence.
2. If an even number of negation words is present in the opinion sentence, assign the opinion word's polarity to the sentence.
3. If no negation word is present in the opinion sentence, assign the opinion word's polarity to the sentence.
Feature-Based Summary Review Generation: The final feature-based review summary is generated in the following steps:
1. For each discovered feature, related opinion sentences are put into positive and negative categories according to the orientations of the opinion sentences. A count is computed to show how many reviews give positive or negative opinions on the feature.
2. All features are ranked according to the frequency of their appearances in the reviews. Feature phrases appear before single-word features, as phrases are normally more interesting to users. Other types of rankings are also possible.
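The negation rule and the summary generation can be sketched together. The sketch assumes that each opinion sentence has already been reduced to a feature, a token list and the polarity of its opinion word. The negation word list is a small hypothetical one, and counts are taken per sentence rather than per review as a simplification. None of the names below come from the system described here.

```python
from collections import defaultdict

# Hypothetical negation word list; a real system would need a fuller one.
NEGATION_WORDS = {"not", "no", "never", "n't"}


def sentence_polarity(tokens, opinion_polarity):
    """Flip the opinion word's polarity when the negation count is odd."""
    negations = sum(1 for t in tokens if t.lower() in NEGATION_WORDS)
    if negations % 2 == 1:
        return "negative" if opinion_polarity == "positive" else "positive"
    return opinion_polarity


def summarize(opinion_sentences):
    """Group sentences by feature and polarity, then rank the features.

    opinion_sentences: iterable of (feature, tokens, opinion_word_polarity).
    Returns a list of (feature, {"positive": [...], "negative": [...]}),
    with feature phrases first, then by descending number of sentences.
    """
    summary = defaultdict(lambda: {"positive": [], "negative": []})
    for feature, tokens, polarity in opinion_sentences:
        label = sentence_polarity(tokens, polarity)
        summary[feature][label].append(" ".join(tokens))

    def rank_key(item):
        feature, groups = item
        total = len(groups["positive"]) + len(groups["negative"])
        # False sorts before True, so multi-word phrases come first.
        return (" " not in feature, -total, feature)

    return sorted(summary.items(), key=rank_key)


# Hypothetical opinion sentences with the polarity of their opinion word.
data = [
    ("battery life", ["The", "battery", "life", "is", "not", "long"], "positive"),
    ("battery life", ["Battery", "life", "is", "good"], "positive"),
    ("screen", ["The", "screen", "is", "bright"], "positive"),
    ("screen", ["The", "screen", "is", "not", "bad"], "negative"),
    ("screen", ["The", "screen", "is", "dim"], "negative"),
]
ranked = summarize(data)
```

In this example "battery life" is listed before "screen" even though it has fewer sentences, because feature phrases are ranked ahead of single-word features.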
Conclusion
In this paper, a system for mining and summarizing customer reviews based on data mining and natural language processing methods is proposed. The objective is to provide a feature-based summary of a large number of customer reviews of a product sold online. Summarizing the reviews is not only useful to common shoppers, but also crucial to product manufacturers. In future work, we plan to further improve and refine our techniques, and to deal with the outstanding problems identified above, i.e., pronoun resolution, determining the strength of opinions, and investigating opinions expressed with adverbs, verbs and nouns. Finally, we will also look into monitoring of customer reviews.
References
[1] K. Dave, S. Lawrence & D. Pennock. Mining the peanut gallery: opinion extraction and semantic classification of product reviews. WWW2003.
[2] Bing Liu. "Opinion Mining." Invited contribution to Encyclopedia of Database Systems, 2008.
[3] M. Hu & B. Liu. Mining and summarizing customer reviews. KDD2004.
[4] A. Ntoulas, M. Najork, M. Manasse & D. Fetterly. Detecting Spam Web Pages through Content Analysis. WWW2006.
[5] B. Pang, L. Lee & S. Vaithyanathan. Thumbs up? Sentiment classification using machine learning techniques. EMNLP2002.
[6] M. Sahami, S. Dumais, D. Heckerman & E. Horvitz. A Bayesian Approach to Filtering Junk E-Mail. AAAI Technical Report WS-98-05, 1998.
[7] Liu B, Hu M, and Cheng J. Opinion Observer: Analyzing and Comparing Opinions on the Web. In Proceedings of WWW. 2005. 342-351.
[8] Bollegala D, Matsuo Y, and Ishizuka M. Measuring semantic similarity between words using web search engines. In Proceedings of WWW. 2007. 757-766.
[9] Chen H, Lin M, and Wei Y. Novel association measures using web search with double checking. In ACL. 2006. 1016.
[10] Lin D. Automatic retrieval and clustering of similar words. In Proceedings of ACL. 1998. 768-774.
[11] Lin D and Wu X. Phrase clustering for discriminative learning. In Proceedings of ACL. 2009. 1030-1038.
[12] Pantel P, Crestan E, Borkovsky A, Popescu A, and Vyas V. Web-scale distributional similarity and entity set expansion. In Proceedings of EMNLP. 2009. 938-947.
[13] Pereira F, Tishby N, and Lee L. Distributional clustering of English words. In Proceedings of ACL. 1993. 183-190.
[14] Resnik P. Using information content to evaluate semantic similarity in a taxonomy. In IJCAI. 1995. 448-453.
[15] Sahami M and Heilman T. A web-based kernel function for measuring the similarity of short text snippets. In Proceedings of WWW. 2006. 377-386.
[16] Guo H, Zhu H, Guo Z, Zhang X, and Su Z. Product feature categorization with multilevel latent semantic association. In Proceedings of CIKM. 2009. 1087-1096.
[17] Titov I and McDonald R. Modeling online reviews with multi-grain topic models. In WWW. 2008. 111-120.
[18] Hu M. and B. Liu. 2004. Mining and Summarizing Customer Reviews. Proc. of KDD.
[19] Liu B. 2010. Sentiment Analysis and Subjectivity. Handbook of Natural Language Processing, N. Indurkhya and F. J. Damerau.
[20] Dave, K., Lawrence, S., and Pennock, D. Mining the Peanut Gallery: Opinion Extraction and Semantic Classification of Product Reviews. Proceedings of International World Wide Web Conference (WWW03), 2003.
[21] Ding, X., Liu, B. and Yu, P. A Holistic Lexicon-Based Approach to Opinion Mining. Proceedings of the first ACM International Conference on Web Search and Data Mining (WSDM08), 2008.
[22] Ganapathibhotla, G. and Liu, B. Identifying Preferred Entities in Comparative Sentences. To appear in Proceedings of the 22nd International Conference on Computational Linguistics (COLING08), 2008.
[23] Hatzivassiloglou, V. and McKeown, K. Predicting the Semantic Orientation of Adjectives. ACL-EACL 97, 1997.
[24] Hu, M. and Liu, B. Mining and Summarizing Customer Reviews. Proceedings of ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD04), 2004.
[25] Jindal, N. and Liu, B. Mining Comparative Sentences and Relations. Proceedings of National Conference on Artificial Intelligence (AAAI06), 2006.
[26] Kanayama, H. and Nasukawa, T. Fully Automatic Lexicon Expansion for Domain-Oriented Sentiment Analysis. Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing (EMNLP06), 2006.
[27] Kim, S. and Hovy, E. Determining the Sentiment of Opinions. Proceedings of the 20th International Conference on Computational Linguistics (COLING04), 2004.
[28] Liu, B., Hu, M. and Cheng, J. Opinion Observer: Analyzing and Comparing Opinions on the Web. Proceedings of International World Wide Web Conference (WWW05), 2005.
[29] Pang, B., Lee, L. and Vaithyanathan, S. Thumbs up? Sentiment Classification Using Machine Learning Techniques. Proceedings of the 2002 Conference on Empirical Methods in Natural Language Processing (EMNLP02), 2002.
[30] Popescu, A.-M. and Etzioni, O. Extracting Product Features and Opinions from Reviews. Proceedings of the 2005 Conference on Empirical Methods in Natural Language Processing (EMNLP05), 2005.
[31] Turney, P. Thumbs Up or Thumbs Down? Semantic Orientation Applied to Unsupervised Classification of Reviews. ACL02, 2002.
[32] Minqing Hu and Bing Liu. Mining Opinion Features in Customer Reviews. KDD04, August 22-25, 2004.
[33] Bing Liu, Minqing Hu and Junsheng Cheng. Opinion Observer: Analyzing and Comparing Opinions on the Web. WWW 2005, May 10-14, 2005.
[34] Bing Liu. Web Data Mining: Exploring Hyperlinks, Contents, and Usage Data. Springer, 2007.
[35] Stanford Tagger Version 3.0 (http://www-nlp.stanford.edu/software/tagger.html).
[36] Rune Saetre. GeneTUC: Natural Language Understanding in Medical Text. Thesis, Department of Computer Science and Technology, Norwegian University of Science and Technology (NTNU), Norway, June 2006.
[37] Helena Ahonen, Oskari Heinonen, Mika Klemettinen and A. Inkeri Verkamo. Applying Data Mining Techniques for Descriptive Phrase Extraction in Digital Document Collections. IEEE International Forum, Santa Barbara, CA, 24 April 1998.