Preprocessing Techniques for Text Mining
All content following this page was uploaded by Vairaprakash Gurusamy on 05 March 2015.
Abstract
Preprocessing is an important task and a critical step in text mining, Natural Language Processing (NLP) and Information Retrieval (IR). In text mining, data preprocessing is used to extract interesting, non-trivial knowledge from unstructured text data. Information retrieval is essentially a matter of deciding which documents in a collection should be retrieved to satisfy a user's need for information. The user's need for information is represented by a query or profile, which contains one or more search terms plus some additional information such as the weights of the words. Hence, the retrieval decision is made by comparing the terms of the query with the index terms (important words or phrases) appearing in the document itself. The decision may be binary (retrieve/reject), or it may involve estimating the degree of relevance that the document has to the query. Unfortunately, the words that appear in documents and in queries often have many structural variants. So, before information is retrieved from the documents, data preprocessing techniques are applied to the target data set to reduce its size, which increases the effectiveness of the IR system. The objective of this study is to analyze the issues of preprocessing methods such as tokenization, stop word removal and stemming for text documents.

Keywords: Text Mining, NLP, IR, Stemming
I. Introduction

Need of Text Preprocessing in NLP System
Isolating: Words do not divide into smaller units. Example: Mandarin Chinese

Agglutinative: Words divide into smaller units. Example: Japanese, Tamil

performance. Every text document deals with these words, which are not necessary for text mining applications.
IV. Stemming
Stemming is the process of conflating the variant forms of a word into a common representation, the stem. For example, the words “presentation”, “presented” and “presenting” could all be reduced to a common representation “present”. This is a widely used procedure in text processing for information retrieval, on the assumption that a query containing the term presenting implies an interest in documents containing the words presentation and presented.

Errors in Stemming
There are mainly two errors in stemming:
1. Over-stemming
2. Under-stemming
Over-stemming is when two words with different stems are stemmed to the same root. This is also known as a false positive. Under-stemming is when two words that should be stemmed to the same root are not. This is also known as a false negative.

TYPES OF STEMMING ALGORITHMS
i) Table Look Up Approach
One method to do stemming is to store a table of all index terms and their stems. Terms from the queries and indexes could then be stemmed via a lookup table, using B-trees or hash tables. Such lookups are very fast, but there are problems with this approach. No such table exists for English; even if one did, many terms might not be represented because they are domain specific and would require some other stemming method.

ii) Successor Variety
Successor variety stemmers are based on structural linguistics, which determines word and morpheme boundaries based on the distribution of phonemes. The successor variety of a string is the number of distinct characters that follow it in the words of some body of text. For example, consider a body of text consisting of the following words:
Able, ape, beatable, finable, read, readable, reading, reads, red, rope, ripe.
Let's determine the successor variety for the word read. The first letter in read is R. R is followed in the text body by three characters, E, I and O, so the successor variety of R is 3. The next successor variety for read is 2, since A and D follow RE in the text body, and so on. The following table shows the complete successor variety for the word read.
Prefix    Successor Variety    Letters
R         3                    E, I, O
RE        2                    A, D
REA       1                    D
READ      3                    A, I, S
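
The successor variety counts in the table can be reproduced with a short script. The following is a minimal sketch in Python, not an implementation taken from this paper: the function names `successor_letters` and `successor_varieties` are hypothetical, the word list is lower-cased, and the end of a word is assumed not to count as a successor (which agrees with the value 3 given for READ).

```python
# Body of text from the example, lower-cased for comparison.
CORPUS = ["able", "ape", "beatable", "finable", "read", "readable",
          "reading", "reads", "red", "rope", "ripe"]


def successor_letters(prefix, words):
    """Return the sorted distinct letters that follow `prefix` in `words`.

    Assumption: a word that ends exactly at `prefix` contributes no
    successor (end-of-word is not counted).
    """
    prefix = prefix.lower()
    return sorted({w[len(prefix)] for w in words
                   if w.startswith(prefix) and len(w) > len(prefix)})


def successor_varieties(word, words):
    """Map every non-empty prefix of `word` to its successor letters."""
    return {word[:i]: successor_letters(word[:i], words)
            for i in range(1, len(word) + 1)}


table = successor_varieties("read", CORPUS)
for prefix, letters in table.items():
    # One row per prefix: prefix, successor variety, letters.
    print(prefix.upper(), len(letters), ",".join(c.upper() for c in letters))
```

The successor variety of a prefix is simply the length of its letter list. A complete successor variety stemmer would additionally need a rule for choosing a segmentation point from these counts, which is not covered by this sketch.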