


Preprocessing Techniques for Text Mining

Conference Paper · October 2014

Dr. S. Kannan
Associate Professor, Department of Computer Applications, Madurai Kamaraj University
[email protected]

Vairaprakash Gurusamy
Research Scholar, Department of Computer Applications, Madurai Kamaraj University
[email protected]

Abstract

Preprocessing is an important task and critical step in Text Mining, Natural Language Processing (NLP), and Information Retrieval (IR). In the area of Text Mining, data preprocessing is used to extract interesting, non-trivial knowledge from unstructured text data. Information Retrieval is essentially a matter of deciding which documents in a collection should be retrieved to satisfy a user's need for information. The user's need for information is represented by a query or profile and contains one or more search terms, plus some additional information such as the weight of the words. Hence, the retrieval decision is made by comparing the terms of the query with the index terms (important words or phrases) appearing in the document itself. The decision may be binary (retrieve/reject), or it may involve estimating the degree of relevance that the document has to the query. Unfortunately, the words that appear in documents and in queries often have many structural variants. So before information retrieval from the documents, data preprocessing techniques are applied to the target data set to reduce its size, which increases the effectiveness of the IR system. The objective of this study is to analyze the issues in preprocessing methods such as tokenization, stop word removal, and stemming for text documents.

Keywords: Text Mining, NLP, IR, Stemming
I. Introduction

Text preprocessing is an essential part of any NLP system, since the characters, words, and sentences identified at this stage are the fundamental units passed to all further processing stages, from analysis and tagging components, such as morphological analyzers and part-of-speech taggers, through applications, such as information retrieval and machine translation systems. It is a collection of activities in which text documents are pre-processed. Text data often contains special formats, such as number and date formats, as well as very common words that are unlikely to help text mining, such as prepositions, articles, and pronouns, which can be eliminated.

Need of Text Preprocessing in NLP System

1. To reduce the indexing (or data) file size of the text documents
   i) Stop words account for 20-30% of the total word count in a typical text document
   ii) Stemming may reduce indexing size by as much as 40-50%
2. To improve the efficiency and effectiveness of the IR system
   i) Stop words are not useful for searching or text mining, and they may confuse the retrieval system
   ii) Stemming is used for matching similar words in a text document

II. Tokenization

Tokenization is the process of breaking a stream of text into words, phrases, symbols, or other meaningful elements called tokens. The aim of tokenization is the exploration of the words in a sentence. The list of tokens becomes input for further processing such as parsing or text mining. Tokenization is useful both in linguistics (where it is a form of text segmentation) and in computer science, where it forms part of lexical analysis. Textual data is only a block of characters at the beginning. All processes in information retrieval require the words of the data set. Hence, the requirement for a parser is a tokenization of documents. This may sound trivial, as the text is already stored in machine-readable formats. Nevertheless, some problems are still left, like the removal of punctuation marks. Other characters, like brackets and hyphens, require processing as well. Furthermore, a tokenizer can cater for consistency in the documents. The main use of tokenization is identifying the meaningful keywords. One inconsistency can be different number and time formats. Another problem is abbreviations and acronyms, which have to be transformed into a standard form.
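To make the discussion concrete, here is a minimal tokenizer sketch in Python, using only the standard library's re module. The single regular expression is an illustrative choice, not taken from the paper: it keeps words, internal hyphens, and numbers while discarding stand-alone punctuation and brackets.

import re

def tokenize(text):
    """Split a text stream into word tokens, stripping punctuation.

    Words (including hyphenated forms) and numbers are kept as tokens;
    stand-alone punctuation such as commas and brackets is discarded.
    """
    # \w+(?:-\w+)* matches words, optionally joined by internal hyphens
    return re.findall(r"\w+(?:-\w+)*", text.lower())

print(tokenize("Pre-processing (tokenization, etc.) is step 1!"))
# ['pre-processing', 'tokenization', 'etc', 'is', 'step', '1']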
Challenges in Tokenization

Challenges in tokenization depend on the type of language. Languages such as English and French are referred to as space-delimited, as most of the words are separated from each other by white spaces. Languages such as Chinese and Thai are referred to as unsegmented, as words do not have clear boundaries. Tokenizing unsegmented language sentences requires additional lexical and morphological information. Tokenization is also affected by the writing system and the typographical structure of the words. The structure of languages can be grouped into three categories:

Isolating: Words do not divide into smaller units. Example: Mandarin Chinese.

Agglutinative: Words divide into smaller units. Example: Japanese, Tamil.

Inflectional: Boundaries between morphemes are unclear and ambiguous in terms of grammatical meaning. Example: Latin.

III. Stop Word Removal

Many words in documents recur very frequently but are essentially meaningless, as they are used to join words together in a sentence. It is commonly understood that stop words do not contribute to the context or content of textual documents. Due to their high frequency of occurrence, their presence in text mining presents an obstacle to understanding the content of the documents.

Stop words are very frequently used common words like 'and', 'are', 'this', etc. They are not useful in the classification of documents, so they must be removed. However, the development of such a stop word list is difficult and inconsistent between textual sources. This process also reduces the text data and improves system performance. Every text document deals with these words, which are not necessary for text mining applications.
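A minimal stop word filter might look as follows, assuming tokens have already been produced by a tokenizer such as the one sketched above. The stop word list here is a tiny illustrative sample; real systems use much larger lists, which, as noted, vary between sources.

# A small illustrative stop word list; production systems typically use
# much larger lists (hundreds of words), which differ between sources.
STOP_WORDS = {"and", "are", "this", "is", "the", "of", "a", "an", "in", "to"}

def remove_stop_words(tokens):
    """Filter out high-frequency function words from a token list."""
    return [t for t in tokens if t not in STOP_WORDS]

tokens = ["this", "is", "the", "retrieval", "of", "text", "documents"]
print(remove_stop_words(tokens))
# ['retrieval', 'text', 'documents']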
IV. Stemming

Stemming is the process of conflating the variant forms of a word into a common representation, the stem. For example, the words "presentation", "presented", and "presenting" could all be reduced to a common representation "present". This is a widely used procedure in text processing for information retrieval (IR), based on the assumption that posing a query with the term presenting implies an interest in documents containing the words presentation and presented.

Errors in Stemming

There are mainly two errors in stemming:
1. over-stemming
2. under-stemming

Over-stemming is when two words with different stems are stemmed to the same root. This is also known as a false positive. Under-stemming is when two words that should be stemmed to the same root are not. This is also known as a false negative.

TYPES OF STEMMING ALGORITHMS

i) Table Look Up Approach

One method to do stemming is to store a table of all index terms and their stems. Terms from the queries and indexes could then be stemmed via table lookup, using B-trees or hash tables. Such lookups are very fast, but there are problems with this approach. First, there is no such data for English; even if there were, many terms might not be represented, because they are domain specific and would require some other stemming method. The second issue is storage overhead.
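The following sketch illustrates the table look up approach with a Python dict (a hash table). The three entries are a hypothetical sample standing in for a full term-to-stem table, which, as noted above, does not exist for English.

# A minimal sketch of the table look up approach; the stem table below is
# a tiny hypothetical sample, not a real resource.
STEM_TABLE = {
    "presentation": "present",
    "presented": "present",
    "presenting": "present",
}

def stem_by_lookup(term):
    """Return the stem recorded for a term, or the term itself if absent."""
    # A Python dict is a hash table, so each lookup is O(1) on average.
    return STEM_TABLE.get(term, term)

print(stem_by_lookup("presenting"))  # present
print(stem_by_lookup("reading"))     # reading (not in the table)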
ii) Successor Variety

Successor variety stemmers are based on structural linguistics, which determines word and morpheme boundaries based on the distribution of phonemes. The successor variety of a string is the number of distinct characters that follow it in the words of some body of text. For example, consider a body of text consisting of the following words:

able, ape, beatable, finable, read, readable, reading, reads, red, rope, ripe

Let us determine the successor variety for the word read. The first letter of read is R. R is followed in the text body by 3 distinct characters E, I, O; thus the successor variety of R is 3. The next successor variety for read is 2, since A and D follow RE in the text body, and so on. The following table shows the complete successor variety for the word read.

Prefix   Successor Variety   Letters
R        3                   E, I, O
RE       2                   A, D
REA      1                   D
READ     3                   A, I, S

Table 1.1 Successor variety for the word read

Once the successor variety for a given word is determined, this information is used to segment the word. Hafer and Weiss discussed ways of doing this (a sketch that computes these values appears after the list):

1. Cut Off Method: Some cutoff value is selected, and a boundary is identified whenever the cutoff value is reached.

2. Peak and Plateau Method: A segment break is made after a character whose successor variety exceeds that of the characters immediately preceding and following it.

3. Complete Word Method: A break is made after a segment if the segment is a complete word in the corpus.
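The sketch referenced above computes successor varieties over the example corpus from the text and reproduces Table 1.1.

# A minimal sketch that reproduces Table 1.1: the successor variety of
# each prefix of "read" over the example corpus given in the text.
CORPUS = ["able", "ape", "beatable", "finable", "read", "readable",
          "reading", "reads", "red", "rope", "ripe"]

def successor_variety(prefix, corpus):
    """Return the set of distinct characters that follow prefix in corpus."""
    return {w[len(prefix)] for w in corpus
            if w.startswith(prefix) and len(w) > len(prefix)}

word = "read"
for i in range(1, len(word) + 1):
    prefix = word[:i]
    letters = successor_variety(prefix, CORPUS)
    print(prefix.upper(), len(letters), sorted(c.upper() for c in letters))
# R 3 ['E', 'I', 'O']
# RE 2 ['A', 'D']
# REA 1 ['D']
# READ 3 ['A', 'I', 'S']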
iii) N-Gram Stemmers

This method was designed by Adamson and Boreham. It is called the shared digram method; a digram is a pair of consecutive letters. It is also called the n-gram method, since trigrams or longer n-grams could be used instead. In this method, association measures are calculated between pairs of terms based on their shared unique digrams.

For example, consider the two words stemming and stemmer:

stemming => st te em mm mi in ng
stemmer  => st te em mm me er

In this example, the word stemming has 7 unique digrams and stemmer has 6 unique digrams; the two words share 4 unique digrams: st, te, em, mm. Once the number of unique digrams is found, a similarity measure based on the unique digrams is calculated using the Dice coefficient, defined as

S = 2C / (A + B)

where C is the number of common unique digrams, A is the number of unique digrams in the first word, and B is the number of unique digrams in the second word. Here, S = 2(4) / (7 + 6) ≈ 0.62. Similarity measures are determined for all pairs of terms in the database, forming a similarity matrix. Once such a similarity matrix is available, the terms are clustered using a single-link clustering method.
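A minimal sketch of the shared digram measure, verifying the stemming/stemmer example above:

def digrams(word):
    """Return the set of unique digrams (adjacent letter pairs) in word."""
    return {word[i:i + 2] for i in range(len(word) - 1)}

def dice(a, b):
    """Dice coefficient S = 2C / (A + B) over unique digrams."""
    da, db = digrams(a), digrams(b)
    return 2 * len(da & db) / (len(da) + len(db))

print(sorted(digrams("stemming") & digrams("stemmer")))
# ['em', 'mm', 'st', 'te']
print(round(dice("stemming", "stemmer"), 2))  # 0.62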

iv) Affix Removal Stemmers

Affix removal stemmers remove suffixes or prefixes from terms, leaving the stem. One example of an affix removal stemmer is one that removes the plural forms of terms. A set of rules for such a stemmer is as follows (Harman):

a) If a word ends in "ies" but not "eies" or "aies"
   then "ies" -> "y"
b) If a word ends in "es" but not "aes", "ees", or "oes"
   then "es" -> "e"
c) If a word ends in "s" but not "us" or "ss"
   then "s" -> NULL
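The following sketch implements the three rules above, applied in order with only the first matching rule firing (an assumption about rule ordering that the text does not state explicitly):

def s_stemmer(word):
    """Strip common English plural endings using the three rules above."""
    if word.endswith("ies") and not word.endswith(("eies", "aies")):
        return word[:-3] + "y"
    if word.endswith("es") and not word.endswith(("aes", "ees", "oes")):
        return word[:-1]
    if word.endswith("s") and not word.endswith(("us", "ss")):
        return word[:-1]
    return word

for w in ["skies", "horses", "cats", "focus", "caress"]:
    print(w, "->", s_stemmer(w))
# skies -> sky, horses -> horse, cats -> cat, focus -> focus, caress -> caress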

V. Conclusion

In this work we have presented efficient preprocessing techniques. These preprocessing techniques eliminate noise from text data, then identify the root forms of the actual words, and reduce the size of the text data. This improves the performance of the IR system.