


Preprocessing Techniques for Text Mining

Conference Paper · October 2014

Dr. S. Kannan
Associate Professor, Department of Computer Applications, Madurai Kamaraj University
[email protected]

Vairaprakash Gurusamy
Research Scholar, Department of Computer Applications, Madurai Kamaraj University
[email protected]

Abstract

Preprocessing is an important task and critical step in Text Mining, Natural Language Processing (NLP), and Information Retrieval (IR). In the area of Text Mining, data preprocessing is used to extract interesting, non-trivial knowledge from unstructured text data. Information Retrieval is essentially a matter of deciding which documents in a collection should be retrieved to satisfy a user's need for information. The user's need for information is represented by a query or profile and contains one or more search terms, plus some additional information such as the weight of the words. Hence, the retrieval decision is made by comparing the terms of the query with the index terms (important words or phrases) appearing in the document itself. The decision may be binary (retrieve/reject), or it may involve estimating the degree of relevance that the document has to the query. Unfortunately, the words that appear in documents and in queries often have many structural variants. So before information retrieval from the documents, data preprocessing techniques are applied to the target data set to reduce its size, which increases the effectiveness of the IR system. The objective of this study is to analyze the issues in preprocessing methods such as tokenization, stop word removal, and stemming for text documents.

Keywords: Text Mining, NLP, IR, Stemming
I. Introduction

Text preprocessing is an essential part of any NLP system, since the characters, words, and sentences identified at this stage are the fundamental units passed to all further processing stages, from analysis and tagging components, such as morphological analyzers and part-of-speech taggers, through applications, such as information retrieval and machine translation systems. It is a collection of activities in which text documents are pre-processed. Text data often contains special formats, such as number and date formats, as well as very common words that are unlikely to help text mining, such as prepositions, articles, and pronouns, which can be eliminated.

Need of Text Preprocessing in NLP System

1. To reduce the indexing (or data) file size of the text documents
   i) Stop words account for 20-30% of the total word count in a typical text document
   ii) Stemming may reduce indexing size by as much as 40-50%
2. To improve the efficiency and effectiveness of the IR system
   i) Stop words are not useful for searching or text mining, and they may confuse the retrieval system
   ii) Stemming is used for matching similar words in a text document

II. Tokenization

Tokenization is the process of breaking a stream of text into words, phrases, symbols, or other meaningful elements called tokens. The aim of tokenization is the exploration of the words in a sentence. The list of tokens becomes input for further processing such as parsing or text mining. Tokenization is useful both in linguistics (where it is a form of text segmentation) and in computer science, where it forms part of lexical analysis. Textual data is only a block of characters at the beginning. All processes in information retrieval require the words of the data set. Hence, the requirement for a parser is a tokenization of documents. This may sound trivial, as the text is already stored in machine-readable formats. Nevertheless, some problems are still left, like the removal of punctuation marks. Other characters, like brackets and hyphens, require processing as well. Furthermore, a tokenizer can cater for consistency in the documents. The main use of tokenization is identifying the meaningful keywords. One inconsistency can be different number and time formats. Another problem is abbreviations and acronyms, which have to be transformed into a standard form.
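To make the discussion concrete, here is a minimal tokenizer sketch in Python, using only the standard library's re module. The single regular expression is an illustrative choice, not taken from the paper: it keeps words, internal hyphens, and numbers while discarding stand-alone punctuation and brackets.

import re

def tokenize(text):
    """Split a text stream into word tokens, stripping punctuation.

    Words (including hyphenated forms) and numbers are kept as tokens;
    stand-alone punctuation such as commas and brackets is discarded.
    """
    # \w+(?:-\w+)* matches words, optionally joined by internal hyphens
    return re.findall(r"\w+(?:-\w+)*", text.lower())

print(tokenize("Pre-processing (tokenization, etc.) is step 1!"))
# ['pre-processing', 'tokenization', 'etc', 'is', 'step', '1']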
Challenges in Tokenization

Challenges in tokenization depend on the type of language. Languages such as English and French are referred to as space-delimited, as most of the words are separated from each other by white spaces. Languages such as Chinese and Thai are referred to as unsegmented, as words do not have clear boundaries. Tokenizing unsegmented language sentences requires additional lexical and morphological information. Tokenization is also affected by the writing system and the typographical structure of the words. The structure of languages can be grouped into three categories:

Isolating: Words do not divide into smaller units. Example: Mandarin Chinese.

Agglutinative: Words divide into smaller units. Example: Japanese, Tamil.

Inflectional: Boundaries between morphemes are unclear and ambiguous in terms of grammatical meaning. Example: Latin.

III. Stop Word Removal

Many words in documents recur very frequently but are essentially meaningless, as they are used to join words together in a sentence. It is commonly understood that stop words do not contribute to the context or content of textual documents. Due to their high frequency of occurrence, their presence in text mining presents an obstacle to understanding the content of the documents.

Stop words are very frequently used common words like 'and', 'are', 'this', etc. They are not useful in the classification of documents, so they must be removed. However, the development of such a stop word list is difficult and inconsistent between textual sources. This process also reduces the text data and improves system performance. Every text document deals with these words, which are not necessary for text mining applications.
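A minimal stop word filter might look as follows, assuming tokens have already been produced by a tokenizer such as the one sketched above. The stop word list here is a tiny illustrative sample; real systems use much larger lists, which, as noted, vary between sources.

# A small illustrative stop word list; production systems typically use
# much larger lists (hundreds of words), which differ between sources.
STOP_WORDS = {"and", "are", "this", "is", "the", "of", "a", "an", "in", "to"}

def remove_stop_words(tokens):
    """Filter out high-frequency function words from a token list."""
    return [t for t in tokens if t not in STOP_WORDS]

tokens = ["this", "is", "the", "retrieval", "of", "text", "documents"]
print(remove_stop_words(tokens))
# ['retrieval', 'text', 'documents']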
IV. Stemming

Stemming is the process of conflating the variant forms of a word into a common representation, the stem. For example, the words "presentation", "presented", and "presenting" could all be reduced to a common representation "present". This is a widely used procedure in text processing for information retrieval (IR), based on the assumption that posing a query with the term presenting implies an interest in documents containing the words presentation and presented.

Errors in Stemming

There are mainly two errors in stemming:
1. over-stemming
2. under-stemming

Over-stemming is when two words with different stems are stemmed to the same root. This is also known as a false positive. Under-stemming is when two words that should be stemmed to the same root are not. This is also known as a false negative.

TYPES OF STEMMING ALGORITHMS

i) Table Look Up Approach

One method to do stemming is to store a table of all index terms and their stems. Terms from the queries and indexes could then be stemmed via table lookup, using B-trees or hash tables. Such lookups are very fast, but there are problems with this approach. First, there is no such data for English; even if there were, many terms might not be represented, because they are domain specific and would require some other stemming method. The second issue is storage overhead.
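The following sketch illustrates the table look up approach with a Python dict (a hash table). The three entries are a hypothetical sample standing in for a full term-to-stem table, which, as noted above, does not exist for English.

# A minimal sketch of the table look up approach; the stem table below is
# a tiny hypothetical sample, not a real resource.
STEM_TABLE = {
    "presentation": "present",
    "presented": "present",
    "presenting": "present",
}

def stem_by_lookup(term):
    """Return the stem recorded for a term, or the term itself if absent."""
    # A Python dict is a hash table, so each lookup is O(1) on average.
    return STEM_TABLE.get(term, term)

print(stem_by_lookup("presenting"))  # present
print(stem_by_lookup("reading"))     # reading (not in the table)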
ii) Successor Variety

Successor variety stemmers are based on structural linguistics, which determines word and morpheme boundaries based on the distribution of phonemes. The successor variety of a string is the number of distinct characters that follow it in the words of some body of text. For example, consider a body of text consisting of the following words:

able, ape, beatable, finable, read, readable, reading, reads, red, rope, ripe

Let us determine the successor variety for the word read. The first letter of read is R. R is followed in the text body by 3 distinct characters E, I, O; thus the successor variety of R is 3. The next successor variety for read is 2, since A and D follow RE in the text body, and so on. The following table shows the complete successor variety for the word read.

Prefix   Successor Variety   Letters
R        3                   E, I, O
RE       2                   A, D
REA      1                   D
READ     3                   A, I, S

Table 1.1 Successor variety for the word read

Once the successor variety for a given word is determined, this information is used to segment the word. Hafer and Weiss discussed ways of doing this (a sketch that computes these values appears after the list):

1. Cut Off Method: Some cutoff value is selected, and a boundary is identified whenever the cutoff value is reached.

2. Peak and Plateau Method: A segment break is made after a character whose successor variety exceeds that of the characters immediately preceding and following it.

3. Complete Word Method: A break is made after a segment if the segment is a complete word in the corpus.
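The sketch referenced above computes successor varieties over the example corpus from the text and reproduces Table 1.1.

# A minimal sketch that reproduces Table 1.1: the successor variety of
# each prefix of "read" over the example corpus given in the text.
CORPUS = ["able", "ape", "beatable", "finable", "read", "readable",
          "reading", "reads", "red", "rope", "ripe"]

def successor_variety(prefix, corpus):
    """Return the set of distinct characters that follow prefix in corpus."""
    return {w[len(prefix)] for w in corpus
            if w.startswith(prefix) and len(w) > len(prefix)}

word = "read"
for i in range(1, len(word) + 1):
    prefix = word[:i]
    letters = successor_variety(prefix, CORPUS)
    print(prefix.upper(), len(letters), sorted(c.upper() for c in letters))
# R 3 ['E', 'I', 'O']
# RE 2 ['A', 'D']
# REA 1 ['D']
# READ 3 ['A', 'I', 'S']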
iii) N-Gram Stemmers

This method was designed by Adamson and Boreham. It is called the shared digram method; a digram is a pair of consecutive letters. It is also called the n-gram method, since trigrams or longer n-grams could be used instead. In this method, association measures are calculated between pairs of terms based on their shared unique digrams.

For example, consider the two words stemming and stemmer:

stemming => st te em mm mi in ng
stemmer  => st te em mm me er

In this example, the word stemming has 7 unique digrams and stemmer has 6 unique digrams; the two words share 4 unique digrams: st, te, em, mm. Once the number of unique digrams is found, a similarity measure based on the unique digrams is calculated using the Dice coefficient, defined as

S = 2C / (A + B)

where C is the number of common unique digrams, A is the number of unique digrams in the first word, and B is the number of unique digrams in the second word. Here, S = 2(4) / (7 + 6) ≈ 0.62. Similarity measures are determined for all pairs of terms in the database, forming a similarity matrix. Once such a similarity matrix is available, the terms are clustered using a single-link clustering method.
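A minimal sketch of the shared digram measure, verifying the stemming/stemmer example above:

def digrams(word):
    """Return the set of unique digrams (adjacent letter pairs) in word."""
    return {word[i:i + 2] for i in range(len(word) - 1)}

def dice(a, b):
    """Dice coefficient S = 2C / (A + B) over unique digrams."""
    da, db = digrams(a), digrams(b)
    return 2 * len(da & db) / (len(da) + len(db))

print(sorted(digrams("stemming") & digrams("stemmer")))
# ['em', 'mm', 'st', 'te']
print(round(dice("stemming", "stemmer"), 2))  # 0.62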

iv) Affix Removal Stemmers

Affix removal stemmers remove suffixes or prefixes from terms, leaving the stem. One example of an affix removal stemmer is one that removes the plural forms of terms. A set of rules for such a stemmer is as follows (Harman):

a) If a word ends in "ies" but not "eies" or "aies"
   then "ies" -> "y"
b) If a word ends in "es" but not "aes", "ees", or "oes"
   then "es" -> "e"
c) If a word ends in "s" but not "us" or "ss"
   then "s" -> NULL
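The following sketch implements the three rules above, applied in order with only the first matching rule firing (an assumption about rule ordering that the text does not state explicitly):

def s_stemmer(word):
    """Strip common English plural endings using the three rules above."""
    if word.endswith("ies") and not word.endswith(("eies", "aies")):
        return word[:-3] + "y"
    if word.endswith("es") and not word.endswith(("aes", "ees", "oes")):
        return word[:-1]
    if word.endswith("s") and not word.endswith(("us", "ss")):
        return word[:-1]
    return word

for w in ["skies", "horses", "cats", "focus", "caress"]:
    print(w, "->", s_stemmer(w))
# skies -> sky, horses -> horse, cats -> cat, focus -> focus, caress -> caress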

V. Conclusion

In this work we have presented efficient preprocessing techniques. These preprocessing techniques eliminate noise from text data, then identify the root forms of the actual words, and reduce the size of the text data. This improves the performance of the IR system.