
Text Mining: Seminar Submitted by

Text mining is the analysis of natural language text to extract useful information. It differs from data mining as it operates on unstructured text documents rather than structured data sets. The text mining process involves steps like filtering, segmentation, stemming, eliminating excessive words and basic text analysis. Text mining has applications in areas like call centers, anti-spam, market intelligence and mining the web. Some challenges are that text is unstructured and natural language requires skill to analyze large document collections.

TEXT MINING

seminar submitted by:


Ali Abdul_Zahraa
Msc,MathcompUOK
[email protected]
Outline
Introduction
Data Mining vs Text Mining
Text Mining Process
Text Mining Applications
Challenges in Text Mining
Conclusion
Introduction
• What is Text Mining?
– Text mining is the analysis of data contained in
natural language text
Introduction
• Why Text Mining?
– A massive amount of new information is being
created: the world’s data doubles every 18 months
(Jacques Vallee, Ph.D.)
– 80-90% of all data is held in various
unstructured formats
– Useful information can be derived from this
unstructured data
Unstructured Data Examples “Ore”

• Email
• Insurance claims
• News articles
• Web pages
• Patent portfolios
• Customer complaint letters
• Contracts
• Transcripts of phone calls with customers
• Technical documents
Reasons for Text Mining
[Bar chart: percentage of data held as collections of text versus structured data]
How Text Mining Differs from Data Mining
Data Mining
• Identify data sets
• Select features
• Prepare data
• Analyze distribution
Text Mining
• Identify documents
• Extract features
• Select features by algorithm
• Prepare data
• Analyze distribution
Mining

• Filtering: remove punctuation and special characters.
• Segmentation: segment the document into words.
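The two steps above can be sketched in Python. This is only a minimal illustration, not the method used in the seminar: the function names filter_text and segment are hypothetical, and lowercasing and splitting on whitespace are assumptions added here.

```python
import re

def filter_text(text):
    """Filtering: replace punctuation and special characters with spaces."""
    # Keep letters, digits and whitespace; anything else (and "_") becomes a space.
    return re.sub(r"[^\w\s]|_", " ", text)

def segment(text):
    """Segmentation: split the filtered text into lowercase words."""
    # Lowercasing is an assumption of this sketch, not a step named in the slides.
    return filter_text(text).lower().split()

words = segment("Text mining, in short: analysis of text!")
print(words)
```

Replacing special characters with a space rather than deleting them keeps words such as "text-mining" from being fused into one token.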
Mining
Stemming: techniques used to find the root (stem) of a word.
– E.g.,
– user, users, used, using → stem (root): use
– engineering, engineered, engineer → stem (root): engineer
Usefulness
• improving the effectiveness of retrieval and text mining
– matching similar words
• reducing index size
– combining words with the same root may reduce index size by as much
as 40-50%.
Mining
• Basic stemming methods
• remove ending
– if a word ends with a consonant other than s,
followed by an s, then delete the s.
– if a word ends in es, drop the s.
– if a word ends in ing, delete the ing unless the remaining word consists only
of one letter or of th.
– if a word ends with ed, preceded by a consonant, delete the ed unless this
leaves only a single letter.
– …...
• transform words
– if a word ends with “ies” but not “eies” or “aies”, then “ies” → “y”
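The rules listed above can be written as a small function. This is a rough sketch under stated assumptions: the name simple_stem is hypothetical, the order in which the rules are tried is chosen here (the slides do not give one), and "consonant" is taken to mean any letter other than a, e, i, o, u.

```python
VOWELS = set("aeiou")

def simple_stem(word):
    """Apply the basic ending rules from the slide, first match wins."""
    w = word.lower()
    # Transform rule: "ies" -> "y", unless the word ends in "eies" or "aies".
    if w.endswith("ies") and not w.endswith(("eies", "aies")):
        return w[:-3] + "y"
    # Ends in "es": drop the final "s".
    if w.endswith("es"):
        return w[:-1]
    # Consonant other than "s", followed by "s": delete the "s".
    if len(w) >= 2 and w[-1] == "s" and w[-2] not in VOWELS and w[-2] != "s":
        return w[:-1]
    # Ends in "ing": delete it unless only one letter or "th" would remain.
    if w.endswith("ing"):
        rest = w[:-3]
        return rest if len(rest) > 1 and rest != "th" else w
    # Consonant + "ed": delete "ed" unless only a single letter would remain.
    if w.endswith("ed") and len(w) >= 3 and w[-3] not in VOWELS:
        rest = w[:-2]
        return rest if len(rest) > 1 else w
    return w

for example in ["users", "engineered", "engineering", "ponies", "thing"]:
    print(example, "->", simple_stem(example))
```

Rules this simple are crude: "using" and "used" both come out as "us" rather than "use", which is why practical systems use more elaborate stemmers.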
Mining
Eliminate excessive words: words that carry no meaning by themselves, such as prepositions, conjunctions, and conditional particles.

This is done by comparing each word against a list of these words.
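A minimal sketch of this list comparison follows. The word list STOP_WORDS is a tiny hypothetical sample chosen for illustration, and remove_stop_words is an assumed name; a real list would be far longer.

```python
# Hypothetical sample list: a few prepositions, conjunctions and conditional particles.
STOP_WORDS = {"a", "an", "the", "of", "in", "on", "at", "to",
              "and", "or", "but", "if", "unless"}

def remove_stop_words(words, stop_words=STOP_WORDS):
    """Keep only the words that are not on the stop list."""
    return [w for w in words if w.lower() not in stop_words]

print(remove_stop_words(["the", "analysis", "of", "text"]))
```

Storing the list as a set keeps each comparison at roughly constant cost, which matters for large document collections.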
Canonical Names
Variants found in the document:
• President Bush
• Mr. Bush
• George Bush
Canonical name: George Bush

• The canonical name is the most explicit, least ambiguous name constructed from the different variants found in the document.
• It reduces the ambiguity of variants.
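Once the variants are known, mapping them to the canonical name can be sketched as a table lookup. This assumes the variant table has already been built by hand; the names VARIANTS and canonicalize are hypothetical, and discovering the variants automatically is not shown.

```python
import re

# Assumed, hand-built table: every known variant maps to its canonical name.
VARIANTS = {
    "President Bush": "George Bush",
    "Mr. Bush": "George Bush",
    "George Bush": "George Bush",
}

def canonicalize(text, variants=VARIANTS):
    """Replace every known variant in the text with its canonical name."""
    # Try longer variants first so a short variant never pre-empts a longer one.
    ordered = sorted(variants, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(v) for v in ordered))
    return pattern.sub(lambda m: variants[m.group(0)], text)

print(canonicalize("President Bush met the press. Mr. Bush then left."))
```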
Mining
Clipping: eliminate words that appear with very high or very low frequency.
o Low-frequency words form small clusters that are not useful, and high-frequency words appear almost everywhere, so they are not useful either.
o There are many ways to calculate a word’s frequency in one or more documents.
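One of the many possible ways to measure frequency is document frequency, the number of documents a word appears in. The sketch below clips on that measure; the function name clip_vocabulary and the thresholds min_df and max_df_ratio are assumptions chosen for illustration, not values from the seminar.

```python
from collections import Counter

def clip_vocabulary(documents, min_df=2, max_df_ratio=0.9):
    """Return the words that are neither too rare nor too common.

    documents: a list of documents, each one a list of words.
    """
    n_docs = len(documents)
    df = Counter()
    for doc in documents:
        # Count each word once per document (document frequency).
        df.update(set(doc))
    return {w for w, c in df.items()
            if c >= min_df and c / n_docs <= max_df_ratio}

docs = [["text", "mining", "the"],
        ["data", "mining", "the"],
        ["text", "the", "web"]]
print(sorted(clip_vocabulary(docs)))
```

In this toy collection "the" is clipped because it appears in every document, and "data" and "web" are clipped because each appears in only one.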
Mining
Clustering: group interrelated documents
based on their topics.
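As a rough illustration only, documents can be grouped by how many words they share. The sketch below uses Jaccard similarity with a greedy single pass; this technique, the names jaccard and cluster_documents, and the threshold value are all assumptions of the example, since the slides do not name a clustering algorithm.

```python
def jaccard(a, b):
    """Share of distinct words that two documents have in common."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def cluster_documents(documents, threshold=0.3):
    """Greedy clustering: join the first cluster whose first document is similar enough."""
    clusters = []  # each cluster is a list of document indices
    for i, doc in enumerate(documents):
        for cluster in clusters:
            if jaccard(doc, documents[cluster[0]]) >= threshold:
                cluster.append(i)
                break
        else:
            # No similar cluster found: start a new one.
            clusters.append([i])
    return clusters

docs = [["text", "mining", "stemming"],
        ["text", "mining", "clustering"],
        ["football", "match", "goal"]]
print(cluster_documents(docs))
```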
Text Mining: Analysis

• Which words are most frequent?
• Which words are most interesting?
• Which words help define the document?
• What are the interesting text phrases?
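The first and third questions can be approached with simple counting. The sketch below counts the most frequent words and, as one common way to score words that help define a document, computes TF-IDF weights; TF-IDF is an assumption brought in for this example and is not named in the slides, and the function names are hypothetical.

```python
import math
from collections import Counter

def most_frequent(words, n=3):
    """The n most frequent words with their counts."""
    return Counter(words).most_common(n)

def tf_idf(documents):
    """One dict per document mapping each word to its TF-IDF weight."""
    n_docs = len(documents)
    df = Counter()
    for doc in documents:
        df.update(set(doc))  # document frequency of each word
    scores = []
    for doc in documents:
        tf = Counter(doc)
        # Term frequency times log of inverse document frequency.
        scores.append({w: (c / len(doc)) * math.log(n_docs / df[w])
                       for w, c in tf.items()})
    return scores

docs = [["text", "mining", "text"], ["data", "mining"]]
print(most_frequent(docs[0], n=1))
print(tf_idf(docs)[0])
```

A word that occurs in every document, such as "mining" here, gets weight zero, while a word concentrated in one document scores higher.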
Text mining applications
• Call Center Software.
• Anti-Spam.
• Market Intelligence.
• Web mining.
Actual examples
• A clinical center in the USA was able to
identify one of the genes responsible for
a harmful disease by processing more than
150,000 newspapers.
• Text mining of the Holy Quran.
• Etc.
Challenges in Text Mining
• Information is in unstructured textual form, written
in natural language (NL).
• It is not readily accessible to computers.
• Huge collections of documents must be handled.
• A skilled person is required to choose which documents
to process and to analyze the output.
• It requires more time.
• Cost: $50,000 for the software alone.
More information
• The Central Intelligence Agency (CIA) is the strongest
supporter of text mining.
- The events of September 11.
- Mining of e-mail, chat rooms, and social
networks.
- It therefore supports many companies, such as
Attensity, Inxight, and Intelliseek.
More information
• SPSS company statistics: users of text mining
software are very few compared with users of data
mining software.
Conclusion
• Finally, most sources indicate that the field of text
mining is still in the research phase
• and that its applications are still limited in
practice at present,
• but the possibilities it offers, helping to
understand huge amounts of text and to extract
the important and useful information from them,
are promising in many areas.
