Submitted by,
Gokul K
LE48MCA15
No:28
FISAT
 Defining Text Mining
 Structured vs. Unstructured Data
 Why Text Mining
 Some Text Mining Ambiguities
 Text Mining Practice Areas
 Pre-processing Techniques
 Challenges in Text Mining
 Conclusion
• The use of computational methods and techniques to
extract high quality information from text
• The discovery by computer of new, previously unknown
information, by automatically extracting information from a
usually large amount of different unstructured textual
resources
 We have a collection of documents (mainly text or
html-based)
 We have a set of users
 A user wants to retrieve the documents related to
a given concept
 The user then submits a query expressed
in words or terms
 An information retrieval system returns the
documents most related to this concept
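The retrieval scenario above can be sketched in a few lines. This is a minimal illustration only: the scoring rule (counting how often query terms occur in each document), the whitespace tokenizer, and the sample documents are all assumptions made for the example, not something the slides prescribe.

```python
from collections import Counter


def tokenize(text):
    # Lowercase and split on whitespace; deliberately simplistic.
    return text.lower().split()


def rank_documents(query, documents):
    """Return (score, doc_index) pairs, best match first."""
    query_terms = set(tokenize(query))
    scored = []
    for idx, doc in enumerate(documents):
        counts = Counter(tokenize(doc))
        # Score = total occurrences of the query terms in the document.
        score = sum(counts[term] for term in query_terms)
        scored.append((score, idx))
    # Highest score first; ties keep the original document order.
    return sorted(scored, key=lambda pair: (-pair[0], pair[1]))


# Hypothetical three-document collection.
docs = [
    "text mining extracts information from text",
    "the bank raised its interest rates",
    "information retrieval returns related documents",
]
ranking = rank_documents("text information", docs)
```

Real retrieval systems usually weight terms (for example with TF-IDF) rather than using raw counts; the raw count is used here only to keep the sketch short.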
Textmining
 Unstructured text is present in various forms, and
in huge and ever increasing quantities:
1. books
2. financial and other business reports
3. various kinds of business and
administrative documents
4. news articles
 It is estimated that ~80% of all the available data are
unstructured data
 To enable effective and efficient use of such huge
quantities of textual content, we need
computational methods for
1. automated extraction of information from
unstructured text
2. analysis and summarization of the extracted
information
 Text mining (TM) research and practice focus on the
development, continual improvement and
application of such methods
 Language is ambiguous
 Context is needed to clarify
 The same words can have different meaning
 Bear (verb) – to support or carry
 Bear (noun) – a large animal
 Different words can mean the same (synonyms)
 Language is subtle (difficult to analyse)
 Concept / word extraction usually results in a huge number of
dimensions
 Thousands of new fields
 Each field typically has low information content (sparse)
 Misspellings, abbreviations, spelling variants
 These render search engines, SQL queries, etc. ineffective.
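The point about many sparse dimensions can be made concrete with a small term-document matrix. The sketch below is an assumption-laden illustration: the helper names, the whitespace tokenization and the three sample sentences are hypothetical, chosen only to show how quickly zeros dominate.

```python
def build_term_document_matrix(documents):
    """Return (vocabulary, matrix) with one row per document."""
    vocabulary = sorted({w for doc in documents for w in doc.lower().split()})
    index = {word: i for i, word in enumerate(vocabulary)}
    matrix = []
    for doc in documents:
        row = [0] * len(vocabulary)  # one dimension ("field") per unique word
        for word in doc.lower().split():
            row[index[word]] += 1
        matrix.append(row)
    return vocabulary, matrix


def sparsity(matrix):
    """Fraction of cells in the matrix that are zero."""
    cells = sum(len(row) for row in matrix)
    zeros = sum(1 for row in matrix for value in row if value == 0)
    return zeros / cells


# Hypothetical mini-corpus.
docs = [
    "the bank raised its interest rates",
    "mary walked along the bank of the river",
    "the store is next to the bank",
]
vocabulary, matrix = build_term_document_matrix(docs)
zero_fraction = sparsity(matrix)
```

Even with three short sentences, more than half of the cells are zero; with thousands of documents the vocabulary grows and the share of zeros typically rises much further.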
 Homonymy: same word, different meaning
Mary walked along the bank of the river
HarborBank is the richest bank in the city
 Synonymy: different words with the same or similar
meaning; one word can often be substituted for the other
without changing the meaning.
Miss Nelson became a kind of big sister to Benjamin
Miss Nelson became a kind of large sister to Benjamin.
(Here the substitution fails: synonyms are rarely
interchangeable in every context.)
 Polysemy: same word or form, but different,
albeit related meaning
The bank raised its interest rates yesterday
The store is next to the newly constructed bank
The bank appeared first in Italy in the Renaissance
 Hyponymy: Concept hierarchy or subclass
Animal (noun) – cat, dog
Injury – broken leg, contusion
 Search and Information Retrieval – storage and
retrieval of text documents, including search
engines and keyword search
 Document Clustering – Grouping and categorizing
terms, snippets, paragraphs or documents using
clustering methods
 Document Classification – grouping and
categorizing snippets, paragraphs or documents
using data mining classification methods, based on
models trained on labelled examples
 Web Mining – data and text mining on the
internet, with a specific focus on the scale and
interconnectedness of the web
 Information Extraction – Identification and
extraction of relevant facts and relationships from
unstructured text
 Natural Language Processing – low-level language
processing and understanding tasks (e.g. part-of-speech
tagging)
 Concept extraction – Grouping of words and
phrases into semantically similar groups
 Document – a sequence of words and punctuation,
following the grammatical rules of the language.
 Term – usually a word, but can be a word-pair or
phrase
 Corpus – a collection of documents
 Lexicon – set of all unique words in corpus
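The four definitions above map directly onto simple data structures. In this sketch the function name `terms`, the parameter `n`, and the two-document corpus are hypothetical; a term is either a single word (n=1) or a word-pair (n=2).

```python
def terms(document, n=1):
    """Split a document into terms of n consecutive words."""
    words = document.lower().split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


# Corpus: a collection of documents (hypothetical examples).
corpus = ["text mining is fun", "text mining finds patterns"]

# Lexicon: the set of all unique words in the corpus.
lexicon = sorted({word for doc in corpus for word in terms(doc)})

# Terms as word-pairs for the first document.
word_pairs = terms(corpus[0], n=2)
```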
 Text Normalization
 Parts of Speech Tagging
 Removal of stop words
 Stop words – common words that don’t add
meaningful content to the document
 Stemming
 Removing suffixes and prefixes, leaving the root or stem of
the word.
 Tokenization
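Stop-word removal from the list above can be sketched as a simple filter. The stop-word set here is a tiny hypothetical sample; real lists are longer and depend on the language and the task.

```python
# Hypothetical, deliberately short stop-word list.
STOP_WORDS = {"a", "an", "and", "the", "is", "of", "to", "in", "that"}


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop tokens that carry little content of their own."""
    return [token for token in tokens if token.lower() not in stop_words]


tokens = "The lead paint is unsafe".split()
content_tokens = remove_stop_words(tokens)
```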
 Case
 Make all lower case (if you don’t care about proper
nouns, titles, etc)
 Clean up transcription and typing errors
 e.g. “do n’t”, “movei”
 Correct misspelled words
 Phonetically
 Use fuzzy matching algorithms such as Soundex,
Metaphone or string edit distance
 Dictionaries
 Use POS and context to make a good guess
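Of the fuzzy matching options named above, string edit distance is the easiest to sketch. The code below implements the standard Levenshtein distance and uses it to pick the closest dictionary word. The dictionary contents, the `max_distance` threshold and the function names are assumptions made for this example.

```python
def edit_distance(a, b):
    """Levenshtein distance: insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def correct(word, dictionary, max_distance=2):
    """Return the closest dictionary word, or the input if none is close."""
    word = word.lower()  # case normalization
    if word in dictionary:
        return word
    best = min(dictionary, key=lambda cand: (edit_distance(word, cand), cand))
    return best if edit_distance(word, best) <= max_distance else word


# Hypothetical dictionary.
DICTIONARY = {"movie", "mining", "text", "bank"}
fixed = correct("Movei", DICTIONARY)
```

Plain Levenshtein distance counts the swapped letters in "movei" as two edits; variants that treat a transposition as a single edit exist but are not shown here.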
 POS tagging is a process of assigning a POS or
lexical class marker to each word in a sentence
(and all sentences in a corpus).
 Input: the lead paint is unsafe
 Output: the/Det lead/N paint/N is/V
unsafe/Adj
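A toy tagger shows why context matters for the example above: "lead" and "paint" can each be a noun or a verb. The lexicon, the single disambiguation rule (prefer a non-verb reading right after a determiner) and the default tag are all assumptions for illustration; practical taggers are trained statistically on annotated corpora.

```python
# Hypothetical lexicon: each word lists its possible tags, first = default.
LEXICON = {
    "the": ["Det"],
    "lead": ["V", "N"],
    "paint": ["N", "V"],
    "is": ["V"],
    "unsafe": ["Adj"],
}


def tag(sentence, lexicon=LEXICON, default="N"):
    """Assign one tag per word using the lexicon and one context rule."""
    tagged = []
    previous_tag = None
    for word in sentence.lower().split():
        candidates = lexicon.get(word, [default])
        choice = candidates[0]
        # Context rule: directly after a determiner, avoid the verb reading.
        if previous_tag == "Det":
            non_verbs = [t for t in candidates if t != "V"]
            if non_verbs:
                choice = non_verbs[0]
        tagged.append((word, choice))
        previous_tag = choice
    return tagged


result = " ".join(f"{word}/{t}" for word, t in tag("the lead paint is unsafe"))
```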
 Tokenization is the process of breaking a stream
of text up into words, phrases, symbols, or other
meaningful elements called tokens.
 Converts streams of characters into words
 Tokens or words are separated by whitespace,
punctuation marks or line breaks.
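A regular expression is one common way to implement this kind of tokenization. The pattern below is an assumption chosen for the example: it keeps words (optionally with an apostrophe part, as in "didn't") and emits each punctuation mark as its own token.

```python
import re

# A word, optionally followed by an apostrophe part, or a single
# non-space, non-word character (punctuation).
TOKEN_PATTERN = re.compile(r"\w+(?:'\w+)?|[^\w\s]")


def tokenize(text):
    """Break a stream of characters into word and punctuation tokens."""
    return TOKEN_PATTERN.findall(text)


tokens = tokenize("Mary walked along the bank; she didn't stop.")
```

Whitespace and line breaks act only as separators and never appear in the output.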
 Normalizes / unifies variations of the same data
 ‘walking’, ‘walks’, ‘walked’ → walk
 Inflectional stemming
 Remove plurals
 Normalize verb tenses
 Remove other affixes
 Stemming to root
 Reduce word to most basic element
 More aggressive than inflectional
 ‘Apply’, ‘applications’, ‘reapplied’ → apply
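Inflectional stemming can be approximated by stripping a few common suffixes. The suffix list and the minimum stem length below are assumptions for this sketch; it is not the Porter algorithm or any other published stemmer, and it does not attempt the more aggressive stemming to root.

```python
# Hypothetical suffix list, checked in this order.
SUFFIXES = ("ing", "ed", "es", "s")


def inflectional_stem(word, min_stem=3):
    """Strip one common inflectional suffix, keeping at least min_stem letters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word


stems = [inflectional_stem(w) for w in ("walking", "walks", "walked")]
```

A rule this crude also makes mistakes (for example, it would turn "applies" into "appli"), which is why practical stemmers use larger rule sets or dictionaries.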
 The foremost problem in text mining is the ambiguity
of language, i.e. the capability of being understood in
two or more possible senses. One word or phrase
may have multiple meanings, which leads to ambiguity.
 In fields like bioinformatics, a single gene or protein
may have multiple names, which can also lead to
ambiguity.
 Another problem arises when we
use social media data such as status updates,
tweets, comments and reviews. Many people use
slang such as “btw” for “by the way” or “ppl” for
“people”. These words do not appear in the
dictionary, so they affect the mining
results.
 A further problem is cleaning the
data. When we extract online text, we also get the
reference addresses of the images linked with the
text, and those references are hard to remove.
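The last two challenges can be partly addressed in a cleaning step. The sketch below strips URL-like references and expands slang from a lookup table. The slang table, the URL pattern and the sample sentence are hypothetical, and the approach only handles the simple cases.

```python
import re

# Hypothetical slang table; a real one would be much larger.
SLANG = {"btw": "by the way", "ppl": "people"}

# Matches http:// or https:// followed by non-space characters.
URL_PATTERN = re.compile(r"https?://\S+")


def clean_social_text(text, slang=SLANG):
    """Remove URL references, then expand known slang words."""
    text = URL_PATTERN.sub(" ", text)
    words = [slang.get(word.lower(), word) for word in text.split()]
    return " ".join(words)


cleaned = clean_social_text(
    "btw ppl love this https://example.com/img.png so much"
)
```

Slang with punctuation attached (such as "btw,") is not matched by this lookup, and image references that are not plain URLs would need additional patterns.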
Text analysis is a useful technique
for deriving results from textual data. With
text mining techniques we can extract public
reviews, classify text into predefined classes,
summarize documents, and group or
cluster multiple documents.
P-glycoprotein pamphlet: iteration 4 of 4 final
bs22n2s
 
UNIT 3 NATIONAL HEALTH PROGRAMMEE. SOCIAL AND PREVENTIVE PHARMACY
UNIT 3 NATIONAL HEALTH PROGRAMMEE. SOCIAL AND PREVENTIVE PHARMACYUNIT 3 NATIONAL HEALTH PROGRAMMEE. SOCIAL AND PREVENTIVE PHARMACY
UNIT 3 NATIONAL HEALTH PROGRAMMEE. SOCIAL AND PREVENTIVE PHARMACY
DR.PRISCILLA MARY J
 
pulse ppt.pptx Types of pulse , characteristics of pulse , Alteration of pulse
pulse  ppt.pptx Types of pulse , characteristics of pulse , Alteration of pulsepulse  ppt.pptx Types of pulse , characteristics of pulse , Alteration of pulse
pulse ppt.pptx Types of pulse , characteristics of pulse , Alteration of pulse
sushreesangita003
 

Textmining

  • 2. Defining Text Mining; Structured vs. Unstructured Data; Why Text Mining; Some Text Mining Ambiguities; Text Mining Practice Areas; Pre-processing Techniques; Challenges in Text Mining; Conclusion
  • 3. Text mining is the use of computational methods and techniques to extract high-quality information from text. It is also the discovery by computer of new, previously unknown information, by automatically extracting information from a usually large amount of different unstructured textual resources.
  • 4. We have a collection of documents (mainly text or HTML-based). We have a set of users. A user wants to retrieve the documents related to a given concept. The user therefore submits a query expressed through words or terms. An information retrieval system returns the documents most related to this concept.
  • 6. Unstructured text is present in various forms, and in huge and ever-increasing quantities: (1) books; (2) financial and other business reports; (3) various kinds of business and administrative documents; (4) news articles. It is estimated that ~80% of all available data is unstructured.
  • 7. To enable effective and efficient use of such huge quantities of textual content, we need computational methods for (1) automated extraction of information from unstructured text and (2) analysis and summarization of the extracted information. Text mining (TM) research and practice focus on the development, continual improvement and application of such methods.
  • 8. Language is ambiguous, and context is needed to clarify it. The same word can have different meanings: bear (verb) – to support or carry; bear (noun) – a large animal. Different words can mean the same thing (synonyms). Language is subtle (difficult to analyse). Concept / word extraction usually results in a huge number of dimensions: thousands of new fields, each of which typically has low information content (sparse). Misspellings, abbreviations and spelling variants render search engines, SQL queries, etc. ineffective.
  • 9. Homonymy: same word, different meaning. "Mary walked along the bank of the river." "HarborBank is the richest bank in the city." Synonymy: different words with similar or the same meaning; one word can often be substituted for the other without changing the meaning, though not in every context. "Miss Nelson became a kind of big sister to Benjamin." "Miss Nelson became a kind of large sister to Benjamin."
  • 10. Polysemy: same word or form, but different, albeit related, meanings. "The bank raised its interest rates yesterday." "The store is next to the newly constructed bank." "The bank appeared first in Italy in the Renaissance." Hyponymy: concept hierarchy or subclass. Animal (noun) – cat, dog. Injury – broken leg, contusion.
  • 11. Search and Information Retrieval – storage and retrieval of text documents, including search engines and keyword search. Document Clustering – grouping and categorizing terms, snippets, paragraphs or documents using clustering methods. Document Classification – grouping and categorizing snippets, paragraphs or documents using data mining classification methods, based on models trained on labelled examples. Web Mining – data and text mining on the internet, with specific focus on the scale and interconnectedness of the web.
  • 12. Information Extraction – identification and extraction of relevant facts and relationships from unstructured text. Natural Language Processing – low-level language processing and understanding tasks (e.g. part-of-speech tagging). Concept Extraction – grouping of words and phrases into semantically similar groups.
  • 13. Document – a sequence of words and punctuation, following the grammatical rules of the language. Term – usually a word, but can be a word-pair or phrase. Corpus – a collection of documents. Lexicon – the set of all unique words in the corpus.
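The four definitions above can be sketched in a few lines of Python. This is only an illustration: the three-document corpus and the helper names `build_lexicon` and `term_counts` are assumptions made here, and terms are taken to be single whitespace-separated words.

```python
from collections import Counter

# Hypothetical three-document corpus, used only for illustration.
corpus = [
    "the bank raised its interest rates",
    "the store is next to the bank",
    "mary walked along the bank of the river",
]

def build_lexicon(documents):
    """Return the sorted set of unique lowercase words across all documents."""
    return sorted({word for doc in documents for word in doc.lower().split()})

def term_counts(document, lexicon):
    """Count how often each lexicon term occurs in one document."""
    counts = Counter(document.lower().split())
    return [counts[term] for term in lexicon]

lexicon = build_lexicon(corpus)
# One row per document, one column (dimension) per lexicon term.
matrix = [term_counts(doc, lexicon) for doc in corpus]
# Share of cells that are zero: a rough measure of how sparse the matrix is.
zero_share = sum(row.count(0) for row in matrix) / (len(matrix) * len(lexicon))
```

Even in this tiny corpus more than half of the cells are zero, which is the sparsity that word extraction tends to produce at scale.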
  • 14. Text Normalization. Parts of Speech Tagging. Removal of stop words: stop words are common words that don't add meaningful content to the document. Stemming: removing suffixes and prefixes, leaving the root or stem of the word. Tokenization.
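Stop-word removal can be sketched as a simple filter. The `STOP_WORDS` set below is a small illustrative assumption, not a standard list; real stop-word lists are much longer and depend on the language and the task.

```python
# Tiny illustrative stop-word set (an assumption, not a standard list).
STOP_WORDS = {"the", "is", "a", "an", "of", "to", "and", "in"}

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Keep only the tokens that are not stop words (case-insensitive)."""
    return [token for token in tokens if token.lower() not in stop_words]

filtered = remove_stop_words("the lead paint is unsafe".split())
```

Here `filtered` keeps the content-bearing words "lead", "paint" and "unsafe".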
  • 16. Case: make all text lower case (if you don't care about proper nouns, titles, etc.). Clean up transcription and typing errors, e.g. "do n't", "movei". Correct misspelled words: phonetically, using fuzzy matching algorithms such as Soundex, Metaphone or string edit distance; with dictionaries, using POS and context to make a good guess.
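As a minimal sketch of the string edit distance option, the following computes the Levenshtein distance (insertions, deletions and substitutions each cost 1) and picks the nearest dictionary entry. The function names and the three-word dictionary are assumptions for illustration; Soundex and Metaphone are not shown.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b (row-by-row dynamic programming)."""
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]

def closest_word(word, dictionary):
    """Return the dictionary entry with the smallest edit distance to word."""
    return min(dictionary, key=lambda candidate: edit_distance(word, candidate))

# Hypothetical dictionary; "movei" is the typing error mentioned on the slide.
suggestion = closest_word("movei", ["movie", "mining", "text"])
```

Plain Levenshtein distance counts the swapped letters in "movei" as two edits. A variant that treats transpositions as a single edit would rank such typos closer, at the cost of a slightly more involved recurrence.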
  • 17. POS tagging is the process of assigning a part-of-speech (POS) or lexical class marker to each word in a sentence (and to all sentences in a corpus). Input: the lead paint is unsafe. Output: the/Det lead/N paint/N is/V unsafe/Adj.
  • 18. Tokenization is the process of breaking a stream of text up into words, phrases, symbols, or other meaningful elements called tokens. It converts streams of characters into words. Tokens or words are separated by whitespace, punctuation marks or line breaks.
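A minimal tokenizer along these lines can be written with a regular expression. The pattern below is an assumption chosen for illustration: it keeps runs of ASCII letters and digits, allows one internal apostrophe so that "don't" stays a single token, and treats everything else (whitespace, punctuation, line breaks) as a separator.

```python
import re

# Letters/digits, optionally followed by an apostrophe and more letters.
TOKEN_PATTERN = re.compile(r"[A-Za-z0-9]+(?:'[A-Za-z]+)?")

def tokenize(text):
    """Split text into word tokens, discarding whitespace and punctuation."""
    return TOKEN_PATTERN.findall(text)

tokens = tokenize("The lead paint is unsafe.")
```

Production tokenizers handle many more cases (Unicode, hyphenation, URLs, emoticons), so this pattern should be read as a starting point only.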
  • 19. Stemming normalizes / unifies variations of the same data: 'walking', 'walks', 'walked' → walk. Inflectional stemming: remove plurals, normalize verb tenses, remove other affixes. Stemming to root: reduce the word to its most basic element; more aggressive than inflectional stemming: 'apply', 'applications', 'reapplied' → apply.
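Inflectional stemming can be approximated by stripping a few common suffixes. The sketch below is a deliberately naive assumption, not an established algorithm such as the Porter stemmer: it handles only "ing", "ed" and "s", and it keeps at least three characters of the stem so that short words such as "is" are left alone.

```python
def inflectional_stem(word):
    """Strip one common inflectional suffix, keeping a stem of at least 3 letters."""
    word = word.lower()
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

stems = [inflectional_stem(w) for w in ("walking", "walks", "walked")]
```

All three forms reduce to "walk". Mapping "applications" or "reapplied" to "apply" needs the more aggressive stemming-to-root step, which also removes prefixes and derivational suffixes; this sketch only turns "applications" into "application".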
  • 20. The foremost problem in text mining is the ambiguity of language, i.e. the capability of being understood in two or more possible senses. Because one word or phrase may have multiple meanings, its interpretation can be ambiguous. In fields like bioinformatics, there are multiple names for a single gene or protein, which may also lead to ambiguity.
  • 21. Another problem with text mining arises with social media data, i.e. status updates, tweets, comments, reviews, etc. Most people use slang such as "btw" for "by the way" or "ppl" for "people"; these words do not exist in the dictionary, so they affect the mining results. A further problem is cleaning the data: when we extract online texts, we also get the reference addresses of the images linked with the text, and those references are hard to remove.
  • 22. Text analysis is currently a compelling technique for deriving useful results from textual data. Using text mining techniques, we can extract public reviews, classify text into predefined classes, summarize documents, and group or cluster multiple documents.