TEXT MINING
BY
THEJESWINI
B.Tech CSE 3 Year
SUBCODE:XCSE65
SUBNAME:DATA MINING
CONTENTS
INTRODUCTION
DATA MINING vs TEXT MINING
AREAS OF TEXT MINING
INFORMATION RETRIEVAL
TEXT MINING PROCESS
TEXT MINING APPROACHES
CHALLENGES OF TEXT MINING
REFERENCES
INTRODUCTION
• Text databases are growing rapidly because many sources now generate data
as text.
• These sources include collections of documents such as news articles,
research papers, books, digital libraries, e-mail messages, and the World
Wide Web (which can also be viewed as a huge, interconnected, dynamic text
database). Many government and business institutions also store their data
as text.
• Understanding the patterns in this text and obtaining useful, reliable
information from it is the main motivation for text mining.
INTRODUCTION...(CONTD)
• Text mining is formally defined as the process of extracting relevant
information or patterns from sources that are in unstructured or
semi-structured format.
• Data stored in most text databases is semi-structured, i.e., it is
neither completely unstructured nor completely structured.
• For example, a document may contain a few structured fields, such as title,
authors, publication date, category, and so on, but also contain some
largely unstructured text components, such as the abstract and contents.
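The semi-structured document described above can be pictured as a record with a few typed fields plus free text. The following is only an illustrative sketch: the class name `Document` and all sample values are made up for this example and do not come from any real data set.

```python
from dataclasses import dataclass
from datetime import date
from typing import List


@dataclass
class Document:
    # Structured fields: fixed meaning, easy to query directly.
    title: str
    authors: List[str]
    publication_date: date
    category: str
    # Largely unstructured components: free text that needs text mining.
    abstract: str = ""
    contents: str = ""


# Hypothetical sample record (values invented for illustration).
doc = Document(
    title="An Example Paper",
    authors=["A. Author"],
    publication_date=date(2020, 1, 1),
    category="Data Mining",
    abstract="Free text that describes the paper in natural language.",
)

# Split the record into its structured and unstructured parts.
unstructured_names = ("abstract", "contents")
structured = {k: v for k, v in vars(doc).items() if k not in unstructured_names}
unstructured = {k: v for k, v in vars(doc).items() if k in unstructured_names}
```

The structured part can be filtered and sorted like any database row, while the unstructured part is what the later text mining steps operate on.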
DATA MINING vs TEXT MINING
DATA MINING | TEXT MINING
The process of finding patterns and extracting useful data from large data sets. | Applied to data from various text documents.
Applied to all types of data. | Applied to text data, which is mostly semi-structured or unstructured.
Data is processed directly. | Data is processed linguistically.
Statistical techniques are used to evaluate data. | Computational linguistic principles are used to evaluate text.
AREAS OF TEXT MINING
• IR (Information Retrieval): Query-based search over large collections of
text documents.
• NLP (Natural Language Processing): Developing NLP applications is difficult
because computers generally expect humans to "speak" to them in a programming
language that is precise, unambiguous, and highly structured. Human speech is
rarely that precise; it depends on many complex variables, including slang,
social context, and regional dialects.
• IE (Information Extraction): The automatic extraction of structured data,
such as entities, relationships between entities, and attributes describing
entities, from an unstructured source.
• Data Mining: The extraction of useful data and hidden patterns from large
data sets. Data mining tools can predict behaviors and future trends, allowing
businesses to make better data-driven decisions.
INFORMATION RETRIEVAL
Information retrieval is the task of retrieving information from a large
collection of text-based documents.
Because text information is so abundant, information retrieval has found
many applications. Information retrieval systems include:
-on-line library catalog systems,
-on-line document management systems, and
-the more recently developed Web search engines.
• A typical information retrieval problem is to locate relevant documents in a
document collection based on a user’s query, which is often a set of keywords
describing an information need.
INFORMATION RETRIEVAL…(CONTD)
1. BASIC MEASURES OF INFORMATION RETRIEVAL
There are two basic measures for assessing the quality of text retrieval:
Precision: the percentage of retrieved documents that are in fact relevant to
the query (i.e., “correct” responses). It is formally defined as
precision = |{Relevant} ∩ {Retrieved}| / |{Retrieved}|
Recall: the percentage of documents relevant to the query that were, in fact,
retrieved. It is formally defined as
recall = |{Relevant} ∩ {Retrieved}| / |{Relevant}|
One commonly used trade-off between the two is the F-score, which is defined as
the harmonic mean of recall and precision:
F-score = (recall × precision) / ((recall + precision) / 2)
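As a small worked example, the three measures can be computed directly from sets of document identifiers. This is a minimal sketch: the function name `precision_recall_f` and the identifiers `d1` to `d5` are made up for illustration, and returning 0.0 for an empty denominator is an assumption rather than part of the definitions.

```python
def precision_recall_f(relevant, retrieved):
    """Return (precision, recall, F-score) for two collections of doc ids."""
    relevant, retrieved = set(relevant), set(retrieved)
    hits = len(relevant & retrieved)  # |{Relevant} ∩ {Retrieved}|
    # Assumption: report 0.0 when a denominator would be zero.
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    mean = (recall + precision) / 2
    f_score = (recall * precision) / mean if mean else 0.0
    return precision, recall, f_score


# Hypothetical query result: 4 relevant documents, 3 retrieved, 2 in common.
relevant = {"d1", "d2", "d3", "d4"}
retrieved = {"d1", "d2", "d5"}
p, r, f = precision_recall_f(relevant, retrieved)
print(f"precision={p:.3f} recall={r:.3f} F-score={f:.3f}")
```

Here two of the three retrieved documents are relevant (precision 2/3) and two of the four relevant documents were found (recall 1/2), so the F-score is 4/7, which lies between the two values and closer to the smaller one.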
INFORMATION RETRIEVAL…(CONTD)
2. TEXT RETRIEVAL METHODS
Text documents can be retrieved by the following methods:
-Document selection method: The query specifies constraints for selecting
relevant documents. A typical method of this category is the “Boolean retrieval
model”, in which a document is represented by a set of keywords and the user
provides a Boolean expression of keywords,
e.g. “car and repair shops” or “tea or coffee”.
-Document ranking method: The query is used to rank all documents in order of
relevance. The goal is to approximate the degree of relevance of a document
with a score computed from information such as the frequency of words in the
document and in the whole collection.
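The two retrieval methods can be contrasted on a toy collection. This is a sketch under stated assumptions: the three documents are invented, the helper names `select_and`, `select_or`, and `rank` are hypothetical, and the TF-IDF weighting used for ranking is one common scoring choice, not something the slides prescribe.

```python
import math
from collections import Counter

# Hypothetical toy collection (texts invented for illustration).
docs = {
    "d1": "car repair shops near the city",
    "d2": "tea and coffee shops",
    "d3": "car dealers and car repair manuals",
}

# Document selection: each document is represented by a set of keywords.
keyword_sets = {doc_id: set(text.split()) for doc_id, text in docs.items()}


def select_and(*terms):
    """Boolean AND: documents that contain every term."""
    return sorted(d for d, kw in keyword_sets.items() if all(t in kw for t in terms))


def select_or(*terms):
    """Boolean OR: documents that contain at least one term."""
    return sorted(d for d, kw in keyword_sets.items() if any(t in kw for t in terms))


def rank(query):
    """Document ranking with a TF-IDF score (an assumed scoring function)."""
    counts = {d: Counter(text.split()) for d, text in docs.items()}
    n_docs = len(docs)
    scores = {}
    for doc_id, tf in counts.items():
        score = 0.0
        for term in query.split():
            # Document frequency: how many documents contain the term.
            df = sum(1 for c in counts.values() if term in c)
            if df:
                score += tf[term] * math.log(n_docs / df)
        scores[doc_id] = score
    return sorted(scores, key=lambda d: scores[d], reverse=True)


print(select_and("car", "repair"))
print(select_or("tea", "coffee"))
print(rank("car repair"))
```

Selection returns an unordered yes/no answer for each document, whereas ranking orders every document, so a document that mentions "car" twice scores above one that mentions it once.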
INFORMATION RETRIEVAL…(CONTD)
• The first step in most retrieval systems is to identify keywords for
representing documents, a preprocessing step often called tokenization. To
avoid indexing useless words, a text retrieval system often associates a
“stop list” with a set of documents: a list of words, such as “a”, “the”, and
“of”, that are deemed irrelevant and are not indexed.
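Tokenization with a stop list can be sketched in a few lines. The stop list below is a tiny hand-picked set chosen for illustration, and the regular expression is a deliberately simple tokenizer; real systems usually use larger lists and more careful tokenization.

```python
import re

# Assumption: a small illustrative stop list, not a standard one.
STOP_LIST = {"a", "an", "the", "of", "for", "with", "is", "to", "in", "and"}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def index_terms(text, stop_list=STOP_LIST):
    """Keep only the tokens that are not on the stop list."""
    return [token for token in tokenize(text) if token not in stop_list]


sentence = "The first step is to identify keywords for representing documents."
terms = index_terms(sentence)
print(terms)
```

Only the content-bearing words survive, which keeps the index smaller and stops very frequent words from dominating later counts.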
Figure: Text mining is a part of data mining.
TEXT MINING PROCESS
• Text preprocessing
-Syntactic/Semantic
-text analysis (Text cleanup, Tokenization)
• Features Generation
-Bag of words (the words a document contains and their occurrences)
-Vector space
• Features Selection
-Simple counting
-Statistics
• Text/Data Mining
-Classification (supervised)
-Clustering (unsupervised)
-Associations (relationships)
• Analyzing results
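The feature generation and feature selection steps above can be sketched as follows. The two documents are invented, whitespace splitting stands in for real preprocessing, and the threshold of 2 for "simple counting" is an arbitrary assumption made for this example.

```python
from collections import Counter

# Hypothetical, already-cleaned documents (invented for illustration).
docs = [
    "text mining finds patterns in text",
    "data mining finds trends",
]

# Feature generation: bag of words (each word and its number of occurrences).
bags = [Counter(doc.split()) for doc in docs]

# Vector space: one dimension per vocabulary term, one vector per document.
vocabulary = sorted(set().union(*bags))
vectors = [[bag[term] for term in vocabulary] for bag in bags]

# Feature selection by simple counting: keep terms seen at least twice overall.
totals = Counter()
for bag in bags:
    totals.update(bag)
selected = [term for term in vocabulary if totals[term] >= 2]

print(vocabulary)
print(vectors)
print(selected)
```

The resulting vectors are the input that the classification, clustering, or association steps would work on.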
TEXT MINING APPROACHES
• Text mining approaches differ in the inputs the text mining system takes and
the data mining tasks to be performed. In general, the major approaches,
based on the kinds of data they take as input, are:
(1) the keyword-based approach, where the input is a set of keywords or
terms in the documents,
(2) the tagging approach, where the input is a set of tags, and
(3) the information-extraction approach, where the input is semantic
information, such as events, facts, or entities uncovered by information
extraction.
TEXT MINING APPROACHES…(CONTD)
1) KEYWORD ASSOCIATION-BASED ANALYSIS:
This analysis collects sets of keywords or terms that frequently occur
together and then finds the association or correlation relationships among them.
E.g. [Stanford, University]
2) DOCUMENT CLASSIFICATION ANALYSIS:
Automated document classification is an important text mining task because,
given the tremendous number of on-line documents, it is tedious yet essential
to organize such documents automatically into classes to facilitate document
retrieval and subsequent analysis. E.g. tagging
3) DOCUMENT CLUSTERING ANALYSIS:
Document clustering is one of the most crucial techniques for organizing
documents in an unsupervised manner.
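Keyword association analysis can be illustrated by counting how often pairs of keywords occur in the same document. This is a rough sketch: the keyword sets are invented, and `support` and `confidence` are the usual association-rule measures, used here as an assumption about how the association would be quantified.

```python
from collections import Counter
from itertools import combinations

# Hypothetical keyword sets, one per document (invented for illustration).
docs = [
    {"stanford", "university", "research"},
    {"stanford", "university", "campus"},
    {"university", "research"},
    {"tea", "coffee"},
]

term_counts = Counter()
pair_counts = Counter()
for keywords in docs:
    term_counts.update(keywords)
    # Sort so that each pair is always counted under the same key.
    pair_counts.update(combinations(sorted(keywords), 2))


def support(a, b):
    """Fraction of documents that contain both keywords."""
    return pair_counts[tuple(sorted((a, b)))] / len(docs)


def confidence(a, b):
    """Fraction of the documents containing `a` that also contain `b`."""
    return pair_counts[tuple(sorted((a, b)))] / term_counts[a]


print(support("stanford", "university"))
print(confidence("stanford", "university"))
print(confidence("university", "stanford"))
```

In this toy data every document that mentions "stanford" also mentions "university", but not the reverse, which is the kind of asymmetric relationship the analysis is meant to surface.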
CHALLENGES OF TEXT MINING
• Information is in unstructured textual form
• Large textual databases make text mining difficult to apply
• Complex and subtle relationships between concepts in text
• Word ambiguity and context sensitivity
e.g. "windows" can mean either an operating system or openings in a wall that
let air flow into a house.
• Noisy data
Spelling mistakes and irrelevant data (outliers)
REFERENCES
[1] Jiawei Han and Micheline Kamber (University of Illinois at Urbana-Champaign),
“Data Mining: Concepts and Techniques”, Second Edition
[2] https://www.javatpoint.com/text-data-mining
[3] https://paginas.fe.up.pt/~ec/files_0405/slides/07%20TextMining.pdf