Text Mining
Submitted to:
Ms. Mala Kalra
Dr. Rakesh Kumar
Assistant Professor
Department of CSE
NITTTR Chandigarh
Submitted by:
Pankaj Thakur
MECSE (Modular)
RN 171408
Contents
• Introduction & Need
• Information Retrieval and its Methods
• Approaches
• Process
• Techniques used
• Merits & Demerits
• Challenges
• Applications
• Text Mining Computer Programs
• Demo using Python
• Latest Research work
• References
• Query
Introduction
• Data means known facts that can be recorded and that have implicit meaning.[1]
• Database means a collection of related data.[1]
• Data Warehouse is a repository of information collected from multiple sources, stored under a
unified schema, and usually residing at a single site.[2]
• Data Mining means mining knowledge from data, i.e. extracting knowledge from large amounts of data.[2]
• Text databases (document databases) are large collections of documents from various sources:
news articles, research papers, books, digital libraries, e-mail messages, web pages, etc.
They can be unstructured, semi-structured, or structured:
• May be highly unstructured (some web pages on the WWW)
• May be semi-structured (e-mail messages)
• May be structured (a library catalogue database)
• Text databases with highly regular structures can typically be implemented
using relational database systems.
• Text Mining is the analysis of data contained in natural language text.
• Regular data mining vs. text mining: in text mining, the patterns are extracted from
natural language text rather than from structured databases of facts.[3]
Diagram
Text Mining Vs. Data Mining
Aspect | Data Mining | Text Mining
Data object | Numerical & categorical data | Textual data
Data structure | Structured | Unstructured & semi-structured
Data representation | Straightforward | Complex
Space dimension | < tens of thousands | > tens of thousands
Methods | Data analysis, machine learning, statistics, neural networks | Data mining, information retrieval, NLP, ...
Maturity | Broad implementation since 1994 | Broad implementation starting 2000
Market | 10^5 analysts at large and mid-size companies | 10^8 analysts, corporate workers, and individual users
Need of Text Mining
• The massive amount of new information being created doubles every 18 months.
• 80-90% of all data is held in various unstructured formats.
• Useful information can be derived from this unstructured data.
[Figure: unstructured or semi-structured information (news articles, research papers, books, digital libraries, e-mail messages, and web pages) compared with structured, numerical, or coded information]
• Text databases are growing rapidly due to the increasing amount of information available in
electronic form, such as electronic publications, various kinds of electronic documents, e-mail,
and the WWW.
• Most of the information in government, industry, business, and other institutions is stored
electronically in the form of text databases.
Information Retrieval[2]
Information retrieval (IR) is a field that has been developing in parallel with database systems.
It is concerned with the retrieval of information from a large number of text-based documents.
Precision and recall are two basic measures for assessing the quality of text retrieval.
Precision is the percentage of retrieved documents
that are in fact relevant to the query:
Precision = |{Relevant} ∩ {Retrieved}| / |{Retrieved}|
Recall is the percentage of documents that are relevant
to the query and were, in fact, retrieved:
Recall = |{Relevant} ∩ {Retrieved}| / |{Relevant}|
where {Relevant} is the set of documents relevant to a query and
{Retrieved} is the set of documents retrieved.
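As an illustration, precision and recall can be computed directly from the two sets. This is a minimal sketch; the function name `precision_recall` and the document IDs in the usage note are made up for the example.

```python
def precision_recall(relevant, retrieved):
    """Return (precision, recall) for one query.

    relevant  -- IDs of the documents that are relevant to the query
    retrieved -- IDs of the documents the system returned
    """
    relevant, retrieved = set(relevant), set(retrieved)
    hits = relevant & retrieved  # relevant documents that were retrieved
    # Assumption: an empty denominator is reported as 0.0 instead of raising.
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall
```

For example, with relevant documents {d1, d2, d3, d4} and retrieved documents {d2, d3, d5}, two of the three retrieved documents are relevant (precision 2/3) and two of the four relevant documents were found (recall 1/2).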
Information Retrieval Methods[2]
IR methods fall into two categories:
• Document selection methods treat retrieval as a document selection problem,
e.g. the Boolean retrieval model.
• Document ranking methods treat retrieval as a document ranking problem,
e.g. the vector space model.
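A minimal sketch of the Boolean retrieval model, assuming whitespace tokenization and a simple in-memory inverted index. The helper names `build_index`, `boolean_and`, and `boolean_or` are illustrative and not taken from any library.

```python
from collections import defaultdict

def build_index(documents):
    """Map each lowercased term to the set of document IDs containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def boolean_and(index, *terms):
    """Documents that contain every query term."""
    postings = [index.get(t.lower(), set()) for t in terms]
    return set.intersection(*postings) if postings else set()

def boolean_or(index, *terms):
    """Documents that contain at least one query term."""
    return set().union(*(index.get(t.lower(), set()) for t in terms))
```

A document is either selected or not; no ranking among the selected documents is produced, which is what distinguishes this model from the document ranking methods.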
Vector Space Model[2]
• Represent a document and a query both as vectors in a high-dimensional space
corresponding to all the keywords and use an appropriate similarity measure to
compute the similarity between the query vector and the document vector.
• The similarity values can then be used for ranking documents.
• Let freq(d, t) = term frequency = number of occurrences of term t in the document d.
• TF(d, t) = entry of the term frequency matrix; it measures the association of a term t with respect
to the given document d.
Term Frequency:
TF(d, t) = 0 if freq(d, t) = 0
TF(d, t) = 1 + log(1 + log(freq(d, t))) otherwise
Inverse Document Frequency
(represents the scaling factor, or the importance, of term t):
IDF(t) = log((1 + |d|) / |dt|)
Here, d is the document collection and
dt is the set of documents containing term t.
TF-IDF(d, t) = TF(d, t) × IDF(t)
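The slide leaves the similarity measure open. A common choice is cosine similarity, used here as an assumption rather than as the measure the deck prescribes. The sketch below ranks documents by their cosine similarity to a query vector; the function names and the vectors in the usage note are hypothetical.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length term-weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # assumption: an all-zero vector is similar to nothing
    return dot / (norm_u * norm_v)

def rank_documents(query_vec, doc_vecs):
    """Return (doc_id, score) pairs, most similar document first."""
    scores = {doc_id: cosine_similarity(query_vec, vec)
              for doc_id, vec in doc_vecs.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

With the query vector [1, 1, 0] and documents d1 = [2, 2, 0], d2 = [0, 0, 3], d3 = [1, 0, 1], the ranking is d1, d3, d2, because d1 points in the same direction as the query and d2 shares no terms with it.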
Vector Space Model[2]
(Example)
d/t t1 t2 t3 t4 t5 t6 t7
d1 0 4 10 8 0 5 0
d2 5 19 7 16 0 0 32
d3 15 0 0 4 9 0 17
d4 22 3 12 0 5 15 0
d5 0 7 0 9 2 4 12
A Term Frequency Matrix
For t6 in d4 we have (using base-10 logarithms):
TF(d4, t6) = 1 + log(1 + log(15)) = 1.3377
IDF(t6) = log((1 + 5) / 3) = 0.301
TF-IDF(d4, t6) = 1.3377 × 0.301 = 0.403
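The worked numbers can be checked with a short script. It is a sketch that assumes base-10 logarithms and IDF(t) = log((1 + |d|) / |dt|), which is what the figures above imply. The names `FREQ`, `tf`, `idf`, and `tf_idf` are illustrative, and rows and columns are indexed from 0, so d4 is row 3 and t6 is column 5.

```python
import math

# Term frequency matrix from the example (rows d1..d5, columns t1..t7).
FREQ = [
    [0, 4, 10, 8, 0, 5, 0],
    [5, 19, 7, 16, 0, 0, 32],
    [15, 0, 0, 4, 9, 0, 17],
    [22, 3, 12, 0, 5, 15, 0],
    [0, 7, 0, 9, 2, 4, 12],
]

def tf(freq):
    """Damped term frequency: 0 for absent terms, else 1 + log(1 + log(freq))."""
    if freq == 0:
        return 0.0
    return 1 + math.log10(1 + math.log10(freq))

def idf(matrix, t):
    """log((1 + number of documents) / number of documents containing term t)."""
    n_docs = len(matrix)
    n_containing = sum(1 for row in matrix if row[t] > 0)
    if n_containing == 0:
        return 0.0  # assumption: a term that occurs nowhere gets weight 0
    return math.log10((1 + n_docs) / n_containing)

def tf_idf(matrix, d, t):
    """TF-IDF weight of term column t in document row d."""
    return tf(matrix[d][t]) * idf(matrix, t)
```

Calling `tf_idf(FREQ, 3, 5)` reproduces the value for t6 in d4: t6 occurs 15 times in d4 and appears in 3 of the 5 documents.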
Text Mining Approaches[2]
Text mining approaches, distinguished by their input:
1. Keyword-based approach
• Input: a set of keywords or terms in the documents
• may only discover relationships between terms,
e.g. “database” & “system”, “terrorist” & “explosion”
• may not bring deep understanding to the text
2. Tagging approach
• Input: a set of tags
• may rely on manual tagging
(costly and not feasible for large collections of documents)
3. Information extraction approach
• Input: semantic information (events, facts, etc.)
• more advanced
• may lead to the discovery of some deep knowledge
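To make the keyword-based approach concrete, the sketch below counts how often pairs of keywords occur in the same document. It assumes whitespace tokenization; the function name `cooccurrence_counts` and the sample documents in the usage note are hypothetical.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_counts(documents, keywords):
    """Count, for each keyword pair, the documents in which both appear."""
    counts = Counter()
    for text in documents:
        tokens = set(text.lower().split())
        present = sorted(k for k in keywords if k in tokens)
        # Each alphabetically ordered pair is counted once per document.
        for pair in combinations(present, 2):
            counts[pair] += 1
    return counts
```

For the documents "database system design", "the database system", and "terrorist explosion", the pair ("database", "system") is counted twice and ("explosion", "terrorist") once. The counts reveal that the terms co-occur, but nothing about what the text means, which is the limitation noted above.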
Text Mining Process[4]
1. Text documents are collected from different sources
2. Preprocessing
3. A text mining technique is applied
4. Analysis of text
5. Discovery of knowledge
Technologies such as information extraction, categorization, clustering, visualization, and summarization
are used in the text mining process.
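The preprocessing step is not detailed in the deck. One plausible minimal version, assumed here, lowercases the text, splits it into alphanumeric tokens, and drops very common words. The stop-word list is a tiny illustrative sample, not a standard list.

```python
import re

# Assumption: a tiny illustrative stop-word list; real lists are much longer.
STOPWORDS = {"the", "a", "an", "is", "are", "of", "in", "and", "to", "from"}

def preprocess(text):
    """Lowercase, tokenize on alphanumeric runs, and remove stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]
```

For instance, `preprocess("Text Mining is the analysis of data.")` keeps only the tokens text, mining, analysis, and data, which can then be passed to a mining technique.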
Techniques Used in Text Mining[4]
1. Information extraction
involves tokenization, identification of named entities, sentence segmentation, and part-of-speech assignment.
2. Text categorization
the procedure of assigning a text to one of the categories predefined by users.
3. Text clustering
the procedure of segmenting texts into several clusters according to their substantial
relevance.
4. Visualization
improves and simplifies the discovery of relevant information.
5. Text summarization
the procedure of automatically extracting the partial content of a text that reflects its whole content.
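As a rough illustration of text categorization, the sketch below assigns a text to whichever user-defined category shares the most keywords with it. This keyword-overlap rule is a deliberately simple stand-in, not the method of the cited paper; the function name and the category keyword sets are made up.

```python
def categorize(text, category_keywords):
    """Return the predefined category with the largest keyword overlap.

    category_keywords -- dict mapping a category name to a set of keywords
    Returns None when no category shares any keyword with the text.
    """
    tokens = set(text.lower().split())
    scores = {category: len(tokens & keywords)
              for category, keywords in category_keywords.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None
```

With the categories sports = {cricket, runs, wickets} and finance = {stock, returns, market}, the text "Tendulkar made 100 runs" is assigned to sports. Practical systems usually learn the categories from labelled examples instead of fixed keyword lists.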
Merits and Demerits of Text mining[4]
Merits:
i) The names of different entities and the relationships between them can easily be
found from a corpus of documents (using techniques such as
information extraction).
ii) The challenging problem of managing a great amount of unstructured
information for extracting patterns is solved by text mining.
Demerits:
i) The information that is initially needed is nowhere written down.
ii) No program can be made to analyze unstructured text directly
in order to mine it for information or knowledge.
Challenges in Text Mining
(Representation issues)
• Each word has a dictionary meaning, or meanings:
Run: (1) the verb; (2) the noun, in cricket
Cricket: (1) the game; (2) the insect
Apple: the company or the fruit
• Ambiguity and context sensitivity: each word is used in various “senses”
Tendulkar made 100 runs.
Because of an injury, Tendulkar cannot run and will need a runner between the
wickets.
• Capturing the “meaning” of sentences is an important issue as well.
(Grammar, parts of speech, and time sense could be easy!)
• The order of words in the query matters:
hot dog stand in the amusement park
hot amusement stand in the dog park
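The word-order problem can be shown with the two queries above. In this sketch a bag-of-words representation, which ignores order, cannot tell them apart, while adjacent word pairs (bigrams) can. The helper names are illustrative.

```python
from collections import Counter

def bag_of_words(text):
    """Word counts with word order discarded."""
    return Counter(text.lower().split())

def bigrams(text):
    """Set of adjacent word pairs, which keeps local word order."""
    tokens = text.lower().split()
    return set(zip(tokens, tokens[1:]))

q1 = "hot dog stand in the amusement park"
q2 = "hot amusement stand in the dog park"

same_bag = bag_of_words(q1) == bag_of_words(q2)  # True: same words, same counts
same_bigrams = bigrams(q1) == bigrams(q2)        # False: the word pairs differ
```

Any representation built only on keyword counts therefore treats the two queries as identical, even though they ask for different things.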
Text Mining Applications[5]
1. Security applications
(monitoring and analysis of online plain text sources such as Internet news, blogs, etc.
for national security purposes.)
2. Biomedical applications
(studies in protein docking, protein interactions, and protein-disease associations)
3. Software applications
(Within the public sector, much effort has been concentrated on creating software for
tracking and monitoring terrorist activities.)
4. Online media applications
(The Tribune Company uses text mining to clarify information and to provide readers
with greater search experiences, which in turn increases site "stickiness" and revenue.)
5. Business and marketing applications
(CRM, improving predictive analytics models for customers, stock returns prediction)
6. Sentiment analysis
(analysis of movie reviews, used to detect emotions, etc.)
7. Scientific literature mining and academic applications
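As a small illustration of the sentiment analysis application (item 6), the sketch below scores a review by counting words from two hand-made lists. The word lists and the function name `sentiment` are assumptions made for the example; real systems rely on much larger lexicons or trained models.

```python
import re

# Assumption: tiny hand-made sentiment lexicons, for illustration only.
POSITIVE = {"good", "great", "excellent", "enjoyable", "loved"}
NEGATIVE = {"bad", "boring", "terrible", "awful", "hated"}

def sentiment(review):
    """Label a review by comparing positive and negative word counts."""
    tokens = re.findall(r"[a-z']+", review.lower())
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"
```

A review such as "A great movie, I loved it!" is labelled positive because it contains two positive words and no negative ones. Negation ("not good") and sarcasm are not handled by this kind of counting.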
Text Mining Computer Programs[5]
Demo
• Text mining using Python
(Twitter, WhatsApp chats)
Latest Research work on Text Mining[6]
1. Sunil Kumar, Maninder Singh, “Big data analytics for healthcare industry: impact,
applications, and tools”, DOI: 10.26599/BDMA.2018.9020031
2. Bing Li, Xiaochun Yang, Rui Zhou, Bin Wang, Chengfei Liu, Yanchun Zhang, “An
Efficient Method for High Quality and Cohesive Topical Phrase Mining”, DOI:
10.1109/TKDE.2018.2823758
3. Steven H. H. Ding, Benjamin C. M. Fung, Farkhund Iqbal, William K. Cheung,
“Learning Stylometric Representations for Authorship Analysis”, DOI:
10.1109/TCYB.2017.2766189
4. Mohammed Nasri, Younes Jaafar, Karim Bouzoubaa, “Semantic Analysis of Arabic
Texts Within SAFAR Framework”, DOI: 10.1109/CIST.2018.8596491
5. Jayesh Choudhari, Anirban Dasgupta, Indrajit Bhattacharya, Srikanta Bedathur,
“Discovering Topical Interactions in Text-Based Cascades Using Hidden Markov
Hawkes Processes”, DOI: 10.1109/ICDM.2018.00112
6. Yong Luo, Huaizheng Zhang, Yongjie Wang, Yonggang Wen, Xinwen Zhang,
“ResumeNet: A Learning-Based Framework for Automatic Resume Quality
Assessment”, DOI: 10.1109/ICDM.2018.00046
7. Si-Yu Ding, Xu-Ying Liu, Min-Ling Zhang, “Imbalanced Augmented Class Learning with
Unlabeled Data by Label Confidence Propagation”, DOI: 10.1109/ICDM.2018.00023
References
[1] Ramez Elmasri and Shamkant B. Navathe, “Fundamentals of Database Systems”, 6th
edition.
[2] Jiawei Han and Micheline Kamber, “Data Mining: Concepts and Techniques”, 2nd
edition.
[3] http://people.ischool.berkeley.edu/~hearst/text-mining.html
[4] Sonali Vijay Gaikwad, Archana Chaugule, Pramod Patil, “Text Mining Methods and
Techniques”, International Journal of Computer Applications (0975 – 8887),
Volume 85 – No 17, January 2014
[5] http://www.wikipedia.org
[6] https://ieeexplore.org
Questions?
Thanks!
Data Warehouse
[Figure: data from sources in Delhi, Mumbai, Kolkata, and Chennai is cleaned, integrated, transformed, loaded, and refreshed into a data warehouse, which clients access through query and analysis tools]
Text mining

  • 1. Text Mining Submitted to: Ms. Mala Kalra Dr. Rakesh Kumar Assistant Professor Department of CSE NITTTR Chandigarh Submitted by: Pankaj Thakur MECSE (Modular) RN 171408
  • 2. Contents: Introduction & Need; Information Retrieval and its Methods; Approaches; Process; Techniques Used; Merits & Demerits; Challenges; Applications; Text Mining Computer Programs; Demo using Python; Latest Research Work; References; Query
  • 3. Introduction
      • Data: known facts that can be recorded and that have implicit meaning.[1]
      • Database: a collection of related data.[1]
      • Data warehouse: a repository of information collected from multiple sources, stored under a unified schema, and usually residing at a single site.[2]
      • Data mining: knowledge mining from data, i.e. extracting knowledge from large amounts of data.[2]
      • Text databases (document databases): large collections of documents from various sources, such as news articles, research papers, books, digital libraries, e-mail messages, and web pages. They may be highly unstructured (some web pages on the WWW), semi-structured (e-mail messages), or structured (a library catalogue database).
      • Text databases with highly regular structures can typically be implemented using relational database systems.
      • Text mining: the analysis of data contained in natural language text.
      • Regular data mining vs. text mining: in text mining, the patterns are extracted from natural language text rather than from structured databases of facts.[3]
  • 4. Text Mining vs. Data Mining
      Aspect | Data Mining | Text Mining
      Data object | Numerical and categorical data | Textual data
      Data structure | Structured | Unstructured and semi-structured
      Data representation | Straightforward | Complex
      Space dimension | < tens of thousands | > tens of thousands
      Methods | Data analysis, machine learning, statistics, neural networks | Data mining, information retrieval, NLP, ...
      Maturity | Broad implementation since 1994 | Broad implementation starting 2000
      Market | 10^5 analysts at large and mid-size companies | 10^8 analysts, corporate workers, and individual users
  • 5. Need of Text Mining
      • The amount of new information being created doubles every 18 months.
      • 80-90% of all data is held in various unstructured formats.
      • Useful information can be derived from this unstructured data.
      • Unstructured or semi-structured information (news articles, research papers, books, digital libraries, e-mail messages, and web pages) stands in contrast to structured, numerical, or coded information.
      • Text databases are growing rapidly because of the increasing amount of information available in electronic form, such as electronic publications, various kinds of electronic documents, e-mail, and the WWW.
      • Most of the information in government, industry, business, and other institutions is stored electronically in the form of text databases.
  • 6. Information Retrieval[2]
      • Information retrieval (IR) is a field that has been developing in parallel with database systems. It is concerned with the retrieval of information from a large number of text-based documents.
      • Precision and recall are the two basic measures for assessing the quality of text retrieval.
      • Precision is the percentage of retrieved documents that are in fact relevant to the query: $\text{precision} = \dfrac{|\{\text{Relevant}\} \cap \{\text{Retrieved}\}|}{|\{\text{Retrieved}\}|}$
      • Recall is the percentage of documents relevant to the query that were in fact retrieved: $\text{recall} = \dfrac{|\{\text{Relevant}\} \cap \{\text{Retrieved}\}|}{|\{\text{Relevant}\}|}$
      • Here {Relevant} is the set of documents relevant to a query and {Retrieved} is the set of documents retrieved.
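The two measures on this slide can be sketched with Python sets. This is a minimal illustration, not part of the original deck; the function name `precision_recall` and the document IDs are made up for the example.

```python
def precision_recall(relevant, retrieved):
    """Return (precision, recall) for two collections of document IDs."""
    relevant, retrieved = set(relevant), set(retrieved)
    hits = relevant & retrieved  # documents that are both relevant and retrieved
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical query result: 4 relevant documents, 3 retrieved, 2 in common.
relevant = {"d1", "d2", "d3", "d4"}
retrieved = {"d2", "d3", "d5"}
precision, recall = precision_recall(relevant, retrieved)
print(f"precision={precision:.3f} recall={recall:.3f}")
```

With these made-up sets, two of the three retrieved documents are relevant and two of the four relevant documents were retrieved, so the two measures differ even though the overlap is the same.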
  • 7. Information Retrieval Methods[2]
      IR methods fall into two categories:
      • Document selection methods: address the document selection problem, e.g. the Boolean retrieval model.
      • Document ranking methods: address the document ranking problem, e.g. the vector space model.
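A rough sketch of document selection with a Boolean retrieval model follows. It assumes a tiny invented corpus and whitespace tokenization; the inverted index maps each term to the set of documents containing it, so AND, OR, and NOT become set operations.

```python
from collections import defaultdict

# Hypothetical three-document corpus, invented for illustration.
docs = {
    "d1": "database systems store structured data",
    "d2": "text mining finds patterns in text",
    "d3": "data mining and text mining both use database systems",
}


def build_index(docs):
    """Map each term to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


index = build_index(docs)

text_and_mining = index["text"] & index["mining"]        # text AND mining
database_or_text = index["database"] | index["text"]     # database OR text
mining_not_database = index["mining"] - index["database"]  # mining AND NOT database

print(sorted(text_and_mining), sorted(database_or_text), sorted(mining_not_database))
```

A Boolean query only selects documents; it gives no ordering among the matches, which is the gap the document ranking methods address.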
  • 8. Vector Space Model[2]
      • Represent both a document and a query as vectors in a high-dimensional space corresponding to all the keywords, and use an appropriate similarity measure to compute the similarity between the query vector and the document vector.
      • The similarity values can then be used for ranking documents.
      • Let freq(d, t) be the term frequency, i.e. the number of occurrences of term t in document d.
      • TF(d, t), an entry of the term frequency matrix, measures the association of a term t with the given document d: $TF(d, t) = \begin{cases} 0 & \text{if } freq(d, t) = 0 \\ 1 + \log(1 + \log(freq(d, t))) & \text{otherwise} \end{cases}$
      • IDF(t), the inverse document frequency, represents the scaling factor, or importance, of term t: $IDF(t) = \log \dfrac{1 + |d|}{|d_t|}$, where d is the document collection and d_t is the set of documents containing term t.
      • $\text{TF-IDF}(d, t) = TF(d, t) \times IDF(t)$
  • 9. Vector Space Model[2] (Example)
      A term frequency matrix:
      d/t | t1 | t2 | t3 | t4 | t5 | t6 | t7
      d1 | 0 | 4 | 10 | 8 | 0 | 5 | 0
      d2 | 5 | 19 | 7 | 16 | 0 | 0 | 32
      d3 | 15 | 0 | 0 | 4 | 9 | 0 | 17
      d4 | 22 | 3 | 12 | 0 | 5 | 15 | 0
      d5 | 0 | 7 | 0 | 9 | 2 | 4 | 12
      For t6 in d4, using base-10 logarithms:
      $TF(d_4, t_6) = 1 + \log(1 + \log(15)) = 1.3377$
      $IDF(t_6) = \log \dfrac{1 + 5}{3} = 0.301$
      $\text{TF-IDF}(d_4, t_6) = 1.3377 \times 0.301 = 0.403$
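The worked example can be checked with a short script. This is a sketch that assumes base-10 logarithms (the base that reproduces the slide's numbers); the helper names `tf`, `idf`, and `tf_idf` are chosen for the example, and the matrix is the one from the slide.

```python
import math

# Term frequency matrix from the slide: rows d1..d5, columns t1..t7.
freq = {
    "d1": [0, 4, 10, 8, 0, 5, 0],
    "d2": [5, 19, 7, 16, 0, 0, 32],
    "d3": [15, 0, 0, 4, 9, 0, 17],
    "d4": [22, 3, 12, 0, 5, 15, 0],
    "d5": [0, 7, 0, 9, 2, 4, 12],
}


def tf(count):
    """Dampened term frequency: 0 if the term is absent, else 1 + log(1 + log(count))."""
    return 0.0 if count == 0 else 1 + math.log10(1 + math.log10(count))


def idf(term_index):
    """log((1 + |d|) / |d_t|), where |d_t| counts documents containing the term."""
    containing = sum(1 for row in freq.values() if row[term_index] > 0)
    if containing == 0:  # guard added for the sketch; not needed for this matrix
        return 0.0
    return math.log10((1 + len(freq)) / containing)


def tf_idf(doc, term_index):
    return tf(freq[doc][term_index]) * idf(term_index)


T6 = 5  # zero-based column index of term t6
score = tf_idf("d4", T6)
print(f"TF={tf(15):.4f} IDF={idf(T6):.3f} TF-IDF={score:.3f}")
```

Term t6 occurs in three of the five documents (d1, d4, d5), which is where the 3 in the IDF denominator comes from.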
  • 10. Text Mining Approaches[2]
      • Keyword-based approach: the input is a set of keywords or terms in the documents. It may only discover relationships such as “database” & “system” or “terrorist” & “explosion”, and may not bring deep understanding to the text.
      • Tagging approach: the input is a set of tags. It may rely on manual tagging, which is costly and not feasible for large collections of documents.
      • Information extraction approach: the input is semantic information (events, facts, etc.). It is more advanced and may lead to the discovery of some deep knowledge.
  • 11. Text Mining Process[4]
      • Steps: text documents from different sources → preprocessing → a text mining technique is applied → analysis of text → discovery of knowledge.
      • Technologies such as information extraction, categorization, clustering, visualization, and summarization are used in the text mining process.
  • 12. Techniques Used in Text Mining[4]
      1. Information extraction: tokenization, identification of named entities, sentence segmentation, and part-of-speech assignment.
      2. Text categorization: the procedure of assigning a category to the text from among categories predefined by users.
      3. Text clustering: the procedure of segmenting texts into several clusters, depending on their substantial relevance.
      4. Visualization: improves and simplifies the discovery of relevant information.
      5. Text summarization: the procedure of automatically extracting partial content of a text that reflects its whole contents.
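Two of the information extraction steps named above, sentence segmentation and tokenization, can be sketched with regular expressions. This is a deliberately naive illustration using only the standard library; the rules and the sample text are assumptions made for the example, and real systems handle abbreviations, numbers, and other edge cases that these patterns do not.

```python
import re


def split_sentences(text):
    """Naive segmentation: a sentence ends at '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def tokenize(sentence):
    """Lowercase the sentence and keep runs of letters and digits as tokens."""
    return re.findall(r"[a-z0-9]+", sentence.lower())


# Sample text invented for illustration.
text = "Text mining analyses natural language text. It finds patterns in documents!"
sentences = split_sentences(text)
tokens = [tokenize(s) for s in sentences]
print(sentences)
print(tokens)
```

The token lists produced here are the usual input to the later steps, such as building a term frequency matrix for categorization or clustering.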
  • 13. Merits and Demerits of Text Mining[4]
      Merits:
      i) The names of different entities and the relationships between them can easily be found from a corpus of documents, using techniques such as information extraction.
      ii) Text mining solves the challenging problem of managing large amounts of unstructured information in order to extract patterns.
      Demerits:
      i) The information that is initially needed is nowhere written down.
      ii) No program can analyze unstructured text directly in order to mine it for information or knowledge.
  • 14. Challenges in Text Mining (Representation Issues)
      • Each word has one or more dictionary meanings. Run: (1) the verb, (2) the noun, in cricket. Cricket: (1) the game, (2) the insect. Apple (the company) or apple (the fruit).
      • Ambiguity and context sensitivity: each word is used in various “senses”. “Tendulkar made 100 runs.” “Because of an injury, Tendulkar cannot run and will need a runner between the wickets.”
      • Capturing the “meaning” of sentences is an important issue as well. (Grammar, parts of speech, and time sense could be easy!)
      • Order of words in the query: “hot dog stand in the amusement park” vs. “hot amusement stand in the dog park”.
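The word-order issue can be made concrete: under a bag-of-words representation the two queries on this slide are indistinguishable, while a representation that keeps adjacent word pairs (bigrams) tells them apart. The snippet is a small sketch; the `bigrams` helper is introduced only for this illustration.

```python
from collections import Counter

q1 = "hot dog stand in the amusement park"
q2 = "hot amusement stand in the dog park"

# Bag of words: word counts only, order discarded.
bag1 = Counter(q1.split())
bag2 = Counter(q2.split())
same_bag = bag1 == bag2


def bigrams(text):
    """Set of adjacent word pairs, which preserves local word order."""
    words = text.split()
    return set(zip(words, words[1:]))


shared = bigrams(q1) & bigrams(q2)
print(same_bag, sorted(shared))
```

Only the pairs that sit in the same position in both queries survive the intersection, so the bigram view separates the two phrases that the word counts treat as identical.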
  • 15. Text Mining Applications[5]
      1. Security applications: monitoring and analysis of online plain-text sources such as Internet news and blogs for national security purposes.
      2. Biomedical applications: studies in protein docking, protein interactions, and protein-disease associations.
      3. Software applications: within the public sector, much effort has been concentrated on creating software for tracking and monitoring terrorist activities.
      4. Online media applications: the Tribune Company uses text mining to clarify information and to provide readers with greater search experiences, which in turn increases site "stickiness" and revenue.
      5. Business and marketing applications: customer relationship management (CRM), improving predictive analytics models for customers, and stock returns prediction.
      6. Sentiment analysis: analysis of movie reviews, detection of emotions, etc.
      7. Scientific literature mining and academic applications.
  • 16. Text Mining Computer Programs[5]
  • 17. Demo: text mining using Python (Twitter, WhatsApp chats)
  • 18. Latest Research Work on Text Mining[6]
      1. Sunil Kumar, Maninder Singh, “Big data analytics for healthcare industry: impact, applications, and tools”, DOI: 10.26599/BDMA.2018.9020031
      2. Bing Li, Xiaochun Yang, Rui Zhou, Bin Wang, Chengfei Liu, Yanchun Zhang, “An Efficient Method for High Quality and Cohesive Topical Phrase Mining”, DOI: 10.1109/TKDE.2018.2823758
      3. Steven H. H. Ding, Benjamin C. M. Fung, Farkhund Iqbal, William K. Cheung, “Learning Stylometric Representations for Authorship Analysis”, DOI: 10.1109/TCYB.2017.2766189
      4. Mohammed Nasri, Younes Jaafar, Karim Bouzoubaa, “Semantic Analysis of Arabic Texts Within SAFAR Framework”, DOI: 10.1109/CIST.2018.8596491
      5. Jayesh Choudhari, Anirban Dasgupta, Indrajit Bhattacharya, Srikanta Bedathur, “Discovering Topical Interactions in Text-Based Cascades Using Hidden Markov Hawkes Processes”, DOI: 10.1109/ICDM.2018.00112
      6. Yong Luo, Huaizheng Zhang, Yongjie Wang, Yonggang Wen, Xinwen Zhang, “ResumeNet: A Learning-Based Framework for Automatic Resume Quality Assessment”, DOI: 10.1109/ICDM.2018.00046
      7. Si-Yu Ding, Xu-Ying Liu, Min-Ling Zhang, “Imbalanced Augmented Class Learning with Unlabeled Data by Label Confidence Propagation”, DOI: 10.1109/ICDM.2018.00023
  • 19. References
      [1] Ramez Elmasri and Shamkant B. Navathe, “Fundamentals of Database Systems”, 6th edition.
      [2] Jiawei Han and Micheline Kamber, “Data Mining: Concepts and Techniques”, 2nd edition.
      [3] http://people.ischool.berkeley.edu/~hearst/text-mining.html
      [4] Sonali Vijay Gaikwad, Archana Chaugule, Pramod Patil, “Text Mining Methods and Techniques”, International Journal of Computer Applications (0975-8887), Volume 85, No. 17, January 2014.
      [5] http://www.wikipedia.org
      [6] https://ieeexplore.org
  • 22. Data Warehouse (diagram): data sources in Delhi, Mumbai, Kolkata, and Chennai → clean, integrate, transform, load, refresh → data warehouse → query and analysis tools → clients