Introduction to
Data Mining
Kai Koenig
@AgentK
Web/Mobile Developer since the late 1990s
Interested in: Java & JVM, CFML, Functional
Programming, Go, Android, Data Science
And this is my view of the world…
Me
1. What is Data Mining?
2. Concepts and Terminology
3. Weka
4. Algorithms
5. Dealing with Text
6. Java integration
Agenda
We are overwhelmed
with data.
1. What is Data Mining?
Fundamentals
Why do we nowadays have SO MUCH data?
Reasons include:
- Cheap storage and better processing power
- Legal & Business requirements
- Digital hoarding
Fundamentals
Data Mining is all about going from data to useful
and meaningful information.
- Recommendation in online shops
- Finding an “optimal” partner
- Weather prediction
- Judgement decisions (credit applications)
Fundamentals
A better definition
“Data Mining is defined as the process of
discovering patterns in data. The process must be
automatic or (more usually) semiautomatic. The
patterns discovered must be meaningful in that
they lead to some advantage, often an economic
one.”
(Prof. Dr. Ian Witten)
How can you express patterns?
Finding and applying rules
Tear Production Rate == reduced -> none
Age == young && Astigmatism == no -> soft
A Result: Decision lists
If outlook = sunny and humidity = high then play = no

If outlook = rainy and windy = true then play = no

If outlook = overcast then play = yes

If humidity = normal then play = yes

If none of the above then play = yes
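A decision list is applied top to bottom: the first rule that matches decides the outcome. A minimal Java sketch of the list above; the class name, method name and map keys are illustrative assumptions, not part of Weka.

```java
import java.util.Map;

// Sketch of the decision list above: rules are tried in order and the
// first one whose condition holds determines the result.
class DecisionListSketch {

    // 'weather' maps attribute names to values; the key names are assumptions.
    static String play(Map<String, String> weather) {
        String outlook = weather.get("outlook");
        String humidity = weather.get("humidity");
        String windy = weather.get("windy");

        if ("sunny".equals(outlook) && "high".equals(humidity)) return "no";
        if ("rainy".equals(outlook) && "true".equals(windy)) return "no";
        if ("overcast".equals(outlook)) return "yes";
        if ("normal".equals(humidity)) return "yes";
        return "yes"; // default rule: none of the above matched
    }

    public static void main(String[] args) {
        System.out.println(play(Map.of("outlook", "sunny", "humidity", "high", "windy", "false")));
        System.out.println(play(Map.of("outlook", "overcast", "humidity", "high", "windy", "true")));
    }
}
```

Because order matters, moving a rule up or down the list can change the prediction for the same instance.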
Not all rules are equal
Classification rules: predict an outcome
Association rules: rules that strongly associate
different attribute values
If temperature = cool then humidity = normal

If humidity = normal and windy = false then play = yes 

If outlook = sunny and play = no then humidity = high

2. Concepts and Terminology
Learning
What is Learning? And what is Machine Learning?
A good working definition is:
“Things learn when they change their
behaviour in a way that makes them perform
better in the future”
Learning types
Classification learning
Association learning
Clustering
Numerical Prediction
Some basic terminology
The thing to be learned is the concept.
The output of a learning scheme is the
concept description.
Classification learning is sometimes called
supervised learning. The outcome is the
class.
Examples are called instances.
Some more basic terminology
Discrete attribute values are usually called
nominal values; continuous attribute values are
simply called numeric values.
Algorithms used to process data and find
patterns are often called classifiers. There are
lots of them, and all of them can be heavily
configured.
3. Weka
What is Weka?
Waikato Environment for Knowledge Analysis
Developed by a group in the Dept. of Computer
Science at the University of Waikato in New
Zealand.


Also, the weka is a bird found only in New Zealand.
What is Weka?
Download for Mac OS X, Linux and Windows:
http://www.cs.waikato.ac.nz/~ml/weka/index.html

Weka is written in Java, comes either as a native
application or as an executable .jar file, and is
licensed under GPL v3.
Getting data into Weka
Easiest and most common for experimenting: .arff files
Also supported: CSV, JSON, XML, JDBC
connections etc.
Filters in Weka can then be used to preprocess
data.
Features
50+ Preprocessing tools
75+ Classification/Regression algorithms
~10 clustering algorithms
… and a package manager to load and install
more if you want.
4. Algorithms
Classifiers
There are literally hundreds with lots of tuning
options.
Main Categories:
- Rule-based (ZeroR, OneR, PART etc.)
- Tree-based (J48, J48graft, CART etc.)
- Bayes-based (NaiveBayes etc.)
- Functions-based (LR, Logistic etc.)
- Lazy (IB1, IBk etc.)
OneR
A very simple classifier based on a single
attribute.
For each attribute,
    For each value of that attribute, make a rule as follows:
        count how often each class appears
        find the most frequent class
        make the rule assign that class to this attribute value.
    Calculate the error rate of the rules.
Choose the rules with the smallest error rate.
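The steps above can be sketched in plain Java without Weka. This is a minimal illustration, assuming nominal attributes stored as strings with the class in the last column; all names are hypothetical and the tie-breaking (first attribute wins) is a choice made here.

```java
import java.util.HashMap;
import java.util.Map;

// Minimal OneR sketch. Assumption: every row is an array of nominal values
// and the last column holds the class.
class OneRSketch {

    // For one attribute, builds the rule set "attribute value -> most frequent class".
    static Map<String, String> rulesFor(String[][] rows, int attr) {
        int cls = rows[0].length - 1;
        Map<String, Map<String, Integer>> counts = new HashMap<>();
        for (String[] row : rows) {
            // count how often each class appears for this attribute value
            counts.computeIfAbsent(row[attr], v -> new HashMap<>())
                  .merge(row[cls], 1, Integer::sum);
        }
        Map<String, String> rules = new HashMap<>();
        for (Map.Entry<String, Map<String, Integer>> e : counts.entrySet()) {
            String best = null;
            int bestCount = -1;
            for (Map.Entry<String, Integer> c : e.getValue().entrySet()) {
                if (c.getValue() > bestCount) {
                    best = c.getKey();
                    bestCount = c.getValue();
                }
            }
            rules.put(e.getKey(), best);
        }
        return rules;
    }

    // Number of training rows the rule set for this attribute gets wrong.
    static int errors(String[][] rows, int attr, Map<String, String> rules) {
        int cls = rows[0].length - 1;
        int wrong = 0;
        for (String[] row : rows) {
            if (!rules.get(row[attr]).equals(row[cls])) wrong++;
        }
        return wrong;
    }

    // Index of the attribute whose rules make the fewest errors (first one wins ties).
    static int bestAttribute(String[][] rows) {
        int best = -1;
        int bestErrors = Integer.MAX_VALUE;
        for (int attr = 0; attr < rows[0].length - 1; attr++) {
            int err = errors(rows, attr, rulesFor(rows, attr));
            if (err < bestErrors) {
                best = attr;
                bestErrors = err;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Made-up toy data: outlook, humidity, play
        String[][] rows = {
            {"sunny", "high", "no"}, {"sunny", "normal", "yes"},
            {"overcast", "high", "yes"}, {"rainy", "high", "no"},
            {"rainy", "normal", "yes"}, {"overcast", "normal", "yes"}
        };
        int attr = bestAttribute(rows);
        System.out.println("best attribute index: " + attr);
        System.out.println("rules: " + rulesFor(rows, attr));
    }
}
```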
C4.5 (J48)
Produces a decision tree, derived from
divide-and-conquer tree building techniques.
Decision trees are often verbose and need to be
pruned. J48 uses post-pruning; pruning can in
some instances be costly.
J48 usually provides a good balance between
quality and cost (execution times etc.)
NaiveBayes
Very good and popular for document (text)
classification.
Based on statistical modelling (Bayes' rule of
conditional probability)
In document classification we treat the presence
or absence of a word as a Boolean attribute.
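Written out for a class c and the Boolean word attributes w_1 … w_n of a document, Bayes' rule reads as below. The second step is the "naive" assumption, standard for this classifier but not spelled out on the slide, that words are independent of each other given the class:

```latex
P(c \mid w_1, \dots, w_n)
  = \frac{P(c)\, P(w_1, \dots, w_n \mid c)}{P(w_1, \dots, w_n)}
  \approx \frac{P(c) \prod_{i=1}^{n} P(w_i \mid c)}{P(w_1, \dots, w_n)}
```

The denominator is the same for every class, so picking the class with the largest numerator is enough.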
Training and Testing
We implicitly trained and tested our classifiers in
the previous examples using Cross-Validation.
Training and Testing
Test data and Training data NEED to be different.
If you have only one dataset, split it up.
n-fold Cross-Validation:
- Divides your dataset into n parts, holds out
each part in turn
- Trains with n-1 parts, tests with the held out
part
- Stratified CV is even better
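The splitting step can be sketched as follows. This is a plain (non-stratified) split over instance indices; the class and method names are assumptions for illustration, and a stratified version would additionally keep the class proportions roughly equal in every fold.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Sketch of n-fold cross-validation splitting (not stratified).
class CrossValidationSketch {

    // Shuffles the indices 0..size-1 with a fixed seed and deals them into n folds.
    static List<List<Integer>> makeFolds(int size, int n, long seed) {
        List<Integer> indices = new ArrayList<>();
        for (int i = 0; i < size; i++) indices.add(i);
        Collections.shuffle(indices, new Random(seed));

        List<List<Integer>> folds = new ArrayList<>();
        for (int f = 0; f < n; f++) folds.add(new ArrayList<>());
        for (int i = 0; i < size; i++) folds.get(i % n).add(indices.get(i));
        return folds;
    }

    public static void main(String[] args) {
        int size = 10;
        int n = 5;
        List<List<Integer>> folds = makeFolds(size, n, 1L);
        for (int held = 0; held < n; held++) {
            // hold out one fold for testing, train on the remaining n-1
            List<Integer> test = folds.get(held);
            List<Integer> train = new ArrayList<>();
            for (int f = 0; f < n; f++) {
                if (f != held) train.addAll(folds.get(f));
            }
            System.out.println("fold " + held + ": train=" + train.size() + " test=" + test.size());
        }
    }
}
```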
5. Dealing with Text
Bag of Words
Generally, for document classification we treat a
document as a bag of words, and the presence
or absence of a word is a Boolean attribute.
This results in problems with a very large number
of attributes, each having only 2 values.
This is quite a bit different from the usual
classification problem.
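To make the bag-of-words idea concrete, here is a small sketch that turns documents into Boolean word-presence vectors. It is not Weka's StringToWordVector; the tokenisation (lower-casing, splitting on non-word characters) and all names are assumptions chosen for brevity.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

// Sketch: documents as Boolean "word present / absent" vectors.
class BagOfWordsSketch {

    // Very simple tokeniser: lower-case, split on runs of non-word characters.
    static List<String> tokens(String doc) {
        List<String> out = new ArrayList<>();
        for (String t : doc.toLowerCase().split("\\W+")) {
            if (!t.isEmpty()) out.add(t);
        }
        return out;
    }

    // The vocabulary is every distinct word in the corpus, sorted alphabetically.
    static List<String> vocabulary(List<String> docs) {
        TreeSet<String> words = new TreeSet<>();
        for (String doc : docs) words.addAll(tokens(doc));
        return new ArrayList<>(words);
    }

    // One Boolean attribute per vocabulary word: true if the word occurs in the document.
    static boolean[] toVector(String doc, List<String> vocab) {
        List<String> present = tokens(doc);
        boolean[] vector = new boolean[vocab.size()];
        for (int i = 0; i < vocab.size(); i++) {
            vector[i] = present.contains(vocab.get(i));
        }
        return vector;
    }

    public static void main(String[] args) {
        List<String> docs = List.of("Corn prices rise", "Wheat prices fall");
        List<String> vocab = vocabulary(docs);
        System.out.println(vocab);
        System.out.println(java.util.Arrays.toString(toVector(docs.get(0), vocab)));
    }
}
```

Even two short documents already produce five attributes; a real corpus easily yields thousands.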
Filtered Classifiers
First step: use the FilteredClassifier with J48 and the
StringToWordVector filter.
Example: Reuters Corn datasets (train/test)
We get 97% accuracy, but there’s still an issue
here -> investigate the confusion matrix
Is accuracy the best way to evaluate quality?
Better approaches to evaluation
Accuracy: (a+d)/(a+b+c+d)
Recall: R = d/(c+d)
Precision: P = d/(b+d)
F-Measure: 2PR/(P+R)
False positive rate FP: b/(a+b)
True negative rate TN: a/(a+b)
False negative rate FN: c/(c+d)
            predicted
            –     +
true   –    a     b
       +    c     d
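The formulas translate directly into code. In the sketch below the counts are read off the J48 result reported later in these slides (38/57 grain docs and 544/547 non-grain docs correct), assuming grain is the positive class; the class and method names are illustrative.

```java
// Sketch: evaluation measures from the confusion matrix cells a, b, c, d
// (a = true negatives, b = false positives, c = false negatives, d = true positives).
class ConfusionMatrixSketch {

    static double accuracy(int a, int b, int c, int d) {
        return (double) (a + d) / (a + b + c + d);
    }

    static double recall(int c, int d) {
        return (double) d / (c + d);
    }

    static double precision(int b, int d) {
        return (double) d / (b + d);
    }

    static double fMeasure(double p, double r) {
        return 2 * p * r / (p + r);
    }

    static double falsePositiveRate(int a, int b) {
        return (double) b / (a + b);
    }

    public static void main(String[] args) {
        // Assumption: grain = positive class, so d = 38, c = 57 - 38, a = 544, b = 547 - 544.
        int a = 544, b = 3, c = 19, d = 38;
        double p = precision(b, d);
        double r = recall(c, d);
        System.out.printf("accuracy=%.3f recall=%.3f precision=%.3f F=%.3f FP rate=%.3f%n",
                accuracy(a, b, c, d), r, p, fMeasure(p, r), falsePositiveRate(a, b));
    }
}
```

With these counts accuracy is high while recall on the grain class is only 38/57, which is exactly the kind of issue the confusion matrix exposes.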
ROC (threshold) curves
The area under the ROC (threshold) curve summarises the
overall quality of a classifier in a single number.
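One way to compute that area without drawing the curve: it equals the fraction of (positive, negative) pairs in which the positive instance gets the higher score, with ties counted as half. A minimal sketch with hypothetical names and made-up scores:

```java
// Sketch: area under the ROC curve via pairwise comparison of scores.
class AucSketch {

    // scores[i] is the classifier's score for "positive"; positive[i] is the true label.
    static double auc(double[] scores, boolean[] positive) {
        double correct = 0;
        long pairs = 0;
        for (int i = 0; i < scores.length; i++) {
            if (!positive[i]) continue;
            for (int j = 0; j < scores.length; j++) {
                if (positive[j]) continue;
                pairs++;
                if (scores[i] > scores[j]) correct += 1.0;
                else if (scores[i] == scores[j]) correct += 0.5;
            }
        }
        if (pairs == 0) {
            throw new IllegalArgumentException("need at least one positive and one negative instance");
        }
        return correct / pairs;
    }

    public static void main(String[] args) {
        double[] scores = {0.9, 0.8, 0.3, 0.1};
        boolean[] labels = {true, false, true, false};
        System.out.println(auc(scores, labels)); // 3 of 4 pairs ranked correctly -> 0.75
    }
}
```

A value of 1.0 means every positive is ranked above every negative; 0.5 is what random scoring would give.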
NaiveBayesMultinomial
Often the best classifier for document
classification. In particular:
- good ROC
- good results on minority class (often what we
want)
NaiveBayesMultinomial
J48: 96% accuracy, 38/57 on grain docs, 544/547
on non-grain docs, ROC 0.91
NaiveBayes: 80% accuracy, 46/57 on grain docs,
439/547 on non-grain docs, ROC 0.885
NaiveBayesMultinomial: 91% accuracy, 52/57 on
grain docs, 496/547 on non-grain docs, ROC
0.973
NaiveBayesMultinomial
NaiveBayesMultinomial with stoplist, lowerCase
and outputWords: 94% accuracy, 56/57 on grain
docs, 504/547 on non-grain docs, ROC 0.978
Why? NBM is designed for text:
- based solely on word appearance
- can deal with multiple repetitions of a word
- faster than NB
6. Java integration
Weka is written in Java
The UI is essentially making use of a vast
underlying data mining and machine learning
API.
Obviously this fact
invites us to use the
API directly :)
Setting up a project (IntelliJ IDEA)
Create new Java project in IntelliJ
Import weka.jar
Import weka-src.jar
Off you go!
The main classes/packages you need…
import java.io.BufferedReader;
import java.io.FileReader;
import java.util.Random;

import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
Getting stuff done
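// Load the training data from an ARFF file ("train.arff" is a placeholder path).
BufferedReader bReader = new BufferedReader(new FileReader("train.arff"));
Instances train = new Instances(bReader);
bReader.close();
// The last attribute is the class to predict.
train.setClassIndex(train.numAttributes() - 1);

// Build a J48 decision tree on the training data.
J48 j48 = new J48();
j48.buildClassifier(train);

// Estimate quality with 10-fold cross-validation (fixed seed for repeatability).
Evaluation eval = new Evaluation(train);
eval.crossValidateModel(j48, train, 10, new Random(1));
System.out.println(eval.toSummaryString());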
You can also grab Java code off the Weka UI
Photo Credits
https://www.flickr.com/photos/johnnystiletto/3339808858/
https://www.flickr.com/photos/theequinest/5056055144/
https://www.flickr.com/photos/flyingkiwigirl/17385243168
https://www.flickr.com/photos/x6e38/3440973490/
https://www.flickr.com/photos/42931449@N07/5418402840/
https://www.flickr.com/photos/gerardstolk/12194108005/
https://www.flickr.com/photos/zzpza/3269784239/in/
https://www.flickr.com/photos/internationaltransportforum/14258907973/


Get in touch
Kai Koenig
Email: kai@ventego-creative.co.nz
www.ventego-creative.co.nz
Blog: www.bloginblack.de
Twitter: @AgentK
