Romi Satria Wahono
romi@romisatriawahono.net
http://romisatriawahono.net
08118228331
Data Mining
Romi Satria Wahono
2
• SMA Taruna Nusantara Magelang (1993)
• B.Eng, M.Eng and Ph.D in Software Engineering
Saitama University Japan (1994-2004)
Universiti Teknikal Malaysia Melaka (2014)
• Core Competency in Enterprise Architecture,
Software Engineering and Machine Learning
• LIPI Researcher (2004-2007)
• Founder, CoFounder and CEO:
• PT Brainmatics Cipta Informatika (2005)
• PT Imani Prima (2007)
• PT IlmuKomputerCom Braindevs Sistema (2014)
• PT Brainmatics Indonesia Cendekia (2020)
• Professional Member of IEEE, ACM and PMI
• IT and Research Award Winner from WSIS (United Nations), Kemdikbud,
Ristekdikti, LIPI, etc
• SCOPUS/ISI Indexed Journal Reviewer: Information and Software Technology,
Journal of Systems and Software, Software: Practice and Experience, etc
• Industrial IT Certifications: TOGAF, ITIL, CCAI, CCNA, etc
• Enterprise Architecture & Digital Transformation Expert: KPK, INSW, BPPT, LIPI,
Kemenkeu, RistekDikti, Pertamina EP, PLN, PJB, PJBI, IP, FIF, Kemlu, ESDM, etc.
3
Educational
Objectives
(Benjamin Bloom)
Cognitive
Affective
Psychomotor
Criterion Referenced
Instruction
(Robert Mager)
Competencies
Performance
Evaluation
Minimalism
(John Carroll)
Start Immediately
Minimize the
Reading
Error Recognition
Self-Contained
Learning Design
4
Textbooks
5
1. Explain the difference between data, information and knowledge!
2. Explain what you know about data mining!
3. List the main roles of data mining!
4. List uses of data mining in various fields!
5. What knowledge can we extract from the data below?
Pre-Test
6
| NIM | Gender | National Exam Score | School of Origin | IPS1 | IPS2 | IPS3 | IPS4 | ... | Graduated On Time |
| 10001 | M | 28 | SMAN 2 | 3.3 | 3.6 | 2.89 | 2.9 | ... | Yes |
| 10002 | F | 27 | SMAN 7 | 4.0 | 3.2 | 3.8 | 3.7 | ... | No |
| 10003 | F | 24 | SMAN 1 | 2.7 | 3.4 | 4.0 | 3.5 | ... | No |
| 10004 | M | 26.4 | SMAN 3 | 3.2 | 2.7 | 3.6 | 3.4 | ... | Yes |
| ... | | | | | | | | | |
| 11000 | M | 23.4 | SMAN 5 | 3.3 | 2.8 | 3.1 | 3.2 | ... | Yes |
Course
Outline • 1.1 What and Why Data Mining?
• 1.2 Main Roles and Methods of Data Mining
• 1.3 History and Applications of Data Mining
1. Introduction
• 2.1 Data Mining Process and Tools
• 2.2 Applying the Data Mining Process
• 2.3 Evaluating Data Mining Models
• 2.4 The CRISP-DM Data Mining Process
2. Process
• 3.1 Data Cleaning
• 3.2 Data Reduction
• 3.3 Data Transformation
• 3.4 Data Integration
3. Data Preparation
• 4.1 Classification Algorithms
• 4.2 Clustering Algorithms
• 4.3 Association Algorithms
• 4.4 Estimation and Forecasting Algorithms
4. Algorithms
• 5.1 Text Mining Concepts
• 5.2 Text Clustering
• 5.3 Text Classification
• 5.4 Data Mining Laws
5. Text Mining
7
1. Introduction to Data Mining
1.1 What and Why Data Mining?
1.2 Main Roles and Methods of Data Mining
1.3 History and Applications of Data Mining
8
1.1 What and Why Data Mining?
9
Humans produce data of enormous
quantity and size
• Astronomy
• Business
• Medicine
• Economics
• Sports
• Weather
• Finance
• …
Humans Produce Data
10
Astronomi
• Sloan Digital Sky Survey
• New Mexico, 2000
• 140TB over 10 years
• Large Synoptic Survey Telescope
• Chile, 2016
• Will acquire 140TB every five days
Biology and Medicine
• European Bioinformatics Institute (EBI)
• 20PB of data (genomic data doubles in size each year)
• A single sequenced human genome can be around 140GB in size
Data Growth
11
kilobyte (kB) = 10^3 bytes
megabyte (MB) = 10^6 bytes
gigabyte (GB) = 10^9 bytes
terabyte (TB) = 10^12 bytes
petabyte (PB) = 10^15 bytes
exabyte (EB) = 10^18 bytes
zettabyte (ZB) = 10^21 bytes
yottabyte (YB) = 10^24 bytes
Changing Culture and Behavior
12
• Mobile Electronics market
• 7B smartphone subscriptions in 2015
• The Web and social networks generate huge amounts of data
• Google processes 100 PB per day on some 3 million servers
• Facebook stores 300 PB of user data
• YouTube stores around 1000 PB of video
The Coming Data Tsunami
13
We are drowning in data,
but starving for knowledge!
(John Naisbitt, Megatrends, 1988)
Drowning in Data but Starving for Knowledge
14
Turning Data into Knowledge
15
• Data must be processed into knowledge
to be useful to people
• With that knowledge, people can:
• Estimate and predict what will happen in the future
• Analyze associations, correlations and
groupings among data and attributes
• Support decision making and
policy making
16
Employee Attendance Data
Data - Information - Knowledge - Policy
17
| NIP | Date | In | Out |
| 1103 | 02/12/2004 | 07:20 | 15:40 |
| 1142 | 02/12/2004 | 07:45 | 15:33 |
| 1156 | 02/12/2004 | 07:51 | 16:00 |
| 1173 | 02/12/2004 | 08:00 | 15:15 |
| 1180 | 02/12/2004 | 07:01 | 16:31 |
| 1183 | 02/12/2004 | 07:49 | 17:00 |
Information: Monthly Employee Attendance Summary
Data - Information - Knowledge - Policy
18
NIP Present Absent Leave Sick Late
1103 22
1142 18 2 2
1156 10 1 11
1173 12 5 5
1180 10 12
Weekly Employee Attendance Patterns
Data - Information - Knowledge - Policy
19
| | Mon | Tue | Wed | Thu | Fri |
| Late | 7 | 0 | 1 | 0 | 5 |
| Left early | 0 | 1 | 1 | 1 | 8 |
| Permission | 3 | 0 | 0 | 1 | 4 |
| Absent | 1 | 0 | 2 | 0 | 2 |
• Policy: rearrange working hours specifically
for Mondays and Fridays
• Working-hour rules:
• Mondays start at 10:00
• Fridays end at 14:00
• The remaining hours are compensated on other days
Data - Information - Knowledge - Policy
20
Data - Information - Knowledge - Policy
21
Policy
Knowledge
Information
Data
Employee working-hours policy
Employees' arrival-departure patterns
Employee attendance summary
Employee attendance data
Data - Information - Knowledge - Policy
22
What is Data Mining?
23
The discipline that studies methods for extracting
knowledge, or discovering patterns, from large data
• Extraction from data to knowledge:
1. Data: recorded facts that by themselves carry no meaning
2. Information: recaps, summaries, explanations and statistics
of the data
3. Knowledge: patterns, formulas, rules or models that
emerge from the data
• Other names for data mining:
• Knowledge Discovery in Databases (KDD)
• Big data
• Business intelligence
• Knowledge extraction
• Pattern analysis
• Information harvesting
24
Terminology and Other Names for Data Mining
The Data Mining Process in Concept
25
Dataset
Data Mining
Method
Knowledge
• Extracting implicit, previously unknown and
potentially important information from data
(Witten et al., 2011)
• Activities covering the collection and use of
historical data to find regularities, patterns and
relationships in large datasets (Santosa, 2007)
• Extraction of interesting (non-trivial, implicit,
previously unknown and potentially useful)
patterns or knowledge from huge amounts of data
(Han et al., 2011)
Definitions of Data Mining
26
• Tens of thousands of student records on campus,
taken from the academic information system
• Have we ever turned them into more useful
knowledge? NO!
• What does that knowledge look like? Formulas, patterns, rules
Example: Data on Campus
27
Predicting Student Graduation
28
• Tens of thousands of legislative-candidate records at the Election Commission (KPU)
• Have we ever turned them into more useful
knowledge? NO!
Example: Election Commission Data
29
Predicting DKI Jakarta Legislative Candidates
30
Determining Creditworthiness
31
[Chart: number of non-performing loans ("kredit macet"), 2003-2004]
Money Laundering Detection
32
Forest Fire Prediction
33
| FFMC | DMC | DC | ISI | temp | RH | wind | rain | ln(area+1) |
| 93.5 | 139.4 | 594.2 | 20.3 | 17.6 | 52 | 5.8 | 0 | 0 |
| 92.4 | 124.1 | 680.7 | 8.5 | 17.2 | 58 | 1.3 | 0 | 0 |
| 90.9 | 126.5 | 686.5 | 7 | 15.6 | 66 | 3.1 | 0 | 0 |
| 85.8 | 48.3 | 313.4 | 3.9 | 18 | 42 | 2.7 | 0 | 0.307485 |
| 91 | 129.5 | 692.6 | 7 | 21.7 | 38 | 2.2 | 0 | 0.357674 |
| 90.9 | 126.5 | 686.5 | 7 | 21.9 | 39 | 1.8 | 0 | 0.385262 |
| 95.5 | 99.9 | 513.3 | 13.2 | 23.3 | 31 | 4.5 | 0 | 0.438255 |
| | SVM | SVM+GA |
| C | 4.3 | 1.840 |
| Gamma (γ) | 5.9 | 9.648 |
| Epsilon (ε) | 3.9 | 5.615 |
| RMSE | 1.391 | 1.379 |
Prediction and clustering
of suspected corruptors
Profiling and Predicting Corruptors
34
Data
Data
Data
Data
Enforcement Activities
Prevention Activities
Knowledge
Association among the attributes
of corruption suspects
Money-laundering prediction
Estimation of sentence type
and length
Corruption Suspect Profile Patterns
35
Profiling and Detecting Migrant Worker (TKI) Cases
36
Clustering Poverty Levels
37
Association Rule Patterns from Transaction Data
38
Association Rule Patterns at Amazon.com
39
Stupid
Applications
• Academic Information
System
• Election Recording
System
• Official Wealth
Reporting System
• Credit Recording
System
Smart
Applications
• Student Graduation
Prediction System
• Election Result
Prediction System
• Corruptor Prediction
System
• Creditworthiness
Scoring System
From Stupid Apps to Smart Apps
40
Industrial Revolution 4.0
41
• Uber - the world’s largest taxi company,
owns no vehicles
• Google - world’s largest
media/advertising company, creates no
content
• Alibaba - the most valuable retailer, has
no inventory
• Airbnb - the world’s largest
accommodation provider, owns no real
estate
• Gojek - a public transport company that
owns no vehicles
Knowledge-Processing Companies
42
Data Mining Tasks and Roles
in General
43
Increasing potential
values to support
business decisions
End User
Business Analyst
Data Scientist
DBA/
DBE
Decision
Making
Data Presentation
Visualization Techniques
Data Mining
Information Discovery and Modeling
Data Exploration
Statistical Summary, Metadata, and Description
Data Preprocessing, Data Integration, Data Warehouses
Data Sources
Paper, Files, Web documents, Scientific experiments, Database Systems
• Analyze data to obtain knowledge patterns
(model/rule/formula/tree)
• The knowledge patterns (model/rule/formula/tree)
are embedded into a system (software)
• The system (software) becomes smart, significantly
increasing the value and benefit of the
company/organization
• Where does the data scientist sit in a technology
product company (a startup business or GAFAM)?
• Data Scientist? Software Engineer? Researcher? IT
Infrastructure Engineer?
Data Mining Tasks and Roles
in Product Development
44
45
How does a
COMPUTING
SOLUTION
WORK?
CD music sales
are failing?
Build a music
subscription
app product?
SOFTWARE ENGINEERING
DATA SCIENCE Infrastructure &
Security
Service
operation
Computing
research
Product management
IT governance
Data
Mining
Pattern
Recognition
Machine
Learning
Statistics
Computing
Algorithms
Database
Technology
How Data Mining Relates to Other Fields
46
1. Tremendous amount of data
• Algorithms must be highly scalable to handle terabytes
of data
2. High-dimensionality of data
• Micro-array may have tens of thousands of dimensions
3. High complexity of data
• Data streams and sensor data
• Time-series data, temporal data, sequence data
• Structure data, graphs, social networks and multi-linked data
• Heterogeneous databases and legacy databases
• Spatial, spatiotemporal, multimedia, text and Web data
• Software programs, scientific simulations
4. New and sophisticated applications
Challenges in Data Mining
47
1. Explain in your own words what
data mining is!
2. Describe the conceptual flow of the
data mining process!
Exercise
48
1.2 Main Roles and Methods of Data
Mining
49
Main Roles of Data Mining
50
1. Estimation
2. Forecasting
3. Classification
4. Clustering
5. Association
Data Mining Roles
(Larose, 2005)
Dataset
51
Class/Label/Target
Attribute/Feature/Dimension
Nominal
Numeric
Record/
Object/
Sample/
Tuple/
Data
Data Types
52
| Data Type | Description | Example | Operations |
| Ratio (absolute); numeric, continuous | Obtained by measurement; the distance between two points on the scale is known; has an absolute zero point (*, /) | Age; body weight; height; amount of money | geometric mean, harmonic mean, percent variation |
| Interval; numeric, continuous | Obtained by measurement; the distance between two points on the scale is known; has no absolute zero point (+, -) | Temperature 0°C-100°C; age 20-30 years | mean, standard deviation, Pearson's correlation, t and F tests |
| Ordinal (rank); categorical, discrete | Obtained by categorization or classification, with an ordering relation among the values (<, >) | Customer satisfaction level (satisfied, neutral, dissatisfied) | median, percentiles, rank correlation, run tests, sign tests |
| Nominal (label); categorical, discrete | Obtained by categorization or classification; values merely distinguish objects (=, ≠) | Postal code; gender; employee ID number; city name | mode, entropy, contingency correlation, χ² test |
53
Main Roles of Data Mining
54
1. Estimation
2. Forecasting
3. Classification
4. Clustering
5. Association
Data Mining Roles
(Larose, 2005)
| Customer | Orders (P) | Traffic Lights (TL) | Distance (J) | Delivery Time (T) |
| 1 | 3 | 3 | 3 | 16 |
| 2 | 1 | 7 | 4 | 20 |
| 3 | 2 | 4 | 6 | 18 |
| 4 | 4 | 6 | 8 | 36 |
| ... | | | | |
| 1000 | 2 | 4 | 2 | 12 |
1. Estimating Pizza Delivery Time
55
Delivery Time (T) = 0.48P + 0.23TL + 0.5J
Knowledge
Learning with an
Estimation Method (Linear Regression)
Label
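The learned regression formula above can be applied directly to estimate a new delivery. A minimal sketch in Python (the function name and the sample inputs are illustrative, not from the slides):

```python
def estimate_delivery_time(orders, traffic_lights, distance):
    """Apply the learned linear regression model:
    T = 0.48*P + 0.23*TL + 0.5*J"""
    return 0.48 * orders + 0.23 * traffic_lights + 0.5 * distance

# Estimate for a hypothetical new customer:
# 2 orders, 4 traffic lights, distance 5
print(estimate_delivery_time(2, 4, 5))
```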
• Example: 209 different computer configurations
• Linear regression function
PRP = -55.9 + 0.0489 MYCT + 0.0153 MMIN + 0.0056 MMAX
+ 0.6410 CACH - 0.2700 CHMIN + 1.480 CHMAX
Example: Estimating CPU Performance
56
| | Cycle time (ns) MYCT | Main memory (Kb) MMIN | Main memory (Kb) MMAX | Cache (Kb) CACH | Channels CHMIN | Channels CHMAX | Performance PRP |
| 1 | 125 | 256 | 6000 | 256 | 16 | 128 | 198 |
| 2 | 29 | 8000 | 32000 | 32 | 8 | 32 | 269 |
| ... | | | | | | | |
| 208 | 480 | 512 | 8000 | 32 | 0 | 0 | 67 |
| 209 | 480 | 1000 | 4000 | 0 | 0 | 0 | 45 |
1. Formula/Function (regression formula or function)
• DELIVERY TIME = 0.48 + 0.6 DISTANCE + 0.34 LIGHTS + 0.2 ORDERS
2. Decision Tree
3. Correlation and Association
4. Rule
• IF ips3=2.8 THEN lulustepatwaktu
5. Cluster
Output/Pattern/Model/Knowledge
57
A dataset of stock prices
in time-series form
2. Forecasting Stock Prices
58
Learning with a
Forecasting Method (Neural Network)
Time-Series Label
Knowledge in the form of
a neural network formula
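The slide trains a neural network on the price series; as a much simpler stand-in, a baseline forecaster illustrates the same idea of predicting the next value from past values (the data below is made up):

```python
def moving_average_forecast(series, window=3):
    """Forecast the next value of a time series as the mean of the
    last `window` observations - a naive baseline, not the neural
    network model shown on the slide."""
    recent = series[-window:]
    return sum(recent) / len(recent)

closing_prices = [102.0, 104.0, 103.0, 105.0, 107.0]
print(moving_average_forecast(closing_prices))  # (103 + 105 + 107) / 3 = 105.0
```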
59
Prediction Plot
Weather Forecasting
60
Exchange Rate Forecasting
61
Inflation Rate Forecasting
62
| NIM | Gender | National Exam Score | School of Origin | IPS1 | IPS2 | IPS3 | IPS4 | ... | Graduated On Time |
| 10001 | M | 28 | SMAN 2 | 3.3 | 3.6 | 2.89 | 2.9 | ... | Yes |
| 10002 | F | 27 | SMA DK | 4.0 | 3.2 | 3.8 | 3.7 | ... | No |
| 10003 | F | 24 | SMAN 1 | 2.7 | 3.4 | 4.0 | 3.5 | ... | No |
| 10004 | M | 26.4 | SMAN 3 | 3.2 | 2.7 | 3.6 | 3.4 | ... | Yes |
| ... | | | | | | | | | |
| 11000 | M | 23.4 | SMAN 5 | 3.3 | 2.8 | 3.1 | 3.2 | ... | Yes |
3. Classifying Student Graduation
63
Learning with a
Classification Method (C4.5)
Label
Knowledge as a Decision Tree
64
• Input:
• Output (Rules):
If outlook = sunny and humidity = high then play = no
If outlook = rainy and windy = true then play = no
If outlook = overcast then play = yes
If humidity = normal then play = yes
If none of the above then play = yes
Example: Golf Play Recommendation
65
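The rule set above is an ordered list: the first matching rule fires. A direct Python translation (function and argument names are illustrative):

```python
def play_golf(outlook, humidity, windy):
    """Evaluate the C4.5-style rule list in order; first match wins."""
    if outlook == "sunny" and humidity == "high":
        return "no"
    if outlook == "rainy" and windy:
        return "no"
    if outlook == "overcast":
        return "yes"
    if humidity == "normal":
        return "yes"
    return "yes"  # default rule: none of the above

print(play_golf("sunny", "high", windy=False))    # no
print(play_golf("overcast", "high", windy=True))  # yes
```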
• Output (Tree):
Example: Golf Play Recommendation
66
• Input:
Example: Contact Lens Recommendation
67
• Output/Model (Tree):
Example: Contact Lens Recommendation
68
Sentiment Analysis Classification
69
Bankruptcy Prediction
70
4. Clustering Iris Flowers
71
Learning with a
Clustering Method (K-Means)
Unlabeled Dataset
Knowledge (Model) as Clusters
72
Clustering Customer Types
73
Clustering Citizen Sentiment
74
Poverty Rate Clustering
75
5. Association Rules for Item Purchases
76
Learning with an
Association Method (FP-Growth)
Knowledge as Association Rules
77
• An association rule algorithm finds attributes that
"occur together"
• Example: on a Thursday evening, 1000 customers
shopped at supermarket ABC, where:
• 200 people bought bath soap
• of those 200 soap buyers, 50 also bought Fanta
• So the association rule becomes: "If a customer buys
bath soap, then they buy Fanta", with support =
200/1000 = 20% and confidence = 50/200 = 25%
• Association rule algorithms include the Apriori
algorithm, the FP-Growth algorithm, and the GRI algorithm
Example of an Association Rule
78
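The support and confidence figures from the soap-and-Fanta example can be computed directly; a small sketch following the slide's own definitions (support as the antecedent's share of all transactions):

```python
def rule_metrics(n_transactions, n_antecedent, n_both):
    """Metrics for the rule 'buys soap -> buys Fanta', as defined on
    the slide: support is the antecedent's share of all transactions,
    confidence the share of antecedent buyers who also bought the
    consequent."""
    support = n_antecedent / n_transactions   # 200 / 1000 = 0.20
    confidence = n_both / n_antecedent        # 50 / 200 = 0.25
    return support, confidence

support, confidence = rule_metrics(1000, 200, 50)
print(support, confidence)  # 0.2 0.25
```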
Association Rules at Amazon.com
79
Correlation between heating-oil consumption
and the factors below:
1. Insulation: thickness of the home's insulation
2. Temperature: air temperature around the home
3. Heating Oil: oil consumption
per year per home
4. Number of Occupants: number of people living in the home
5. Average Age: average age of the occupants
6. Home Size: size of the home
Heating Oil Consumption
80
81
82
Correlation of 4 Variables with Oil Consumption
83
Number of
Occupants
Average
Age
Insulation
Thickness
Temperature
Oil
Consumption
0.848
-0.774
0.736
0.381
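The coefficients above are Pearson correlations; how such a coefficient is computed can be sketched in a few lines (the toy data here is made up, not the heating-oil dataset):

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson product-moment correlation of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

insulation = [4.0, 6.0, 8.0, 10.0]
oil_used = [9.0, 7.5, 6.0, 4.5]        # consumption falls as insulation rises
print(pearson_r(insulation, oil_used))  # close to -1.0 (strong negative correlation)
```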
Data mining amplifies perception in the
business domain
• How does data mining produce insight? This law
approaches the heart of data mining – why it must be a
business process and not a technical one
• Business problems are solved by people, not by algorithms
• The data miner and the business expert "see" the
solution to a problem, that is, the patterns in the domain
that allow the business objective to be achieved
• Thus data mining is, or assists as part of, a perceptual process
• Data mining algorithms reveal patterns that are not normally
visible to human perception
• Within the data mining process, the human problem
solver interprets the results of data mining algorithms
and integrates them into their business understanding
Insight Law (Data Mining Law 6)
84
1. Estimation (Estimasi):
Linear Regression (LR), Neural Network (NN), Deep Learning (DL),
Support Vector Machine (SVM), Generalized Linear Model (GLM), etc
2. Forecasting (Prediksi/Peramalan):
Linear Regression (LR), Neural Network (NN), Deep Learning (DL),
Support Vector Machine (SVM), Generalized Linear Model (GLM), etc
3. Classification (Klasifikasi):
Decision Tree (CART, ID3, C4.5, Credal DT, Credal C4.5, Adaptative
Credal C4.5), Naive Bayes (NB), K-Nearest Neighbor (kNN), Linear
Discriminant Analysis (LDA), Logistic Regression (LogR), etc
4. Clustering (Klastering):
K-Means, K-Medoids, Self-Organizing Map (SOM), Fuzzy C-Means
(FCM), etc
5. Association (Asosiasi):
FP-Growth, Apriori, Coefficient of Correlation, Chi Square, etc
Data Mining Methods
85
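As an illustration of the clustering entry above, k-means can be sketched in pure Python for one-dimensional data (a toy sketch with naive initialisation, not a production implementation):

```python
def kmeans_1d(points, k=2, iters=10):
    """Minimal 1-D k-means: assign each point to its nearest centroid,
    then move each centroid to the mean of its assigned points."""
    centroids = points[:k]  # naive initialisation: first k points
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # keep the old centroid if a cluster ends up empty
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return sorted(centroids)

# Two obvious groups around 1.0 and 10.0
print(kmeans_1d([1.0, 1.2, 0.8, 10.0, 10.4, 9.6]))  # ~[1.0, 10.0]
```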
1. Formula/Function (regression formula or function)
• DELIVERY TIME = 0.48 + 0.6 DISTANCE + 0.34 LIGHTS + 0.2 ORDERS
2. Decision Tree
3. Correlation Level
4. Rule
• IF ips3=2.8 THEN lulustepatwaktu
5. Cluster
Output/Pattern/Model/Knowledge
86
Categories of Data Mining Algorithms
87
Supervised
Learning
Unsupervised
Learning
Semi-
Supervised
Learning
Association based Learning
• Learning with a teacher: the dataset has a
target/label/class
• Most data mining algorithms
(estimation, prediction/forecasting,
classification) are supervised learning
• The algorithm learns from the values of
the target variable associated with the
values of the predictor variables
1. Supervised Learning
88
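A tiny worked example of supervised learning in the sense described above: a k-nearest-neighbour classifier that predicts a label from labelled training data. The data and names are illustrative, loosely modelled on the graduation example:

```python
def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among the k nearest labelled
    training examples (1-D feature, e.g. a GPA value)."""
    by_dist = sorted(train, key=lambda rec: abs(rec[0] - query))
    votes = [label for _, label in by_dist[:k]]
    return max(set(votes), key=votes.count)

# (GPA, graduated on time?) pairs -- toy labelled data
train = [(3.6, "yes"), (3.4, "yes"), (3.1, "yes"), (2.4, "no"), (2.2, "no")]
print(knn_predict(train, 3.5))  # yes
print(knn_predict(train, 2.3))  # no
```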
Dataset with a Class
89
Class/Label/Target
Attribute/Feature/Dimension
Nominal
Numeric
• The algorithm looks for patterns across
all variables (attributes)
• No variable (attribute) is designated
as the target/label/class
• Clustering algorithms are
unsupervised learning
2. Unsupervised Learning
90
Dataset without a Class
91
Attribute/Feature/Dimension
• Semi-supervised learning
is a data mining method
that uses both labeled
and unlabeled data
in its learning process
• The labeled data is used
to build the model
(knowledge); the
unlabeled data is used
to refine the boundaries
between the classes
3. Semi-Supervised Learning
92
1. List the 5 main roles of data mining!
2. Explain the difference between estimation and forecasting!
3. Explain the difference between forecasting and classification!
4. Explain the difference between classification and clustering!
5. Explain the difference between clustering and association!
6. Explain the difference between estimation and classification!
7. Explain the difference between estimation and clustering!
8. Explain the difference between supervised and unsupervised
learning!
9. List the main stages of the data mining process!
Exercise
93
1.3 History and Applications of
Data Mining
94
• Before 1600: Empirical science
• Science was what could be observed directly
• 1600-1950: Theoretical science
• Science was what could be proven mathematically or
experimentally
• 1950s-1990: Computational science
• Every discipline moved toward computation
• Many computational models were born
• 1990-now: Data science
• Human culture produces big data
• Computers can process big data
• Data mining arrives as mainstream science
(Jim Gray and Alex Szalay, The World Wide Telescope: An Archetype for Online Science,
Communication of ACM, 45(11): 50-54, Nov. 2002)
Evolution of Sciences
95
96
Industrial Revolution 4.0
97
98
99
Business
Knowledge
Methods
Technology
Business objectives are the origin of every data
mining solution
• This defines the field of data mining: data mining is
concerned with solving business problems and
achieving business goals
• Data mining is not primarily a technology; it is a
process, which has one or more business objectives
at its heart
• Without a business objective, there is no data
mining
• The maxim: “Data Mining is a Business Process”
Business Goals Law (Data Mining Law 1)
100
Business Knowledge Law (Data Mining Law 2)
101
Business knowledge is central to every step of the
data mining process
• A naive reading of CRISP-DM would see business
knowledge used at the start of the process in
defining goals, and at the end of the process in
guiding deployment of results
• This would be to miss a key property of the data
mining process, that business knowledge has a
central role in every step
• Marketing: product recommendation, market basket
analysis, product targeting, customer retention
• Finance: investment support, portfolio management,
price forecasting
• Banking and Insurance: credit and policy approval,
money laundering detection
• Security: fraud detection, access control, intrusion
detection, virus detection
• Manufacturing: process modeling, quality control,
resource allocation
• Web and Internet: smart search engines, web marketing
• Software Engineering: effort estimation, fault prediction
• Telecommunication: network monitoring, customer
churn prediction, user behavior analysis
Private and Commercial Sector
102
Use Case: Product Recommendation
103
[Scatter plot: total spend (Tot.Belanja) vs. number of pieces (Jml.Pcs) and items (Jml.Item), grouped into Clusters 1-3]
• The cost of capturing and
correcting defects is expensive
• $14,102 per defect in post-release
phase (Boehm & Basili 2008)
• $60 billion per year (NIST 2002)
• Industrial practice of manual
software reviews can find
only 60% of defects
(Shull et al. 2002)
• The probability of detection of
software fault prediction
models is higher (71%) than
software reviews (60%)
Use Case: Software Fault Prediction
104
• Finance: exchange rate forecasting, sentiment
analysis
• Taxation: adaptive monitoring, fraud detection
• Medicine and Health Care: hypothesis discovery,
disease prediction and classification, medical
diagnosis
• Education: student allocation, resource forecasting
• Insurance: worker’s compensation analysis
• Security: bomb, iceberg detection
• Transportation: simulation and analysis, load
estimation
• Law: legal patent analysis, law and rule analysis
• Politics: election prediction
Public and Government Sector
105
• Determining creditworthiness for home loans at a bank
• Determining PLN electricity supply for the Jakarta region
• Predicting corruption-suspect profiles from court data
• Forecasting stock prices and inflation rates
• Analyzing customer purchase patterns
• Separating crude oil and natural gas
• Finding patterns of loyal customers at a telephone
operator
• Detecting money laundering in banking transactions
• Detecting intrusions in a network
Examples of Data Mining Applications
106
• 1989 IJCAI Workshop on Knowledge Discovery in Databases
• Knowledge Discovery in Databases (G. Piatetsky-Shapiro and W. Frawley, 1991)
• 1991-1994 Workshops on Knowledge Discovery in Databases
• Advances in Knowledge Discovery and Data Mining (U. Fayyad, G.
Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, 1996)
• 1995-1998 International Conferences on Knowledge Discovery
in Databases and Data Mining (KDD’95-98)
• Journal of Data Mining and Knowledge Discovery (1997)
• ACM SIGKDD conferences since 1998 and SIGKDD Explorations
• More conferences on data mining
• PAKDD (1997), PKDD (1997), SIAM-Data Mining (2001), (IEEE) ICDM
(2001), WSDM (2008), etc.
• ACM Transactions on KDD (2007)
Data Mining Society
107
Conferences
• ACM SIGKDD Int. Conf. on
Knowledge Discovery in Databases
and Data Mining (KDD)
• SIAM Data Mining Conf. (SDM)
• (IEEE) Int. Conf. on Data Mining
(ICDM)
• European Conf. on Machine
Learning and Principles and
practices of Knowledge Discovery
and Data Mining (ECML-PKDD)
• Pacific-Asia Conf. on Knowledge
Discovery and Data Mining
(PAKDD)
• Int. Conf. on Web Search and Data
Mining (WSDM)
Journals
• ACM Transactions on Knowledge
Discovery from Data (TKDD)
• ACM Transactions on
Information Systems (TOIS)
• IEEE Transactions on Knowledge
and Data Engineering
• Springer Data Mining and
Knowledge Discovery
• International Journal of Business
Intelligence and Data Mining
(IJBIDM)
Data Mining Conferences and Journals
108
2. The Data Mining Process
2.1 Data Mining Process and Tools
2.2 Applying the Data Mining Process
2.3 Evaluating Data Mining Models
2.4 The CRISP-DM Data Mining Process
109
2.1 Data Mining Process and Tools
110
1. Dataset
(understand and
prepare the data)
2. Data Mining
Method
(choose a method that
fits the data)
3. Knowledge
(understand the resulting
model and knowledge)
4. Evaluation
(analyze the model and
the method's performance)
The Data Mining Process
111
DATA PREPROCESSING
Data Cleaning
Data Integration
Data Reduction
Data Transformation
MODELING
Estimation
Prediction
Classification
Clustering
Association
MODEL
Formula
Tree
Cluster
Rule
Correlation
PERFORMANCE
Accuracy
Error Rate
Number of Clusters
MODEL
Attribute/Factor
Correlation
Weight
• An attribute is a factor or parameter that causes the
class/label/target to occur
• Datasets come in two kinds: private and public
• Private dataset: taken from the organization
we are studying
• Banks, hospitals, industry, factories, service companies, etc
• Public dataset: taken from public repositories
agreed upon by data mining researchers
• UCI Repository (http://www.ics.uci.edu/~mlearn/MLRepository.html)
• ACM KDD Cup (http://www.sigkdd.org/kddcup/)
• PredictionIO (http://docs.prediction.io/datacollection/sample/)
• The current research trend is to test newly developed
methods on public datasets, so that research is
comparable, repeatable and verifiable
1. The Dataset
112
Public Data Set (UCI Repository)
113
Dataset
114
Class/Label/Target
Attribute/Feature/Dimension
Nominal
Numeric
Record/
Object/
Sample/
Tuple/
Data
1. Estimation (Estimasi):
Linear Regression (LR), Neural Network (NN), Deep Learning (DL),
Support Vector Machine (SVM), Generalized Linear Model (GLM), etc
2. Forecasting (Prediksi/Peramalan):
Linear Regression (LR), Neural Network (NN), Deep Learning (DL),
Support Vector Machine (SVM), Generalized Linear Model (GLM), etc
3. Classification (Klasifikasi):
Decision Tree (CART, ID3, C4.5, Credal DT, Credal C4.5, Adaptative
Credal C4.5), Naive Bayes (NB), K-Nearest Neighbor (kNN), Linear
Discriminant Analysis (LDA), Logistic Regression (LogR), etc
4. Clustering (Klastering):
K-Means, K-Medoids, Self-Organizing Map (SOM), Fuzzy C-Means
(FCM), etc
5. Association (Asosiasi):
FP-Growth, Apriori, Coefficient of Correlation, Chi Square, etc
2. Data Mining Methods
115
1. Formula/Function (regression formula or function)
• DELIVERY TIME = 0.48 + 0.6 DISTANCE + 0.34 LIGHTS + 0.2 ORDERS
2. Decision Tree
3. Correlation Level
4. Rule
• IF ips3=2.8 THEN lulustepatwaktu
5. Cluster
3. Knowledge (Pattern/Model)
116
1. Estimation:
Error: Root Mean Square Error (RMSE), MSE, MAPE, etc
2. Prediction/Forecasting (Prediksi/Peramalan):
Error: Root Mean Square Error (RMSE) , MSE, MAPE, etc
3. Classification:
Confusion Matrix: Accuracy
ROC Curve: Area Under Curve (AUC)
4. Clustering:
Internal Evaluation: Davies–Bouldin index, Dunn index,
External Evaluation: Rand measure, F-measure, Jaccard index,
Fowlkes–Mallows index, Confusion matrix
5. Association:
Lift Charts: Lift Ratio
Precision and Recall (F-measure)
4. Evaluation (Accuracy, Error, etc)
117
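Two of the measures listed above can be computed in a few lines; a sketch (the counts and values are made-up examples):

```python
from math import sqrt

def accuracy(tp, tn, fp, fn):
    """Classification accuracy from confusion-matrix counts."""
    return (tp + tn) / (tp + tn + fp + fn)

def rmse(actual, predicted):
    """Root Mean Square Error for estimation/forecasting models."""
    n = len(actual)
    return sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n)

print(accuracy(40, 45, 5, 10))       # 0.85
print(rmse([3.0, 5.0], [2.0, 7.0]))  # sqrt((1 + 4) / 2)
```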
1. Accuracy
• A measure of how well the model correlates outcomes
with the attributes in the data provided
• There are many accuracy measures, but all of them
depend on the data used
2. Reliability
• A measure of how the model holds up when applied to
different datasets
• A data mining model is reliable if it produces the same
general patterns regardless of the test data supplied
3. Usefulness
• Covers metrics that measure whether the model
provides useful information
Model Evaluation and Validation Criteria
118
A balance among the three is needed: an accurate model is not necessarily
reliable, and a reliable or accurate model is not necessarily useful
Magic Quadrant for Data Science Platform
(Gartner, 2017)
119
Magic Quadrant for Data Science Platform
(Gartner, 2018)
120
• KNIME (Konstanz Information Miner)
is a free, open-source data mining
platform for analytics, reporting
and data integration
• KNIME development began in 2004
by a software team at the University
of Konstanz led by Michael Berthold,
initially for research in the
pharmaceutical industry
• It gained wider use from 2006 and
grew rapidly, entering Gartner's
Magic Quadrant for Data Science
Platforms in 2017
KNIME
121
KNIME
122
• Development began in 2001 by Ralf Klinkenberg,
Ingo Mierswa and Simon Fischer at the Artificial
Intelligence Unit of the University of Dortmund,
written in Java
• Open source under the AGPL (GNU Affero General
Public License) version 3
• Recognized as a leading data mining and data
analytics tool by research firms including IDC,
Gartner, KDnuggets, and others
RapidMiner
123
RapidMiner
124
1. Attribute: a characteristic or feature of the
data that describes a process or
situation
• ID, regular attribute
2. Target attribute: the attribute the data
mining process aims to fill
• Label, cluster, weight
Attribute Roles in RapidMiner
125
1. nominal: categorical values
2. binominal: nominal with two values
3. polynominal: nominal with more than two values
4. numeric: numeric values in general
5. integer: whole numbers
6. real: real numbers
7. text: free unstructured text
8. date_time: date and time
9. date: date only
10. time: time only
Attribute Value Types in RapidMiner
126
1. Welcome perspective
2. Design perspective
3. Result perspective
Perspectives and Views
127
• The perspective where all processes are created and managed
• Switch to the Design perspective by clicking:
Design Perspective
128
• Process Control
Controls the process flow,
e.g. loops or conditional branches
• Utility
Groups subprocesses,
plus macros and loggers
• Repository Access
Reads and writes repositories
• Import
Reads data from various
external formats
• Export
Writes data to various external formats
• Data Transformation
Transforms data and metadata
• Modelling
Data mining processes such as classification, regression, clustering, association, etc
• Evaluation
Measures model quality and performance
Operator View
129
View Proses
130
• Proses data mining pada dasarnya adalah proses analisis yang
berisi alur kerja dari komponen data mining
• Komponen dari proses ini disebut operator, yang memiliki:
1. Input
2. Output
3. Aksi yang dilakukan
4. Parameter yang diperlukan
• Sebuah operator bisa disambungkan melalui port masukan
(kiri) dan port keluaran (kanan)
• Indikator status dari operator:
• Lampu status: merah (tak tersambung), kuning (lengkap tetapi belum
dijalankan), hijau (sudah berhasil dijalankan)
• Segitiga warning: bila ada pesan status
• Breakpoint: bila ada breakpoint sebelum/sesudahnya
• Comment: bila ada komentar
Operator dan Proses
131
• Operator kadang memerlukan parameter untuk bisa
berfungsi
• Setelah operator dipilih di view Proses, parameternya
ditampilkan di view ini
View Parameter
132
• View Help menampilkan deskripsi dari operator
• View Comment menampilkan komentar yang dapat
diedit terhadap operator
View Help dan View Comment
133
Kumpulan dan rangkaian fungsi-fungsi (operator)
yang bisa disusun secara visual (visual programming)
Mendesain Proses
134
Proses dapat dijalankan dengan:
• Menekan tombol Play
• Memilih menu Process → Run
• Menekan kunci F11
Menjalankan Proses
135
Melihat Hasil
136
View Problems dan View Log
137
• Instal Rapidminer versi 9
• Registrasi account di rapidminer.com dan dapatkan lisensi
Educational Program untuk mengolah data tanpa batasan record
Instalasi dan Registrasi Lisensi Rapidminer
138
139
2.2 Penerapan Proses Data Mining
140
1. Himpunan
Data
(Pahami dan
Persiapkan Data)
2. Metode
Data Mining
(Pilih Metode
Sesuai Karakter Data)
3. Pengetahuan
(Pahami Model dan
Pengetahuan yang Sesuai)
4. Evaluation
(Analisis Model dan
Kinerja Metode)
Proses Data Mining
141
DATA PREPROCESSING
Data Cleaning
Data Integration
Data Reduction
Data Transformation
MODELING
Estimation
Prediction
Classification
Clustering
Association
MODEL
Formula
Tree
Cluster
Rule
Correlation
KINERJA
Akurasi
Tingkat Error
Jumlah Cluster
MODEL
Atribute/Faktor
Korelasi
Bobot
1. Lakukan training pada data golf (ambil
dari repositories rapidminer) dengan
menggunakan algoritma decision tree
2. Tampilkan himpunan data (dataset) dan
pengetahuan (model tree) yang
terbentuk
Latihan: Rekomendasi Main Golf
142
143
144
145
146
147
148
149
150
151
Latihan: Penentuan Jenis Bunga Iris
152
1. Lakukan training pada data Bunga Iris (ambil dari
repositories rapidminer) dengan menggunakan algoritma
decision tree
2. Tampilkan himpunan data (dataset) dan pengetahuan
(model tree) yang terbentuk
Latihan: Klastering Jenis Bunga Iris
153
1. Lakukan training pada data Bunga Iris (ambil dari
repositories rapidminer) dengan algoritma k-Means
2. Tampilkan himpunan data (dataset) dan pengetahuan
(model tree) yang terbentuk
3. Tampilkan grafik dari cluster yang terbentuk
Latihan: Penentuan Mine/Rock
154
1. Lakukan training pada data Sonar (ambil dari
repositories rapidminer) dengan menggunakan
algoritma decision tree (C4.5)
2. Tampilkan himpunan data (dataset) dan pengetahuan
(model tree) yang terbentuk
1. Lakukan training pada data Contact Lenses (contact-
lenses.xls) dengan menggunakan algoritma decision tree
2. Gunakan operator Read Excel (on the fly) atau langsung
menggunakan fitur Import Data (persistent)
3. Tampilkan himpunan data (dataset) dan pengetahuan
(model tree) yang terbentuk
Latihan: Rekomendasi Contact Lenses
155
Read Excel Operator
156
Import Data Function
157
1. Lakukan training pada data CPU (cpu.xls) dengan
menggunakan algoritma linear regression
2. Lakukan pengujian terhadap data baru (cpu-
testing.xls), untuk model yang dihasilkan dari
tahapan 1. Data baru berisi 10 setting konfigurasi,
yang belum diketahui berapa performancenya
3. Amati hasil estimasi performance dari 10 setting
konfigurasi di atas
Latihan: Estimasi Performance CPU
158
Estimasi Performance cpu-testing.xls
159
cpu-testing.xls
cpu.xls
Performance CPU = 0.038 * MYCT
+ 0.017 * MMIN
+ 0.004 * MMAX
+ 0.603 * CACH
+ 1.291 * CHMIN
+ 0.906 * CHMAX
- 43.975
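Once trained, the model above is just a linear formula, so estimating a new configuration is only a matter of substituting attribute values. A minimal sketch (the configuration values below are made up for illustration, not taken from cpu-testing.xls):

```python
# The fitted linear model from the slide, as a plain function
def estimate_cpu_performance(myct, mmin, mmax, cach, chmin, chmax):
    return (0.038 * myct + 0.017 * mmin + 0.004 * mmax
            + 0.603 * cach + 1.291 * chmin + 0.906 * chmax
            - 43.975)

# Hypothetical configuration (illustrative values only)
perf = estimate_cpu_performance(myct=125, mmin=256, mmax=6000,
                                cach=16, chmin=4, chmax=32)
```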
1. Lakukan training pada data pemilu
(datapemilukpu.xls) dengan algoritma yang
tepat
2. Data bisa ditarik dari Import Data atau
operator Read Excel
3. Tampilkan himpunan data (dataset) dan
pengetahuan (pola/model) yang terbentuk
4. Gunakan model yang dihasilkan untuk
memprediksi datapemilukpu-testing.xls
Latihan: Prediksi Elektabilitas Caleg
160
Proses Prediksi Elektabilitas Caleg
161
1. Lakukan training pada data konsumsi minyak
(HeatingOil.csv)
• Dataset jumlah konsumsi minyak untuk alat pemanas ruangan di
rumah per tahun per rumah
• Atribut:
• Insulation: Ketebalan insulasi rumah
• Temperatur: Suhu udara sekitar rumah
• Heating Oil: Jumlah konsumsi minyak per tahun per rumah
• Number of Occupant: Jumlah penghuni rumah
• Average Age: Rata-rata umur penghuni rumah
• Home Size: Ukuran rumah
2. Gunakan operator Set Role untuk memilih Label (Heating
Oil), tidak langsung dipilih pada saat Import Data
3. Pilih metode yang tepat supaya menghasilkan model
4. Apply model yang dihasilkan ke data pelanggan baru di file
HeatingOil-Scoring.csv, supaya kita bisa mengestimasi
berapa kebutuhan konsumsi minyak mereka, untuk
mengatur stok penjualan minyak
Latihan: Estimasi Konsumsi Minyak
162
Heating Oil = 3.323 * Insulation - 0.869 * Temperature + 1.968 * Avg_Age
+ 3.173 * Home_Size + 134.511
Proses Estimasi Konsumsi Minyak
163
1. Lakukan training pada data konsumsi minyak
(HeatingOil.csv)
• Dataset jumlah konsumsi minyak untuk alat pemanas
ruangan di rumah per tahun per rumah
• Atribut:
• Insulation: Ketebalan insulasi rumah
• Temperatur: Suhu udara sekitar rumah
• Heating Oil: Jumlah konsumsi minyak per tahun per rumah
• Number of Occupant: Jumlah penghuni rumah
• Average Age: Rata-rata umur penghuni rumah
• Home Size: Ukuran rumah
2. Tujuannya ingin mendapatkan informasi tentang
atribut apa saja yang paling berpengaruh pada
konsumsi minyak
Latihan: Matrix Correlation Konsumsi Minyak
164
165
Tingkat Korelasi 4 Atribut terhadap
Konsumsi Minyak
166
[Diagram korelasi: Jumlah Penghuni Rumah, Rata-Rata Umur, Ketebalan Insulasi Rumah, dan Temperatur terhadap Konsumsi Minyak, dengan nilai korelasi 0.848, -0.774, 0.736, dan 0.381]
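The values in the diagram above are Pearson correlation coefficients, each computed between one attribute column and the Heating Oil column. A minimal sketch with made-up toy numbers (RapidMiner's Correlation Matrix operator computes this for every attribute pair at once):

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length columns."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Toy columns (made up): consumption rising with insulation thickness
insulation = [2, 4, 6, 8, 10]
heating_oil = [150, 180, 210, 260, 300]
r = pearson_r(insulation, heating_oil)   # strongly positive, close to +1
```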
1. Lakukan training pada data transaksi
(transaksi.xlsx)
2. Pilih metode yang tepat supaya
menghasilkan pola
Latihan: Aturan Asosiasi Data Transaksi
167
168
1. Lakukan training pada data kelulusan
mahasiswa (datakelulusanmahasiswa.xls)
2. Gunakan operator Split Data untuk
memecah data secara otomatis menjadi
dua dengan perbandingan 0.9:0.1, di mana
0.9 untuk training dan 0.1 untuk testing
3. Pilih metode yang tepat supaya
menghasilkan pola yang bisa menguji data
testing 10%
Latihan: Klasifikasi Data Kelulusan Mahasiswa
169
170
1. Lakukan training pada data Harga Saham
(hargasaham-training.xls) dengan
menggunakan algoritma yang tepat
2. Tampilkan himpunan data (dataset) dan
pengetahuan (model regresi) yang
terbentuk
3. Lakukan pengujian terhadap data baru
(hargasaham-testing.xls), untuk model
yang dihasilkan dari tahapan 1
4. Lakukan visualisasi berupa grafik dari data
yang terbentuk dengan menggunakan Line
atau Spline
Latihan: Forecasting Harga Saham
171
172
173
Latihan: Forecasting Harga Saham (Univariat)
174
• Window size: Determines how many “attributes”
are created for the cross-sectional data
• Each row of the original time series within the window
width will become a new attribute
• We choose w = 6
• Step size: Determines how to advance the window
• Let us use s = 1
• Horizon: Determines how far out to make the
forecast
• If the window size is 6 and the horizon is 1, then the
seventh row of the original time series becomes the first
sample for the “label” variable
• Let us use h = 1
Parameter dari Windowing
175
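The windowing transformation described above can be sketched as a plain list operation; with w = 6, s = 1, h = 1 the seventh value of the series becomes the label of the first row, exactly as stated (the function name is mine):

```python
def window_series(series, window_size=6, step=1, horizon=1):
    """Turn a univariate series into cross-sectional (attributes, label) rows.

    Each row takes window_size consecutive values as attributes and the
    value horizon steps past the window as the label."""
    rows = []
    i = 0
    while i + window_size + horizon - 1 < len(series):
        attrs = series[i:i + window_size]
        label = series[i + window_size + horizon - 1]
        rows.append((attrs, label))
        i += step
    return rows

prices = [10, 11, 12, 13, 14, 15, 16, 17]
rows = window_series(prices, window_size=6, step=1, horizon=1)
```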
• Lakukan training dengan menggunakan
linear regression pada dataset hargasaham-
training-uni.xls
• Gunakan Split Data untuk memisahkan
dataset di atas, 90% training dan 10% untuk
testing
• Harus dilakukan proses Windowing pada
dataset
• Plot grafik antara label dan hasil prediksi
dengan menggunakan chart
Latihan
176
Forecasting Harga Saham (Data Lampau)
177
Forecasting Harga Saham (Data Masa Depan)
178
179
1. Lakukan training dengan algoritma yang
tepat pada dataset: creditapproval-
training.xls
2. Ujicoba model yang dibentuk dari training di
atas ke dataset di bawah: creditapproval-
testing.xls
Latihan: Penentuan Kelayakan Kredit
180
1. Lakukan training pada data kanker
payudara (breasttissue.xls)
2. Dataset adalah di sheet 2, sedangkan sheet
1 berisi penjelasan tentang data
3. Bagi dataset dengan menggunakan
operator Split Data, 90% untuk training dan
10% untuk testing
4. Pilih metode yang tepat supaya
menghasilkan pola, analisis pola yang
dihasilkan
Latihan: Deteksi Kanker Payudara
181
1. Lakukan training pada data serangan
jaringan (intrusion-training.xls)
2. Pilih metode yang tepat supaya
menghasilkan pola
Latihan: Deteksi Serangan Jaringan
182
1. Lakukan training pada data resiko kredit
(CreditRisk.csv)
(http://romisatriawahono.net/lecture/dm/dataset/)
2. Pilih metode yang tepat supaya
menghasilkan pola
Latihan: Klasifikasi Resiko Kredit
183
1. Lakukan training pada data Music Genre
(musicgenre-small.csv)
2. Pilih metode yang tepat supaya
menghasilkan pola
Latihan: Klasifikasi Music Genre
184
| NIP | Gender | Universitas | Program Studi | IPK | Usia | Hasil Penjualan | Status Keluarga | Jumlah Anak | Kota Tinggal |
|---|---|---|---|---|---|---|---|---|---|
| 1001 | L | UI | Komunikasi | 3.1 | 21 | 100jt | Single | 0 | Jakarta |
| 1002 | P | UNDIP | Informatika | 2.9 | 26 | 50jt | Menikah | 1 | Bekasi |
| … | … | … | … | … | … | … | … | … | … |
Data Profile dan Kinerja Marketing
185
| NIP | Gender | Hasil Penjualan Produk A | Hasil Penjualan Produk B | Hasil Penjualan Layanan C | Hasil Penjualan Layanan D | Total Hasil Penjualan |
|---|---|---|---|---|---|---|
| 1001 | L | 10 | 20 | 50 | 30 | 100jt |
| 1002 | P | 10 | 10 | 5 | 25 | 50jt |
| … | … | … | … | … | … | … |
Mana Atribut yang Layak jadi Class dan Tidak?
Data Profile dan Kinerja Marketing
186
| Tahun | Total Hasil Penjualan | Total Pengeluaran Marketing | Total Keuntungan |
|---|---|---|---|
| 1990 | 100jt | 98jt | 2jt |
| 1991 | 120jt | 100jt | 20jt |
| … | … | … | … |
| NIP | Gender | Universitas | Program Studi | Absensi | Usia | Jumlah Penelitian | Status Keluarga | Disiplin | Kota Tinggal |
|---|---|---|---|---|---|---|---|---|---|
| 1001 | L | UI | Komunikasi | 98% | 21 | 3 | Single | Baik | Jakarta |
| 1002 | P | UNDIP | Informatika | 50% | 26 | 4 | Menikah | Buruk | Bekasi |
| … | … | … | … | … | … | … | … | … | … |
Data Profil dan Kinerja Dosen
187
| NIP | Gender | Universitas | Program Studi | Jumlah Publikasi Jurnal | Jumlah Publikasi Konferensi | Total Publikasi Penelitian |
|---|---|---|---|---|---|---|
| 1001 | L | UI | Komunikasi | 5 | 3 | 8 |
| 1002 | P | UNDIP | Informatika | 2 | 1 | 3 |
| … | … | … | … | … | … | … |
1. Dataset – Methods – Knowledge
1. Dataset Main Golf (Klasifikasi)
2. Dataset Iris (Klasifikasi)
3. Dataset Iris (Klastering)
4. Dataset CPU (Estimasi)
5. Dataset Pemilu (Klasifikasi)
6. Dataset Heating Oil (Asosiasi, Estimasi)
7. Dataset Transaksi (Association)
8. Dataset Harga Saham (Forecasting) (Uni dan Multi)
Competency Check
188
• Pahami berbagai dataset yang ada di folder
dataset
• Gunakan rapidminer untuk mengolah
dataset tersebut sehingga menjadi
pengetahuan
• Pilih algoritma yang sesuai dengan jenis data
pada dataset
Tugas: Mencari dan Mengolah Dataset
189
1. Pahami dan kuasai satu metode data mining dari berbagai
literature:
1. Naïve Bayes
2. k-Nearest Neighbor
3. k-Means
4. C4.5
5. Neural Network
6. Logistic Regression
7. FP Growth
8. Fuzzy C-Means
9. Self-Organizing Map
10. Support Vector Machine
2. Rangkumkan dengan detail dalam bentuk slide,
dengan format:
1. Definisi (Solid dan Sistematis)
2. Tahapan Algoritma (lengkap dengan formulanya)
3. Penerapan Tahapan Algoritma untuk Studi Kasus Dataset Main
Golf, Iris, Transaksi, CPU, dsb
(hitung manual (gunakan excel) dan tidak dengan menggunakan
rapidminer, harus sinkron dengan tahapan algoritma)
3. Kirimkan slide dan excel ke romi@brainmatics.com, sehari
sebelum mata kuliah berikutnya
4. Presentasikan di depan kelas pada mata kuliah berikutnya
dengan bahasa manusia yang baik dan benar
Tugas: Menguasai Satu Metode DM
190
1. Kembangkan Java Code dari algoritma yang dipilih
2. Gunakan hanya 1 class (file) dan beri nama sesuai
nama algoritma, boleh membuat banyak method
dalam class tersebut
3. Buat account di Trello.Com dan register ke
https://trello.com/b/ZOwroEYg/course-assignment
4. Buat card dengan nama sendiri dan upload semua
file (pptx, xlsx, pdf, etc) laporan ke card tersebut
5. Deadline: sehari sebelum pertemuan berikutnya
Tugas: Kembangkan Code dari Algoritma DM
191
Algoritma k-Means
Format Template Tugas
192
• Rangkuman Definisi:
• K-means adalah ..... (John, 2016)
• K-Means adalah …. (Wedus, 2020)
• Kmeans adalah … (Telo, 2017)
• Kesimpulan makna dari k-means:
• asoidhjaihdiahdoisjhoi
Definisi
193
1. Siapkan dataset
2. Tentukan A dengan rumus A = x + y
3. Tentukan B dengan rumus B = d + e
4. Ulangi proses 1-2-3 sampai tidak ada perubahan
Tahapan Algoritma k-Means
194
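The steps above are placeholders (A = x + y); for orientation, here is a rough sketch of what the real k-Means loop looks like on toy 2-D points — assign each point to its nearest centroid, recompute each centroid as its cluster mean, and repeat until the assignments stop changing (all names are mine):

```python
import random

def kmeans(points, k, max_iters=100, seed=0):
    """Minimal k-means for 2-D points: assign each point to the nearest
    centroid, recompute centroids as cluster means, repeat to convergence."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    labels = [-1] * len(points)
    for _ in range(max_iters):
        new_labels = [
            min(range(k),
                key=lambda c: (p[0] - centroids[c][0]) ** 2
                            + (p[1] - centroids[c][1]) ** 2)
            for p in points
        ]
        if new_labels == labels:        # assignments stable -> converged
            break
        labels = new_labels
        for c in range(k):              # move each centroid to its cluster mean
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                centroids[c] = (sum(m[0] for m in members) / len(members),
                                sum(m[1] for m in members) / len(members))
    return labels, centroids

pts = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
labels, centroids = kmeans(pts, k=2)
```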
1. Siapkan dataset
195
• blablabla
2. Tentukan A
196
• blablabla
3. Tentukan B
197
• blablabla
4. Iterasi 1
198
• blablabla
4. Iterasi 2 ... dst
199
2.3 Evaluasi Model Data Mining
200
1. Himpunan
Data
(Pahami dan
Persiapkan Data)
2. Metode
Data Mining
(Pilih Metode
Sesuai Karakter Data)
3. Pengetahuan
(Pahami Model dan
Pengetahuan yang Sesuai)
4. Evaluation
(Analisis Model dan
Kinerja Metode)
Proses Data Mining
201
DATA PREPROCESSING
Data Cleaning
Data Integration
Data Reduction
Data Transformation
MODELING
Estimation
Prediction
Classification
Clustering
Association
MODEL
Formula
Tree
Cluster
Rule
Correlation
KINERJA
Akurasi
Tingkat Error
Jumlah Cluster
MODEL
Atribute/Faktor
Korelasi
Bobot
1. Estimation:
• Error: Root Mean Square Error (RMSE), MSE, MAPE, etc
2. Prediction/Forecasting (Prediksi/Peramalan):
• Error: Root Mean Square Error (RMSE) , MSE, MAPE, etc
3. Classification:
• Confusion Matrix: Accuracy
• ROC Curve: Area Under Curve (AUC)
4. Clustering:
• Internal Evaluation: Davies–Bouldin index, Dunn index,
• External Evaluation: Rand measure, F-measure, Jaccard index,
Fowlkes–Mallows index, Confusion matrix
5. Association:
• Lift Charts: Lift Ratio
• Precision and Recall (F-measure)
Evaluasi Kinerja Model Data Mining
202
• Pembagian dataset, perbandingan 90:10 atau
80:20:
• Data Training
• Data Testing
• Data training untuk pembentukan model, dan
data testing digunakan untuk pengujian model
• Pemisahan data training dan testing
1. Data dipisahkan secara manual
2. Data dipisahkan otomatis dengan operator Split Data
3. Data dipisahkan otomatis dengan X Validation
Evaluasi Model Data Mining
203
2.3.1 Pemisahan Data Manual
204
• Gunakan dataset di bawah:
• creditapproval-training.xls: untuk membuat model
• creditapproval-testing.xls: untuk menguji model
• Data di atas terpisah dengan perbandingan:
data testing (10%) dan data training (90%)
• Data training sebagai pembentuk model, dan data
testing untuk pengujian model, ukur performancenya
Latihan: Penentuan Kelayakan Kredit
205
• pred MACET- true MACET: Jumlah data yang diprediksi
macet dan kenyataannya macet (TP)
• pred LANCAR-true LANCAR: Jumlah data yang diprediksi
lancar dan kenyataannya lancar (TN)
• pred MACET-true LANCAR: Jumlah data yang diprediksi
macet tapi kenyataannya lancar (FP)
• pred LANCAR-true MACET: Jumlah data yang diprediksi
lancar tapi kenyataannya macet (FN)
Confusion Matrix  Accuracy
206
Accuracy = (TP + TN) / (TP + TN + FP + FN) = (53 + 37) / (53 + 37 + 4 + 6) = 90/100 = 90%
• Precision: exactness – what % of tuples
that the classifier labeled as positive are
actually positive
• Recall: completeness – what % of positive
tuples did the classifier label as positive?
• Perfect score is 1.0
• Inverse relationship between precision & recall
• F measure (F1 or F-score): harmonic
mean of precision and recall,
• Fß: weighted measure of precision
and recall
• assigns ß times as much weight
to recall as to precision
Precision and Recall, and F-measures
207
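These measures reduce to simple ratios over confusion-matrix counts. A sketch using the TP = 53, FP = 4, FN = 6 counts from the credit-approval example earlier:

```python
def precision_recall_f(tp, fp, fn, beta=1.0):
    """Precision, recall, and F-beta (beta = 1 gives the harmonic-mean F1)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta ** 2
    f = (1 + b2) * precision * recall / (b2 * precision + recall)
    return precision, recall, f

# TP = 53, FP = 4, FN = 6, taken from the confusion matrix shown earlier
prec, rec, f1 = precision_recall_f(53, 4, 6)
```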
Binary classification should be both sensitive and
specific as much as possible:
1. Sensitivity measures the proportion of true
’positives’ that are correctly identified (True
Positive Rate (TP Rate) or Recall)
2. Specificity measures the proportion of true
’negatives’ that are correctly identified (True
Negative Rate (TN Rate))
Sensitivity and Specificity
208
We need to know the probability that the classifier will
give the correct diagnosis, but the sensitivity and
specificity do not give us this information
• Positive Predictive Value (PPV) is the proportion of cases
with ’positive’ test results that are correctly diagnosed
• Negative Predictive Value (NPV) is the proportion of cases
with ’negative’ test results that are correctly diagnosed
PPV and NPV
209
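All four quantities come straight from the 2×2 confusion matrix. A sketch, again using the TP = 53, TN = 37, FP = 4, FN = 6 counts from the earlier example:

```python
def diagnostic_measures(tp, tn, fp, fn):
    """Sensitivity, specificity, PPV, and NPV from a 2x2 confusion matrix."""
    return {
        "sensitivity": tp / (tp + fn),   # true positive rate (recall)
        "specificity": tn / (tn + fp),   # true negative rate
        "ppv": tp / (tp + fp),           # positive predictive value (precision)
        "npv": tn / (tn + fn),           # negative predictive value
    }

m = diagnostic_measures(tp=53, tn=37, fp=4, fn=6)
```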
• ROC (Receiver Operating Characteristics) curves: for visual
comparison of classification models
• Originated from signal detection theory
• ROC curves are two-dimensional graphs in which the TP rate is
plotted on the Y-axis and the FP rate is plotted on the X-axis
• ROC curve depicts relative trade-offs between benefits (’true
positives’) and costs (’false positives’)
• Two types of ROC curves: discrete and continuous
Kurva ROC - AUC (Area Under Curve)
210
Kurva ROC - AUC (Area Under Curve)
211
1. 0.90 - 1.00 = excellent classification
2. 0.80 - 0.90 = good classification
3. 0.70 - 0.80 = fair classification
4. 0.60 - 0.70 = poor classification
5. 0.50 - 0.60 = failure
(Gorunescu, 2011)
Guide for Classifying the AUC
212
• Gunakan dataset: breasttissue.xls
• Split data dengan perbandingan:
data testing (10%) dan data training (90%)
• Ukur performance
(Accuracy dan Kappa)
Latihan: Prediksi Kanker Payudara
213
• The (Cohen’s) Kappa statistics is a more vigorous
measure than the ‘percentage correct prediction’
calculation, because Kappa considers the correct
prediction that is occurring by chance
• Kappa is essentially a measure of how well the
classifier performed as compared to how well it
would have performed simply by chance
• A model has a high Kappa score if there is a big
difference between the accuracy and the null error
rate (Markham, K., 2014)
• Kappa is an important measure on classifier
performance, especially on imbalanced data set
Kappa Statistics
214
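Kappa compares the observed accuracy po with the accuracy pe expected by chance alone: kappa = (po − pe) / (1 − pe). A sketch on the earlier credit-approval counts:

```python
def cohens_kappa(tp, tn, fp, fn):
    """Cohen's kappa: agreement between prediction and truth beyond chance."""
    n = tp + tn + fp + fn
    po = (tp + tn) / n                          # observed accuracy
    pe = ((tp + fp) * (tp + fn)                 # chance agreement on positives
          + (fn + tn) * (fp + tn)) / n ** 2     # chance agreement on negatives
    return (po - pe) / (1 - pe)

kappa = cohens_kappa(tp=53, tn=37, fp=4, fn=6)   # po = 0.90, pe = 0.5126
```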
• Gunakan dataset di bawah:
• hargasaham-training.xls: untuk membuat model
• hargasaham-testing.xls: untuk menguji model
• Data di atas terpisah dengan perbandingan:
data testing (10%) dan data training (90%)
• Jadikan data training sebagai pembentuk
model/pola/knowledge, dan data testing untuk
pengujian model
• Ukur performance
Latihan: Prediksi Harga Saham
215
216
Root Mean Square Error
217
• The square root of the mean/average of the square of all of
the error
• The use of RMSE is very common and it makes an excellent
general purpose error metric for numerical predictions
• To construct the RMSE, we first need to determine the
residuals
• Residuals are the difference between the actual values and the
predicted values
• We denote them by e_i = y_i − ŷ_i
• where y_i is the observed value for the ith observation
• and ŷ_i is the predicted value
• They can be positive or negative as the predicted value under
or over estimates the actual value
• You then use the RMSE as a measure of the spread of the y
values about the predicted y value
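The computation above fits in a few lines (the numbers below are toy values, not from the stock dataset):

```python
import math

def rmse(actual, predicted):
    """Root mean square error: the square root of the mean squared residual."""
    residuals = [a - p for a, p in zip(actual, predicted)]
    return math.sqrt(sum(e * e for e in residuals) / len(residuals))

actual = [100.0, 150.0, 200.0]
predicted = [110.0, 140.0, 200.0]   # residuals: -10, 10, 0
error = rmse(actual, predicted)
```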
Latihan: Klastering Jenis Bunga Iris
218
| k | DBI |
|---|---|
| 3 | 0.666 |
| 4 | 0.764 |
| 5 | 0.806 |
| 6 | 0.910 |
| 7 | 0.999 |
1. Lakukan training pada data iris
(ambil dari repositories
rapidminer) dengan menggunakan
algoritma clustering k-means
2. Ukur performance-nya dengan
Cluster Distance Performance, cek
dan analisis nilai yang keluar
Davies Bouldin Indeks (DBI)
3. Lakukan pengubahan pada nilai k
pada parameter k-means dengan
memasukkan nilai: 3, 4, 5, 6, 7
219
• The Davies–Bouldin index (DBI) (introduced by David L. Davies
and Donald W. Bouldin in 1979) is a metric for evaluating
clustering algorithms
• This is an internal evaluation scheme, where the validation of
how well the clustering has been done is made using
quantities and features inherent to the dataset
• As a function of the ratio of the within cluster scatter, to the
between cluster separation, a lower value will mean that the
clustering is better
• This affirms the idea that no cluster has to be similar to
another, and hence the best clustering scheme essentially
minimizes the Davies–Bouldin index
• This index thus defined is an average over all the i clusters,
and hence a good measure of deciding how many clusters
actually exists in the data is to plot it against the number of
clusters it is calculated over
• The number i for which this value is the lowest is a good
measure of the number of clusters the data could be ideally
classified into
Davies–Bouldin index (DBI)
220
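The definition above can be sketched directly: within-cluster scatter divided by between-cluster separation, averaged over the worst pairing for each cluster. A toy check confirms that well-separated clusters score lower (better):

```python
import math

def _dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def davies_bouldin(clusters):
    """DBI for clusters given as lists of points.

    S_i = mean distance of members to their centroid,
    R_ij = (S_i + S_j) / d(centroid_i, centroid_j),
    DBI  = mean over i of max over j != i of R_ij (lower is better)."""
    cents = [tuple(sum(dim) / len(c) for dim in zip(*c)) for c in clusters]
    scatter = [sum(_dist(p, cent) for p in c) / len(c)
               for c, cent in zip(clusters, cents)]
    k = len(clusters)
    return sum(
        max((scatter[i] + scatter[j]) / _dist(cents[i], cents[j])
            for j in range(k) if j != i)
        for i in range(k)
    ) / k

tight = [[(0, 0), (0, 1)], [(10, 10), (10, 11)]]   # two well-separated clusters
dbi = davies_bouldin(tight)
```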
2.3.2 Pemisahan Data Otomatis dengan
Operator Split Data
221
• The Split Data operator takes a dataset as its input
and delivers the subsets of that dataset through its
output ports
• The sampling type parameter decides how the
examples should be shuffled in the resultant
partitions:
1. Linear sampling: Divides the dataset into partitions
without changing the order of the examples
2. Shuffled sampling: Builds random subsets of the
dataset
3. Stratified sampling: Builds random subsets and
ensures that the class distribution in the subsets is
the same as in the whole dataset
Split Data Otomatis
222
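The three sampling types can be sketched as plain list operations (names are mine; RapidMiner's operator works on ExampleSets rather than Python lists):

```python
import random
from collections import defaultdict

def split_data(examples, ratio, sampling="shuffled", label=None, seed=0):
    """Split examples into (train, test); train gets about ratio * n rows.

    'linear' keeps the original order, 'shuffled' randomizes it, and
    'stratified' splits each class separately so both partitions keep
    roughly the same class distribution (label maps example -> class)."""
    rng = random.Random(seed)
    if sampling == "stratified":
        by_class = defaultdict(list)
        for ex in examples:
            by_class[label(ex)].append(ex)
        train, test = [], []
        for group in by_class.values():
            rng.shuffle(group)
            cut = round(len(group) * ratio)
            train += group[:cut]
            test += group[cut:]
        return train, test
    ordered = list(examples)
    if sampling == "shuffled":
        rng.shuffle(ordered)
    cut = round(len(ordered) * ratio)   # 'linear': original order kept
    return ordered[:cut], ordered[cut:]

data = [("a", 1), ("b", 1), ("c", 1), ("d", 0), ("e", 0), ("f", 0)]
train, test = split_data(data, 0.5, sampling="stratified", label=lambda ex: ex[1])
```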
223
1. Dataset: datakelulusanmahasiswa.xls
2. Pisahkan data menjadi dua secara otomatis
(Split Data): data testing (10%) dan data training
(90%)
3. Ujicoba parameter pemisahan data baik
menggunakan Linear Sampling, Shuffled
Sampling dan Stratified Sampling
4. Jadikan data training sebagai pembentuk
model/pola/knowledge, dan data testing untuk
pengujian model
5. Terapkan algoritma yang sesuai dan ukur
performance dari model yang dibentuk
Latihan: Prediksi Kelulusan Mahasiswa
224
Proses Prediksi Kelulusan Mahasiswa
225
1. Dataset: HeatingOil.csv
2. Pisahkan data menjadi dua secara otomatis
(Split Data): data testing (10%) dan data
training (90%)
3. Jadikan data training sebagai pembentuk
model/pola/knowledge, dan data testing
untuk pengujian model
4. Terapkan algoritma yang sesuai dan ukur
performance dari model yang dibentuk
Latihan: Estimasi Konsumsi Minyak
226
2.3.3 Pemisahan Data dan Evaluasi Model
Otomatis dengan Cross-Validation
227
• Metode cross-validation digunakan untuk
menghindari overlapping pada data testing
• Tahapan cross-validation:
1. Bagi data menjadi k subset yg berukuran sama
2. Gunakan setiap subset untuk data testing dan sisanya
untuk data training
• Disebut juga dengan k-fold cross-validation
• Seringkali subset dibuat stratified (bertingkat)
sebelum cross-validation dilakukan, karena
stratifikasi akan mengurangi variansi dari estimasi
Metode Cross-Validation
228
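The two steps above — partition the rows into k equal subsets, then rotate which subset is held out — can be sketched over row indices:

```python
def kfold_indices(n, k=10):
    """Partition row indices 0..n-1 into k folds; each fold serves once as
    the test set while the remaining k-1 folds form the training set."""
    folds = [list(range(i, n, k)) for i in range(k)]
    return [
        ([idx for j, fold in enumerate(folds) if j != i for idx in fold],
         folds[i])
        for i in range(k)
    ]

splits = kfold_indices(n=20, k=10)   # 10 (train, test) index pairs
```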
Orange: k-subset (data testing)
10 Fold Cross-Validation
229
| Eksperimen | Akurasi |
|---|---|
| 1 | 93% |
| 2 | 91% |
| 3 | 90% |
| 4 | 93% |
| 5 | 93% |
| 6 | 91% |
| 7 | 94% |
| 8 | 93% |
| 9 | 91% |
| 10 | 90% |
| Akurasi Rata-Rata | 92% |
• Metode evaluasi standard: stratified 10-fold
cross-validation
• Mengapa 10? Hasil dari berbagai percobaan
yang ekstensif dan pembuktian teoritis,
menunjukkan bahwa 10-fold cross-validation
adalah pilihan terbaik untuk mendapatkan
hasil validasi yang akurat
• 10-fold cross-validation akan mengulang
pengujian sebanyak 10 kali dan hasil
pengukuran adalah nilai rata-rata dari 10 kali
pengujian
10 Fold Cross-Validation
230
1. Lakukan training pada data pemilu (datapemilukpu.xls)
2. Lakukan pengujian dengan menggunakan 10-fold X
Validation
3. Ukur performance-nya dengan confusion matrix dan
ROC Curve
4. Lakukan ujicoba, ubah algoritma menjadi Naive Bayes,
k-NN, Random Forest (RF), Logistic Regression (LogR),
analisis mana algoritma yang menghasilkan model
yang lebih baik (akurasi tinggi)
Latihan: Prediksi Elektabilitas Caleg
231
| | C4.5 | NB | k-NN | LogR |
|---|---|---|---|---|
| Accuracy | 92.87% | 79.34% | 88.7% | |
| AUC | 0.934 | 0.849 | 0.5 | |
Latihan: Komparasi Prediksi Harga Saham
232
• Gunakan dataset harga saham
(hargasaham-training.xls)
• Lakukan pengujian dengan 10-fold X
Validation
• Lakukan ujicoba dengan mengganti
algoritma (GLM, LR, NN, DL, SVM),
catat hasil RMSE yang keluar
| | GLM | LR | NN | DL | SVM |
|---|---|---|---|---|---|
| RMSE | | | | | |
2.3.4 Komparasi Algoritma Data Mining
233
1. Estimation (Estimasi):
Linear Regression (LR), Neural Network (NN), Deep Learning (DL),
Support Vector Machine (SVM), Generalized Linear Model (GLM), etc
2. Forecasting (Prediksi/Peramalan):
Linear Regression (LR), Neural Network (NN), Deep Learning (DL),
Support Vector Machine (SVM), Generalized Linear Model (GLM), etc
3. Classification (Klasifikasi):
Decision Tree (CART, ID3, C4.5, Credal DT, Credal C4.5, Adaptative
Credal C4.5), Naive Bayes (NB), K-Nearest Neighbor (kNN), Linear
Discriminant Analysis (LDA), Logistic Regression (LogR), etc
4. Clustering (Klastering):
K-Means, K-Medoids, Self-Organizing Map (SOM), Fuzzy C-Means
(FCM), etc
5. Association (Asosiasi):
FP-Growth, A Priori, Coefficient of Correlation, Chi Square, etc
Metode Data Mining
234
1. Lakukan training pada data pemilu
(datapemilukpu.xls) dengan menggunakan
algoritma
1. Decision Tree (C4.5)
2. Naïve Bayes (NB)
3. K-Nearest Neighbor (K-NN)
2. Lakukan pengujian dengan menggunakan 10-fold
X Validation
Latihan: Prediksi Elektabilitas Caleg
235
| | DT | NB | K-NN |
|---|---|---|---|
| Accuracy | 92.45% | 77.46% | 88.72% |
| AUC | 0.851 | 0.840 | 0.5 |
236
1. Lakukan training pada data pemilu
(datapemilukpu.xls) dengan menggunakan
algoritma C4.5, NB dan K-NN
2. Lakukan pengujian dengan menggunakan 10-fold
X Validation
3. Ukur performance-nya dengan confusion matrix
dan ROC Curve
4. Uji beda dengan t-Test untuk mendapatkan
model terbaik
Latihan: Prediksi Elektabilitas Caleg
237
238
• Komparasi Accuracy dan AUC
• Uji Beda (t-Test)
Values with a colored background are smaller than alpha=0.050, which
indicate a probably significant difference between the mean values
• Urutan model terbaik: 1. C4.5 2. k-NN 3. NB
Hasil Prediksi Elektabilitas Caleg
239
| | C4.5 | NB | K-NN |
|---|---|---|---|
| Accuracy | 92.45% | 77.46% | 88.72% |
| AUC | 0.851 | 0.840 | 0.5 |
[Matriks hasil t-Test antar model: C4.5, NB, kNN]
• Komparasi Accuracy dan AUC
• Uji Beda (t-Test)
Values with a white background are higher than alpha=0.050, which
indicate a probably NO significant difference between the mean values
• Urutan model terbaik: 1. C4.5 1. kNN 2. NB
Hasil Prediksi Elektabilitas Caleg
240
| | C4.5 | NB | K-NN |
|---|---|---|---|
| Accuracy | 93.41% | 79.72% | 91.76% |
| AUC | 0.921 | 0.826 | 0.885 |
[Matriks hasil t-Test antar model: C4.5, NB, kNN]
Latihan: Komparasi Prediksi Harga Saham
241
1. GLM
2. LR
3. NN
4. DL dan SVM
• Gunakan dataset harga saham
(hargasaham-training.xls)
• Lakukan pengujian dengan 10-
fold X Validation
• Lakukan ujicoba dengan
mengganti algoritma (GLM,
LR, NN, DL, SVM), catat hasil
RMSE yang keluar
• Uji beda dengan t-Test
1. Statistik Deskriptif
• Nilai mean (rata-rata), standar deviasi,
varians, data maksimal, data minimal, dsb
2. Statistik Inferensi
• Perkiraan dan estimasi
• Pengujian Hipotesis
Analisis Statistik
242
| Penggunaan | Parametrik | Non Parametrik |
|---|---|---|
| Dua sampel saling berhubungan (Two Dependent Samples) | T Test, Z Test | Sign test, Wilcoxon Signed-Rank, Mc Nemar Change test |
| Dua sampel tidak berhubungan (Two Independent Samples) | T Test, Z Test | Mann-Whitney U test, Moses Extreme reactions, Chi-Square test, Kolmogorov-Smirnov test, Wald-Wolfowitz runs |
| Beberapa sampel berhubungan (Several Dependent Samples) | | Friedman test, Kendall W test, Cochran’s Q |
| Beberapa sampel tidak berhubungan (Several Independent Samples) | Anova test (F test) | Kruskal-Wallis test, Chi-Square test, Median test |
Statistik Inferensi (Pengujian Hipotesis)
243
• Metode parametrik dapat dilakukan jika
beberapa persyaratan dipenuhi, yaitu:
• Sampel yang dianalisis haruslah berasal dari
populasi yang berdistribusi normal
• Jumlah data cukup banyak
• Jenis data yang dianalisis adalah biasanya
interval atau rasio
Metode Parametrik
244
• Metode ini dapat dipergunakan secara lebih luas,
karena tidak mengharuskan datanya berdistribusi
normal
• Dapat dipakai untuk data nominal dan ordinal sehingga
sangat berguna bagi para peneliti sosial untuk meneliti
perilaku konsumen, sikap manusia, dsb
• Cenderung lebih sederhana dibandingkan dengan metode
parametrik
• Selain keuntungannya, berikut kelemahan metode non
parametrik:
• Tidak adanya sistematika yang jelas seperti metode
parametrik
• Terlalu sederhana sehingga sering meragukan
• Memakai tabel-tabel yang lebih bervariasi dibandingkan
dengan tabel-tabel standar pada metode parametrik
Metode Non Parametrik
245
•Ho = tidak ada perbedaan signifikan
•Ha = ada perbedaan signifikan
alpha=0.05
Bila p < 0.05, maka Ho ditolak
•Contoh: kasus p=0.03, maka dapat
ditarik kesimpulan?
Interpretasi Statistik
246
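As an example, a paired t-test on the per-fold accuracies of two models reduces to the statistic below; comparing |t| against the critical value 2.262 (df = 9, alpha = 0.05, two-tailed) decides whether Ho is rejected. The fold accuracies here are made up:

```python
import math

def paired_t_statistic(a, b):
    """t statistic for a paired test on two equal-length samples
    (for example, per-fold accuracies of two classifiers)."""
    d = [x - y for x, y in zip(a, b)]
    n = len(d)
    mean = sum(d) / n
    var = sum((x - mean) ** 2 for x in d) / (n - 1)   # sample variance
    return mean / math.sqrt(var / n)

# Made-up per-fold accuracies of two models over 10-fold cross-validation
acc_model_a = [0.93, 0.91, 0.90, 0.93, 0.93, 0.91, 0.94, 0.93, 0.91, 0.90]
acc_model_b = [0.80, 0.78, 0.77, 0.79, 0.81, 0.76, 0.80, 0.79, 0.78, 0.77]
t = paired_t_statistic(acc_model_a, acc_model_b)
# |t| far exceeds 2.262 here, so Ho (no difference) would be rejected
```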
1. Lakukan training pada data mahasiswa
(datakelulusanmahasiswa.xls) dengan
menggunakan C4.5, ID3, NB, K-NN, RF dan
LogR
2. Lakukan pengujian dengan menggunakan
10-fold X Validation
3. Uji beda dengan t-Test untuk mendapatkan
model terbaik
Latihan: Prediksi Kelulusan Mahasiswa
247
• Komparasi Accuracy dan AUC
• Uji Beda (t-Test)
• Urutan model terbaik: 1. C4.5 2. NB 2. k-NN 2. LogR
Hasil Prediksi Kelulusan Mahasiswa
248
| | C4.5 | NB | K-NN | LogR |
|---|---|---|---|---|
| Accuracy | 91.55% | 82.58% | 83.63% | 77.47% |
| AUC | 0.909 | 0.894 | 0.5 | 0.721 |
[Matriks hasil t-Test antar model: C4.5, NB, kNN, LogR]
1. Lakukan training pada data cpu (cpu.xls) dengan
menggunakan algoritma linear regression, neural
network dan support vector machine
2. Lakukan pengujian dengan XValidation
(numerical)
3. Ukur performance-nya dengan menggunakan
RMSE (Root Mean Square Error)
4. Urutan model terbaik: 1. LR 2. NN 3. SVM
Latihan: Estimasi Performance CPU
249
| | LR | NN | SVM |
|---|---|---|---|
| RMSE | 54.676 | 55.192 | 94.676 |
250
1. Lakukan training pada data minyak pemanas
(HeatingOil.csv) dengan menggunakan algoritma
linear regression, neural network dan support
vector machine, Deep Learning
2. Lakukan pengujian dengan XValidation
(numerical) dan Uji beda dengan t-Test
3. Ukur performance-nya dengan menggunakan
RMSE (Root Mean Square Error)
Latihan: Estimasi Konsumsi Minyak
251
| | LR | NN | SVM | DL |
|---|---|---|---|---|
| RMSE | | | | |
252
[Matriks hasil t-Test antar model: LR, NN, DL, SVM]
Urutan model terbaik:
1. NN dan DL
2. LR dan SVM
1. Lakukan training pada data pemilu (datapemilukpu.xls)
dengan menggunakan algoritma Decision Tree, Naive
Bayes, K-Nearest Neighbor, RandomForest, Logistic
Regression
2. Lakukan pengujian dengan menggunakan XValidation
3. Ukur performance-nya dengan confusion matrix dan
ROC Curve
4. Masukkan setiap hasil percobaan ke dalam file Excel
Latihan: Prediksi Elektabilitas Caleg
253
| | DT | NB | K-NN | RandFor | LogReg |
|---|---|---|---|---|---|
| Accuracy | 92.21% | 76.89% | 89.63% | | |
| AUC | 0.851 | 0.826 | 0.5 | | |
1. Lakukan training pada data harga saham
(hargasaham-training.xls) dengan neural network,
linear regression, support vector machine
2. Lakukan pengujian dengan menggunakan
XValidation
Latihan: Prediksi Harga Saham
254
| | LR | NN | SVM |
|---|---|---|---|
| RMSE | | | |
1. Lakukan training pada data iris (ambil dari
repositories rapidminer) dengan menggunakan
algoritma clustering k-means
2. Gunakan pilihan nilai untuk k, isikan dengan 3, 4, 5,
6, 7
3. Ukur performance-nya dengan Cluster Distance
Performance, dari analisis Davies Bouldin Indeks
(DBI), tentukan nilai k yang paling optimal
Latihan: Klastering Jenis Bunga Iris
255
| | k=3 | k=4 | k=5 | k=6 | k=7 |
|---|---|---|---|---|---|
| DBI | 0.666 | 0.764 | 0.806 | 0.910 | 0.999 |
• The Davies–Bouldin index (DBI) (introduced by David L. Davies
and Donald W. Bouldin in 1979) is a metric for evaluating
clustering algorithms
• This is an internal evaluation scheme, where the validation of
how well the clustering has been done is made using quantities
and features inherent to the dataset
• As a function of the ratio of the within cluster scatter, to the
between cluster separation, a lower value will mean that the
clustering is better
• This affirms the idea that no cluster has to be similar to another,
and hence the best clustering scheme essentially minimizes the
Davies–Bouldin index
• This index thus defined is an average over all the i clusters, and
hence a good measure of deciding how many clusters actually
exists in the data is to plot it against the number of clusters it is
calculated over
• The number i for which this value is the lowest is a good measure
of the number of clusters the data could be ideally classified into
Davies–Bouldin index (DBI)
256
1. Estimation:
• Error: Root Mean Square Error (RMSE), MSE, MAPE, etc
2. Prediction/Forecasting (Prediksi/Peramalan):
• Error: Root Mean Square Error (RMSE) , MSE, MAPE, etc
3. Classification:
• Confusion Matrix: Accuracy
• ROC Curve: Area Under Curve (AUC)
4. Clustering:
• Internal Evaluation: Davies–Bouldin index, Dunn index,
• External Evaluation: Rand measure, F-measure, Jaccard index,
Fowlkes–Mallows index, Confusion matrix
5. Association:
• Lift Charts: Lift Ratio
• Precision and Recall (F-measure)
Evaluasi Model Data Mining
257
1. Run experiments on all the datasets in the datasets
folder, using the various appropriate data mining
methods (estimation, prediction, classification,
clustering, association)
2. Combine testing with a training-testing data split
and testing with X-Validation
3. Measure the performance of the resulting models
using the measurement methods appropriate to the
chosen data mining method
4. Describe the experiment steps in detail, then perform
analysis and synthesis, and write a report as slides
5. Present it in front of the class
Assignment: Processing All Datasets
258
• Technical Paper:
• Title: Application and Comparison of Classification
Techniques in Controlling Credit Risk
• Author: Lan Yu, Guoqing Chen, Andy Koronios, Shiwu
Zhu, and Xunhua Guo
• Download:
http://romisatriawahono.net/lecture/dm/paper/
• Read and understand the paper above, then explain
what the researchers did in it:
1. Research Object
2. Research Problem
3. Research Objective
4. Research Method
5. Research Results
Assignment: Reviewing a Paper
259
• Technical Paper:
• Title: A Comparison Framework of Classification Models for
Software Defect Prediction
• Author: Romi Satria Wahono, Nanna Suryana Herman,
Sabrina Ahmad
• Publication: Adv. Sci. Lett. Vol. 20, No. 10-12, 2014
• Download: http://romisatriawahono.net/lecture/dm/paper
• Read and understand the paper above, then explain what
the researchers did in it:
1. Research Object
2. Research Problem
3. Research Objective
4. Research Method
5. Research Results
Assignment: Reviewing a Paper
260
• Technical Paper:
• Title: An experimental comparison of classification
algorithms for imbalanced credit scoring data sets
• Author: Iain Brown and Christophe Mues
• Publication: Expert Systems with Applications 39 (2012)
3446–3453
• Download: http://romisatriawahono.net/lecture/dm/paper
• Read and understand the paper above, then explain what
the researchers did in it:
1. Research Object
2. Research Problem
3. Research Objective
4. Research Method
5. Research Results
Assignment: Reviewing a Paper
261
• Find a dataset from our own surroundings
• Conduct a study comparing (at least) 5 machine
learning algorithms for mining knowledge from
that dataset
• Use significance tests (both parametric and
non-parametric) to analyze and rank the machine
learning algorithms
• Write a paper about the study
• Example comparison papers are available at:
http://romisatriawahono.net/lecture/dm/paper/method%20comparison/
• Upload all report files to a Card on Trello.Com
• Deadline: the day before the next lecture
Assignment: Writing a Research Paper
262
• Follow the template and example papers from:
http://journal.ilmukomputer.org
• Paper contents:
• Abstract: must state the object, problem, method, and results
• Introduction: background of the research problem and the paper's structure
• Related Work: related research
• Theoretical Foundation: foundations of the various theories used
• Proposed Method: the proposed method
• Experimental Results: the experimental results
• Conclusion: conclusions and future work
Paper Formatting
263
1. Dataset – Methods – Knowledge
1. Dataset Main Golf (Classification)
2. Dataset Iris (Classification)
3. Dataset Iris (Clustering)
4. Dataset CPU (Estimation)
5. Dataset Pemilu (Classification)
6. Dataset Heating Oil (Association)
7. Dataset Transaksi (Association)
8. Dataset Harga Saham (Forecasting)
2. Dataset – Methods – Knowledge – Evaluation
1. Manual
2. Data Split Operator
3. Cross Validation
3. Methods Comparison
• t-Test
4. Paper Reading
1. Lan Yu (DeLong Pearson Test)
2. Wahono (Friedman Test)
Competency Check
264
2.4 The Data Mining Process Based on
the CRISP-DM Methodology
265
• Industry, across its many different fields, needs a
standard process that supports the use of data
mining to solve business problems
• That process must be usable across industries
(cross-industry), be neutral with respect to
business, tools, and applications, and be able to
handle strategies for solving business problems
with data mining
• In 1996, one of the standard processes of the
data mining world was born, later known as: the
Cross-Industry Standard Process for Data Mining
(CRISP–DM) (Chapman, 2000)
Data Mining Standard Process
266
CRISP-DM
267
• Enunciate the project objectives and
requirements clearly in terms of the business
or research unit as a whole
• Translate these goals and restrictions into
the formulation of a data mining problem
definition
• Prepare a preliminary strategy for achieving
these objectives
• Designing what you are going to build
1. Business Understanding
268
• Collect the data
• Use exploratory data analysis to familiarize
yourself with the data and discover initial
insights
• Evaluate the quality of the data
• If desired, select interesting subsets that may
contain actionable patterns
2. Data Understanding
269
• Prepare from the initial raw data the final
data set that is to be used for all subsequent
phases
• Select the cases and variables you want to
analyze and that are appropriate for your
analysis
• Perform data cleaning, integration, reduction
and transformation, so it is ready for the
modeling tools
3. Data Preparation
270
• Select and apply appropriate modeling
techniques
• Calibrate model settings to optimize results
• Remember that often, several different
techniques may be used for the same data
mining problem
• If necessary, loop back to the data
preparation phase to bring the form of the
data into line with the specific requirements
of a particular data mining technique
4. Modeling
271
• Evaluate the one or more models delivered in
the modeling phase for quality and
effectiveness before deploying them for use in
the field
• Determine whether the model in fact achieves
the objectives set for it in the first phase
• Establish whether some important facet of the
business or research problem has not been
accounted for sufficiently
• Come to a decision regarding use of the data
mining results
5. Evaluation
272
• Make use of the models created:
• model creation does not signify the completion of a
project
• Example of a simple deployment:
• Generate a report
• Example of a more complex deployment:
• Implement a parallel data mining process in another
department
• For businesses, the customer often carries
out the deployment based on your model
6. Deployment
273
CRISP-DM: Detail Flow
274
CRISP-DM Case Study
Heating Oil Consumption – Correlational Methods
(Matthew North, Data Mining for the Masses 2nd Edition, 2016,
Chapter 4 Correlational Methods, pp. 69-76)
Dataset: HeatingOil.csv
275
CRISP-DM
276
• Problems:
• Sarah is a regional sales manager for a nationwide
supplier of fossil fuels for home heating
• Marketing performance is very poor and decreasing,
while marketing spending is increasing
• She feels a need to understand the types of behaviors and
other factors that may influence the demand for heating
oil in the domestic market
• She recognizes that there are many factors that influence
heating oil consumption, and believes that by
investigating the relationship between a number of those
factors, she will be able to better monitor and respond to
heating oil demand, and also help her to design marketing
strategy in the future
• Objective:
• To investigate the relationship between a number of
factors that influence heating oil consumption
1. Business Understanding
277
• In order to investigate her question, Sarah has enlisted our
help in creating a correlation matrix of six attributes
• Using her employer's data resources, which are primarily drawn
from the company's billing database, we create a data set
comprised of the following attributes:
1. Insulation: This is a density rating, ranging from one to ten,
indicating the thickness of each home’s insulation. A home
with a density rating of one is poorly insulated, while a home
with a density of ten has excellent insulation
2. Temperature: This is the average outdoor ambient
temperature at each home for the most recent year, measured
in degrees Fahrenheit
3. Heating_Oil: This is the total number of units of heating oil
purchased by the owner of each home in the most recent year
4. Num_Occupants: This is the total number of occupants living
in each home
5. Avg_Age: This is the average age of those occupants
6. Home_Size: This is a rating, on a scale of one to eight, of the
home’s overall size. The higher the number, the larger the
home
2. Data Understanding
278
Data set: HeatingOil.csv
3. Data Preparation
279
• Data set appears to be very clean with:
• No missing values in any of the six attributes
• No inconsistent data apparent in our ranges (Min-Max)
or other descriptive statistics
3. Data Preparation
280
4. Modeling
281
• The correlation matrix is produced as a table
• The higher the value (the deeper the purple
shading), the stronger the correlation
4. Modeling
282
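The same correlation matrix can be sketched with pandas. The tiny frame below is a synthetic stand-in for three of the HeatingOil.csv attributes (the real file has 1,218 rows and all six attributes); the values are illustrative only.

```python
import pandas as pd

# Synthetic stand-in for three HeatingOil.csv attributes
df = pd.DataFrame({
    "Avg_Age":     [55, 60, 30, 25, 70, 45],
    "Temperature": [40, 35, 70, 75, 30, 55],
    "Heating_Oil": [220, 250, 110, 90, 280, 180],
})

# Pearson correlation matrix, the same statistic RapidMiner's
# Correlation Matrix operator reports
corr = df.corr(method="pearson")
print(corr.round(3))

# Rank the other attributes by their correlation with Heating_Oil
print(corr["Heating_Oil"].drop("Heating_Oil").sort_values(ascending=False))
```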
5. Evaluation
283
Positive
Correlation
Negative
Correlation
• The attribute (factor) most significantly related to Heating Oil
consumption (positive relationship) is the Average Age of a home's
occupants
• The second most influential attribute is Temperature
(negative relationship)
• The third most influential attribute is Insulation
(positive relationship)
• The Home Size attribute has very little influence, while Num_Occupants
can be said to have no influence on heating oil consumption
5. Evaluation
284
• The chart shows that oil consumption has a positive correlation
with average age
• There are, however, some anomalies:
1. Some homes with a high average age have low oil
demand (light blue at the top of the left column)
2. Some homes with a low average age have high oil
demand (red at the bottom of the right column)
5. Evaluation
285
1. The chart shows the relationship between temperature and insulation, with color representing oil
consumption (the redder, the higher the oil demand)
2. In general, temperature's relationship with insulation and with oil consumption is negative: the
lower the temperature, the higher the oil demand (top of the left column), shown by the many
yellow and red points
3. Insulation is also negatively related to temperature, so the lower the temperature, the more
insulation is needed
4. A few anomalies appear at low insulation values, where some homes still require a lot of oil
5. Evaluation
286
1. The three-dimensional chart shows the relationship between
temperature, average age, and insulation
2. Color indicates oil demand: the redder, the higher the demand
3. The higher the temperature, the lower the oil demand (dark blue)
4. The higher the average age and insulation, the higher the oil demand
5. Evaluation
287
Dropping the Num_Occupants attribute
• While the number of people living in a home might
logically seem like a variable that would influence
energy usage, in our model it did not correlate in any
significant way with anything else
• Sometimes there are attributes that don’t turn out to
be very interesting
6. Deployment
288
Adding additional attributes to the data set
• It turned out that the number of occupants in the
home didn’t correlate much with other attributes,
but that doesn’t mean that other attributes would
be equally uninteresting
• For example, what if Sarah had access to the
number of furnaces and/or boilers in each home?
• Home_size was slightly correlated with Heating_Oil
usage, so perhaps the number of instruments that
consume heating oil in each home would tell an
interesting story, or at least add to her insight
6. Deployment
289
Investigating the role of home insulation
• The Insulation rating attribute was fairly strongly
correlated with a number of other attributes
• There may be some opportunity there to partner
with a company that specializes in adding insulation
to existing homes
6. Deployment
290
Focusing marketing efforts on cities with low
temperatures and a high average resident age
• The temperature attribute was fairly strongly negatively
correlated with heating oil consumption
• The average age attribute was the most strongly positively
correlated with heating oil consumption
6. Deployment
291
Adding greater granularity in the data set
• This data set has yielded some interesting results, but it’s
pretty general
• We have used average yearly temperatures and total
annual number of heating oil units in this model
• But we also know that temperatures fluctuate
throughout the year in most areas of the world, and thus
monthly, or even weekly measures would not only be
likely to show more detailed results of demand and usage
over time, but the correlations between attributes would
probably be more interesting
• From our model, Sarah now knows how certain attributes
interact with one another, but in the day-to-day business
of doing her job, she’ll probably want to know about
usage over time periods shorter than one year
6. Deployment
292
CRISP-DM Case Study
Heating Oil Consumption – Linear Regression
(Matthew North, Data Mining for the Masses 2nd Edition, 2016,
Chapter 8 Linear Regression, pp. 159-171)
Dataset: HeatingOil.csv
Dataset: HeatingOil-scoring.csv
http://romisatriawahono.net/lecture/dm/dataset/
293
CRISP-DM
294
CRISP-DM: Detail Flow
295
• Problems:
• Business is booming, her sales team is signing up thousands of
new clients, and she wants to be sure the company will be able
to meet this new level of demand
• Sarah’s new data mining objective is pretty clear: she wants to
anticipate demand for a consumable product
• We will use a linear regression model to help her with her
desired predictions. She has data, 1,218 observations that give
an attribute profile for each home, along with those homes’
annual heating oil consumption
• She wants to use this data set as training data to predict the
usage that 42,650 new clients will bring to her company
• She knows that these new clients’ homes are similar in nature
to her existing client base, so the existing customers’ usage
behavior should serve as a solid gauge for predicting future
usage by new customers
• Objective:
• to predict the usage that 42,650 new clients will bring to her
company
1. Business Understanding
296
• Sarah has assembled a separate comma-separated values file
containing all of these same attributes, for her 42,650 new
clients
• She has provided this data set to us to use as the scoring
data set in our model
• Data set comprised of the following attributes:
• Insulation: This is a density rating, ranging from one to ten,
indicating the thickness of each home’s insulation. A home with
a density rating of one is poorly insulated, while a home with a
density of ten has excellent insulation
• Temperature: This is the average outdoor ambient temperature
at each home for the most recent year, measured in degrees
Fahrenheit
• Heating_Oil: This is the total number of units of heating oil
purchased by the owner of each home in the most recent year
• Num_Occupants: This is the total number of occupants living in
each home
• Avg_Age: This is the average age of those occupants
• Home_Size: This is a rating, on a scale of one to eight, of the
home’s overall size. The higher the number, the larger the home
2. Data Understanding
297
• Filter Examples: attribute value filter or custom filter
• Avg_Age >= 15.1
• Avg_Age <= 72.2
• Deleted Records = 42,650 - 42,042 = 608
3. Data Preparation
298
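The same attribute-value filter can be sketched in pandas; the five-row frame is a toy stand-in for HeatingOil-scoring.csv (the real file has 42,650 rows), and the range bounds come from the slide above.

```python
import pandas as pd

# Toy stand-in for HeatingOil-scoring.csv
df = pd.DataFrame({"Avg_Age":   [10.0, 15.1, 40.2, 72.2, 80.5],
                   "Home_Size": [3, 4, 5, 2, 6]})

# Equivalent of RapidMiner's Filter Examples with an attribute value
# filter: keep only rows whose Avg_Age lies inside the trusted range
kept = df[(df["Avg_Age"] >= 15.1) & (df["Avg_Age"] <= 72.2)]
deleted = len(df) - len(kept)
print(f"kept {len(kept)} rows, deleted {deleted}")
```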
299
4. Modeling
300
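The train-then-score flow can be sketched with scikit-learn's linear regression. Synthetic data stands in for HeatingOil.csv and HeatingOil-scoring.csv; the three columns and their coefficients are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Synthetic training data: Insulation, Temperature, Avg_Age -> Heating_Oil
X_train = rng.uniform([1, 20, 15], [10, 80, 80], size=(200, 3))
y_train = 12 * X_train[:, 0] - 2 * X_train[:, 1] + 3 * X_train[:, 2] \
          + rng.normal(0, 5, 200)

model = LinearRegression().fit(X_train, y_train)

# "Scoring": predict consumption for a new client, as Apply Model does
x_new = np.array([[8.0, 30.0, 60.0]])  # well insulated, cold, older occupants
print(model.predict(x_new))
```

With these synthetic coefficients the true value for the new client is 12*8 - 2*30 + 3*60 = 216, so the prediction should land close to that.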
5. Evaluation – Regression Model
301
5. Evaluation – Prediction Results
302
6. Deployment
303
304
• Thanks to her earlier data mining work, Sarah has been promoted to
VP of Marketing, managing hundreds of marketers
• Sarah wants every marketer to be able to predict their own
potential customers independently. The problem: HeatingOil.csv
may only be accessed at VP level (by Sarah), and must not be
accessed by marketers directly
• Sarah wants each marketer to build a process that can estimate
the oil consumption of the clients they approach, using the model
Sarah produced earlier, without ever accessing the training data
(HeatingOil.csv)
• Assume that HeatingOil-Marketing.csv holds the prospective
customers successfully approached by one of her marketers
• What Sarah must do is build a process that:
1. Compares the algorithms to find the model with the highest accuracy
(LR, NN, SVM), using 10-fold X-Validation
2. Stores the best model in a file (Store operator)
• What each marketer must do is build a process that:
1. Reads the model produced by Sarah (Retrieve operator)
2. Applies it to the HeatingOil-Marketing.csv data they hold
• Let's help Sarah and the marketers build these two processes
Exercise
305
Algorithm Comparison Process (Sarah)
306
Data Scoring Process (Marketer)
307
CRISP-DM Case Study
Student Graduation at Universitas Suka Belajar
Dataset: datakelulusanmahasiswa.xls
308
CRISP-DM
309
• Problems:
• Budi is the Rector of Universitas Suka Belajar
• The university has a big problem, because the
on-time graduation rate of each cohort is very low
• Budi wants to understand and derive patterns from the
profiles of students who graduate on time and those
who do not
• With those patterns, Budi can offer counseling,
interventions, and early warnings to students at risk
of graduating late, so they can improve and
ultimately graduate on time
• Objective:
• Discover the patterns of students who graduate on time
and those who do not
1. Business Understanding
310
• To address the problem, Budi pulls data from his
university's academic information system
• The data are collected from student profiles and
semester GPA records, with the attributes below
1. NAMA: student name
2. JENIS KELAMIN: Laki-Laki (male) or Perempuan (female)
3. STATUS MAHASISWA: Mahasiswa (full-time student) or Bekerja (working)
4. UMUR: age
5. STATUS NIKAH: Menikah (married) or Belum Menikah (unmarried)
6. IPS 1: semester 1 GPA
7. IPS 2: semester 2 GPA
8. IPS 3: semester 3 GPA
9. IPS 4: semester 4 GPA
10. IPS 5: semester 5 GPA
11. IPS 6: semester 6 GPA
12. IPS 7: semester 7 GPA
13. IPS 8: semester 8 GPA
14. IPK: cumulative GPA
15. STATUS KELULUSAN: Terlambat (late) or Tepat (on time)
2. Data Understanding
311
Data set: datakelulusanmahasiswa.xls
3. Data Preparation
312
• There are 379 student records with 15 attributes
• There are 10 missing values, and no noisy
data
3. Data Preparation
313
• Missing values are resolved by filling in the
attribute mean
• The result is clean data with no missing values
3. Data Preparation
314
• Model the dataset with a Decision Tree
• The resulting patterns can take the form of a tree or of if-then rules
4. Modeling
315
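A minimal decision-tree sketch in Python: the eight-row frame is a toy stand-in for datakelulusanmahasiswa.xls (the real file has 379 rows and 15 attributes), and the simplified column names and codings (BEKERJA = 1 for a working student) are assumptions for illustration.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Toy stand-in for the graduation dataset
df = pd.DataFrame({
    "BEKERJA": [1, 1, 0, 0, 1, 0, 0, 1],
    "IPS2":    [2.4, 2.6, 3.4, 3.6, 2.2, 3.8, 3.1, 2.5],
    "LULUS_TEPAT": ["Tidak", "Tidak", "Ya", "Ya",
                    "Tidak", "Ya", "Ya", "Tidak"],
})

X, y = df[["BEKERJA", "IPS2"]], df["LULUS_TEPAT"]
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# The fitted tree can be printed as if-then rules
print(export_text(tree, feature_names=["BEKERJA", "IPS2"]))
print(tree.predict(pd.DataFrame({"BEKERJA": [0], "IPS2": [3.5]})))
```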
The patterns extracted from the data take the form of
a decision tree
4. Modeling
316
The patterns extracted from the data take the form of if-then rules
5. Evaluation
317
• The most influential attributes (factors) are
Status Mahasiswa, IPS2, IPS5, and IPS1
• The attributes (factors) with no influence are
Nama, Jenis Kelamin, Umur, IPS6, IPS7, IPS8
5. Evaluation
318
• Budi launches a discipline and mentoring program for students
in the early semesters (1-2) and in semester 5, because the factors
that most determine graduation lie in those two
semesters
• Budi issues a rule forbidding students to work part-time
during the early semesters, because it puts on-time
graduation at high risk
• Budi creates an on-campus part-time work program,
so that plenty of campus work gets handled intensively
while students gain work experience; most importantly,
students no longer abandon their studies for a job
• Budi embeds the resulting patterns and model into the
academic information system, updated every
semester. The system is made intelligent, so it
automatically emails each student a graduation-pattern
analysis matched to their profile
6. Deployment
319
• Understand and run experiments based on
all the case studies in the book Data
Mining for the Masses (Matthew North)
• Understand that the CRISP-DM methodology helps
us apply data mining methods in a way
that better fits the needs of the
organization
Exercise
320
• Analyze the problems and needs present in an organization
around you
• Collect and review the available datasets, and connect those
problems and needs to the available data
(analyze against the 5 roles of data mining)
• Where possible, pick several roles at once to process the
data, e.g., run association (factor analysis) together with
estimation or clustering
• Run the CRISP-DM process to solve the organization's
problem using the data obtained
• In the data preparation phase, perform data cleaning (replace missing
values, replace, filter attributes) so the data are ready for modeling
• Also run an algorithm comparison to choose the best algorithm
• Summarize everything as slides, following the example of the
Sarah case study, which used data mining to:
• Analyze related factors (correlation matrix)
• Estimate oil stock requirements (linear regression)
Assignment: Solving an Organizational Problem
321
CRISP-DM Case Study
Profiling Corruption Suspects
Dataset: KPK LHKPN data
322
1. Predicting the profile of corruption suspects
(Classification, Decision Tree)
2. Forecasting the number of mandatory wealth-report
filers in an institution or province
(Forecasting, Neural Network)
3. Predicting the recommendation of an LHKPN
examination
(Classification, Decision Tree)
Example Cases of Processing LHKPN Data
323
Predicting the Profile of Corruption Suspects
324
Profile Patterns of Corruption Suspects
325
Forecasting the Number of Mandatory Filers
326
Forecasting the Number of Mandatory Filers
327
Recommendations from LHKPN Examination Results
328
Patterns of LHKPN Examination Recommendations
329
3. Data Preparation
3.1 Data Cleaning
3.2 Data Reduction
3.3 Data Transformation and Data Discretization
3.4 Data Integration
330
CRISP-DM
331
Measures for data quality: A multidimensional view
• Accuracy: correct or wrong, accurate or not
• Completeness: not recorded, unavailable, …
• Consistency: some modified but some not, …
• Timeliness: timely update?
• Believability: how much can we trust that the data are correct?
• Interpretability: how easily the data can be
understood?
Why Preprocess the Data?
332
1. Data cleaning
• Fill in missing values
• Smooth noisy data
• Identify or remove outliers
• Resolve inconsistencies
2. Data reduction
• Dimensionality reduction
• Numerosity reduction
• Data compression
3. Data transformation and data discretization
• Normalization
• Concept hierarchy generation
4. Data integration
• Integration of multiple databases or files
Major Tasks in Data Preprocessing
333
Data preparation is more than half of every data
mining process
• Maxim of data mining: most of the effort in a data
mining project is spent in data acquisition and
preparation, and informal estimates vary from 50 to
80 percent
• The purpose of data preparation is:
1. To put the data into a form in which the data mining
question can be asked
2. To make it easier for the analytical techniques (such as
data mining algorithms) to answer it
Data Preparation Law (Data Mining Law 3)
334
3.1 Data Cleaning
335
Data in the Real World Is Dirty: lots of potentially
incorrect data, e.g., faulty instruments, human or computer
error, transmission errors
• Incomplete: lacking attribute values, lacking certain
attributes of interest, or containing only aggregate data
• e.g., Occupation=“ ” (missing data)
• Noisy: containing noise, errors, or outliers
• e.g., Salary=“−10” (an error)
• Inconsistent: containing discrepancies in codes or names
• e.g., Age=“42”, Birthday=“03/07/2010”
• Was rating “1, 2, 3”, now rating “A, B, C”
• Discrepancy between duplicate records
• Intentional (e.g., disguised missing data)
• Jan. 1 as everyone’s birthday?
Data Cleaning
336
• Data is not always available
• E.g., many tuples have no recorded value for several
attributes, such as customer income in sales data
• Missing data may be due to
• equipment malfunction
• inconsistent with other recorded data and thus deleted
• data not entered due to misunderstanding
• certain data may not be considered important at the
time of entry
• not register history or changes of the data
• Missing data may need to be inferred
Incomplete (Missing) Data
337
• Dataset: MissingDataSet.csv
Example: Missing Data
338
• Jerry is the marketing manager for a small Internet design
and advertising firm
• Jerry’s boss asks him to develop a data set containing
information about Internet users
• The company will use this data to determine what kinds of
people are using the Internet and how the firm may be able
to market their services to this group of users
• To accomplish his assignment, Jerry creates an online survey
and places links to the survey on several popular Web sites
• Within two weeks, Jerry has collected enough data to begin
analysis, but he finds that his data needs to be
denormalized
• He also notes that some observations in the set are missing
values or they appear to contain invalid values
• Jerry realizes that some additional work on the data needs
to take place before analysis begins.
MissingDataSet.csv
339
Relational Data
340
View of Data (Denormalized Data)
341
• Dataset: MissingDataSet.csv
Example: Missing Data
342
• Ignore the tuple:
• Usually done when class label is missing (when doing
classification)—not effective when the % of missing values
per attribute varies considerably
• Fill in the missing value manually:
• Tedious + infeasible?
• Fill it in automatically with
• A global constant: e.g., “unknown”, a new class?!
• The attribute mean
• The attribute mean for all samples belonging to the same
class: smarter
• The most probable value: inference-based such as
Bayesian formula or decision tree
How to Handle Missing Data?
343
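The strategies above can be sketched with pandas; the four-row frame is a toy stand-in for MissingDataSet.csv, with illustrative column names.

```python
import numpy as np
import pandas as pd

# Toy stand-in for MissingDataSet.csv
df = pd.DataFrame({
    "Gender": ["M", None, "F", "M"],
    "Age":    [34.0, np.nan, 28.0, 46.0],
})

# Strategy 1: ignore the tuple (drop rows with any missing value)
dropped = df.dropna()

# Strategy 2: fill numeric gaps with the attribute mean, and
# nominal gaps with a global constant
filled = df.copy()
filled["Age"] = filled["Age"].fillna(filled["Age"].mean())
filled["Gender"] = filled["Gender"].fillna("unknown")
print(filled)
```

The class-conditional mean and the "most probable value" strategies follow the same pattern, just grouping by the class label before filling.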
• Lakukan eksperimen mengikuti buku
Matthew North, Data Mining for the Masses
2nd Edition, 2016, Chapter 3 Data
Preparation
1. Handling Missing Data, pp. 34-48 (replace)
2. Data Reduction, pp. 48-51 (delete/filter)
• Dataset: MissingDataSet.csv
• Analisis metode preprocessing apa saja yang
digunakan dan mengapa perlu dilakukan
pada dataset tersebut?
Latihan
344
Missing Value Detection
345
Missing Value Replace
346
Missing Value Filtering
347
• Noise: random error or variance in a measured
variable
• Incorrect attribute values may be due to
• Faulty data collection instruments
• Data entry problems
• Data transmission problems
• Technology limitation
• Inconsistency in naming convention
• Other data problems which require data cleaning
• Duplicate records
• Incomplete data
• Inconsistent data
Noisy Data
348
• Binning
• First sort data and partition into (equal-frequency) bins
• Then one can smooth by bin means, smooth by bin
median, smooth by bin boundaries, etc.
• Regression
• Smooth by fitting the data into regression functions
• Clustering
• Detect and remove outliers
• Combined computer and human inspection
• Detect suspicious values and check by human (e.g., deal
with possible outliers)
How to Handle Noisy Data?
349
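A sketch of equal-frequency binning with smoothing by bin means; the nine sorted prices are the classic textbook example, not from any course dataset.

```python
import numpy as np
import pandas as pd

prices = pd.Series([4, 8, 15, 21, 21, 24, 25, 28, 34])  # already sorted

# Partition into 3 equal-frequency bins, then smooth by bin means;
# smoothing by bin medians or boundaries works the same way
bins = np.array_split(prices, 3)
smoothed = pd.concat(
    [pd.Series([b.mean()] * len(b), index=b.index) for b in bins])
print(smoothed.tolist())
```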
• Data discrepancy detection
• Use metadata (e.g., domain, range, dependency, distribution)
• Check field overloading
• Check uniqueness rule, consecutive rule and null rule
• Use commercial tools
• Data scrubbing: use simple domain knowledge (e.g., postal code,
spell-check) to detect errors and make corrections
• Data auditing: by analyzing data to discover rules and relationship
to detect violators (e.g., correlation and clustering to find outliers)
• Data migration and integration
• Data migration tools: allow transformations to be specified
• ETL (Extraction/Transformation/Loading) tools: allow users to
specify transformations through a graphical user interface
• Integration of the two processes
• Iterative and interactive (e.g., Potter's Wheel)
Data Cleaning as a Process
350
• Carry out the experiment following the book
Matthew North, Data Mining for the Masses
2nd Edition, 2016, Chapter 3 Data
Preparation, pp. 52-54 (Handling
Inconsistent Data)
• Dataset: MissingDataSet.csv
• Analyze which preprocessing methods are
used and why they need to be applied
to this dataset!
Exercise
351
352
353
Regex Settings
Testing the Regex
• Import the MissingDataValue-Noisy.csv data
• Use a regular expression (Replace operator)
to change all noisy data in the nominal
attributes to "N"
Exercise
354
1. Import the MissingDataValue-Noisy-Multiple.csv data
2. Use the Replace Missing Value operator to fill in the empty data
3. Use a regular expression (Replace operator) to change all noisy
data in the nominal attributes to "N"
4. Use the Map operator to change every entry of Face, FB, and Fesbuk
to Facebook
Exercise
355
356
357
1. Import the MissingDataValue-Noisy-Multiple.csv data
2. Replace Missing Value operator to fill in the empty
data
3. Replace operator to change all noisy data
in the nominal attributes to "N"
4. Map operator to change every entry of Face, FB,
and Fesbuk to Facebook
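The four RapidMiner steps above map naturally onto pandas operations. The small frame below is a toy stand-in for MissingDataValue-Noisy-Multiple.csv; the column names are assumptions for illustration.

```python
import pandas as pd

# Toy stand-in for MissingDataValue-Noisy-Multiple.csv
df = pd.DataFrame({
    "Online_Gaming": ["Y", "N", "99", None, "Y"],
    "Social_Media":  ["Face", "FB", "Fesbuk", "Twitter", "Facebook"],
})

# Steps 1-2 (Replace Missing Value): fill empty nominal cells
# with the attribute mode
df["Online_Gaming"] = df["Online_Gaming"].fillna(df["Online_Gaming"].mode()[0])

# Step 3 (Replace with a regular expression): any value that is
# not exactly "Y" becomes "N" (digits and other noise included)
df["Online_Gaming"] = df["Online_Gaming"].str.replace(
    r"^(?!Y$).*$", "N", regex=True)

# Step 4 (Map): unify the spellings of Facebook
df["Social_Media"] = df["Social_Media"].replace(
    {"Face": "Facebook", "FB": "Facebook", "Fesbuk": "Facebook"})
print(df)
```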
CRISP-DM Case Study
Sport Skill – Discriminant Analysis
(Matthew North, Data Mining for the Masses 2nd Edition, 2016,
Chapter 7 Discriminant Analysis, pp. 123-143)
Dataset: SportSkill-Training.csv
Dataset: SportSkill-Scoring.csv
358
• Motivation:
• Gill runs a sports academy designed to help high school aged
athletes achieve their maximum athletic potential. He focuses
on four major sports: Football, Basketball, Baseball and
Hockey
• He has found that while many high school athletes enjoy
participating in a number of sports in high school, as they
begin to consider playing a sport at the college level, they
would prefer to specialize in one sport
• As he’s worked with athletes over the years, Gill has
developed an extensive data set, and he now is wondering if
he can use past performance from some of his previous
clients to predict prime sports for up-and-coming high school
athletes
• By evaluating each athlete's performance across a battery of
tests, Gill hopes we can help him figure out which sport
each athlete has the highest aptitude for
• Objective:
• Ultimately, he hopes he can make a recommendation to each
athlete as to the sport in which they should most likely
choose to specialize
1. Business Understanding
359
• Every athlete that has enrolled at Gill's
academy over the past several years has
taken a battery of tests, which measured a
number of athletic and personal traits
• Because the academy has been operating for
some time, Gill has the benefit of knowing
which of his former pupils have gone on to
specialize in a single sport, and which sport it
was for each of them
2. Data Understanding
360
• Working with Gill, we gather the results of the
batteries for all former clients who have gone on to
specialize
• Gill adds the sport each person specialized in, and we
have a data set comprised of 493 observations
containing the following attributes:
1. Age: ....
2. Strength: ....
3. Quickness: ....
4. Injury: ....
5. Vision: ....
6. Endurance: ....
7. Agility: ....
8. Decision Making: ....
9. Prime Sport: ....
2. Data Understanding
361
• Filter Examples: attribute value filter
• Decision_Making>=3
• Decision_Making<=100
• Deleted Records= 493-482=11
3. Data Preparation
362
1. Train on the SportSkill-Training.csv data
using C4.5, NB, k-NN, and LDA
2. Test using 10-fold X-Validation
3. Run a t-Test to determine the best
model
4. Store the best model from the comparison
above with the Write Model operator, and
then Apply Model on the
SportSkill-Scoring.csv dataset
Exercise
363
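The comparison-plus-t-Test step can be sketched in Python with scipy's paired t-test over matched cross-validation folds. Iris stands in for SportSkill-Training.csv, and only two of the four candidate models are shown; the same pattern extends to all pairs.

```python
from scipy import stats
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)  # stand-in for SportSkill-Training.csv
cv = KFold(n_splits=10, shuffle=True, random_state=0)

# Accuracy per fold for two of the four candidate models,
# evaluated on the same folds
acc_dt = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=cv)
acc_nb = cross_val_score(GaussianNB(), X, y, cv=cv)

# Paired t-test over the 10 matched folds, as in RapidMiner's T-Test
t, p = stats.ttest_rel(acc_dt, acc_nb)
print(acc_dt.mean(), acc_nb.mean(), p)
# p >= 0.05 would suggest no significant accuracy difference
```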
364
(Pairwise t-Test comparison grid for DT, NB, k-NN, and LDA)
365
366
3.2 Data Reduction
367
• Data Reduction
• Obtain a reduced representation of the data set that is much smaller in
volume but yet produces the same analytical results
• Why Data Reduction?
• A database/data warehouse may store terabytes of data
• Complex data analysis takes a very long time to run on the complete dataset
• Data Reduction Methods
1. Dimensionality Reduction
1. Feature Extraction
2. Feature Selection
1. Filter Approach
2. Wrapper Approach
3. Embedded Approach
2. Numerosity Reduction (Data Reduction)
• Regression and Log-Linear Models
• Histograms, clustering, sampling
Data Reduction Methods
368
• Curse of dimensionality
• When dimensionality increases, data becomes increasingly sparse
• Density and distance between points, which are critical to
clustering and outlier analysis, become less meaningful
• The possible combinations of subspaces will grow exponentially
• Dimensionality reduction
• Avoid the curse of dimensionality
• Help eliminate irrelevant features and reduce noise
• Reduce time and space required in data mining
• Allow easier visualization
• Dimensionality Reduction Methods:
1. Feature Extraction: Wavelet transforms, Principal Component
Analysis (PCA)
2. Feature Selection: Filter, Wrapper, Embedded
1. Dimensionality Reduction
369
• Given N data vectors from n-dimensions, find k ≤ n
orthogonal vectors (principal components) that can be
best used to represent data
1. Normalize input data: Each attribute falls within the same range
2. Compute k orthonormal (unit) vectors, i.e., principal components
3. Each input data (vector) is a linear combination of the k principal
component vectors
4. The principal components are sorted in order of decreasing
“significance” or strength
5. Since the components are sorted, the size of the data can be
reduced by eliminating the weak components, i.e., those with low
variance
• Works for numeric data only
Principal Component Analysis (Steps)
370
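The five steps above can be sketched with NumPy’s SVD (a minimal illustration, not the RapidMiner operator; the toy matrix `X` is invented for the example):

```python
import numpy as np

# Toy data: 6 observations, 3 numeric attributes (made-up values)
X = np.array([[2.5, 2.4, 0.5],
              [0.5, 0.7, 1.1],
              [2.2, 2.9, 0.4],
              [1.9, 2.2, 0.6],
              [3.1, 3.0, 0.2],
              [2.3, 2.7, 0.5]])

# Step 1: normalize — here by centering each attribute
Xc = X - X.mean(axis=0)

# Step 2: orthonormal principal components via SVD of the centered data
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

# Step 4: components come out sorted by decreasing variance (singular value)
explained = s**2 / np.sum(s**2)

# Step 5: keep only the k strongest components (here k = 2);
# each reduced row is a linear combination of the PCs (step 3)
k = 2
X_reduced = Xc @ Vt[:k].T

print(explained)            # variance share per component, descending
print(X_reduced.shape)      # (6, 2)
```

Choosing k by cumulative explained variance (e.g., keep components until 95% is covered) is the usual way to drop the weak, low-variance components.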
• Follow the experiment in Markus
Hofmann’s book (RapidMiner – Data Mining
Use Cases) Chapter 4 (k-Nearest Neighbor
Classification II) pp. 45-51
• Dataset: glass.data
• Analyze which preprocessing methods are
used and why they are needed for this
dataset!
• Compare the accuracy of k-NN and PCA+k-NN
Exercise
371
372
373
Initial Data Before PCA
374
Data After PCA
375
• Rebuild the process so that it compares
the models produced by k-NN and PCA +
k-NN
• Use 10-fold cross-validation
Exercise
376
377
• Review which operators can
be used for feature extraction
• Replace PCA with another
feature extraction
method
• Compare them and
determine which feature
extraction method is
best for Glass.data,
using 10-fold cross-validation
Exercise
378
• Another way to reduce dimensionality of data
• Redundant attributes
• Duplicate much or all of the information contained
in one or more other attributes
• E.g., purchase price of a product and the amount of
sales tax paid
• Irrelevant attributes
• Contain no information that is useful for the data
mining task at hand
• E.g., students' ID is often irrelevant to the task of
predicting students' GPA
Feature/Attribute Selection
379
A number of proposed approaches for feature
selection can broadly be categorized into the
following three classifications: wrapper, filter, and
embedded (Liu & Tu, 2004)
1. In the filter approach, statistical analysis of the
feature set is required, without utilizing any learning
model (Dash & Liu, 1997)
2. In the wrapper approach, a predetermined learning
model is assumed, wherein features are selected that
justify the learning performance of the particular
learning model (Guyon & Elisseeff, 2003)
3. The embedded approach attempts to utilize the
complementary strengths of the wrapper and filter
approaches (Huang, Cai, & Xu, 2007)
Feature Selection Approach
380
Wrapper Approach vs Filter Approach
381
Wrapper Approach Filter Approach
1. Filter Approach:
• information gain
• chi square
• log-likelihood ratio
• etc
2. Wrapper Approach:
• forward selection
• backward elimination
• randomized hill climbing
• etc
3. Embedded Approach:
• decision tree
• weighted naïve bayes
• etc
Feature Selection Approach
382
• Follow the experiment in Markus
Hofmann’s book (RapidMiner –
Data Mining Use Cases)
Chapter 4 (k-Nearest Neighbor
Classification II)
• Replace PCA with a feature
selection (filter) method, for
example:
• Information Gain
• Chi Squared
• etc
• Check in RapidMiner which
operators can be used to
reduce or weight the
attributes of a dataset!
Exercise
383
384
385
• Follow the experiment in Markus
Hofmann’s book (RapidMiner – Data Mining Use Cases)
Chapter 4 (k-Nearest Neighbor Classification II)
• Replace PCA with a feature selection
(wrapper) method, for example:
• Backward Elimination
• Forward Selection
• etc
• Replace the validation method with 10-fold
cross-validation
• Compare the accuracy of k-NN with BE+k-NN and
FS+k-NN
Exercise
386
387
388
389
Feature Selection
(Wrapper)
Feature Selection
(Filter)
Feature
Extraction
390
Feature Selection
(Wrapper)
Feature Selection
(Filter)
Feature
Extraction
Comparison of Accuracy and t-Test Significance
391
k-NN k-NN+PCA k-NN+ICA k-NN+IG k-NN+IGR kNN + FS k-NN+BE
Accuracy
AUC
1. Train on the student dataset
(datakelulusanmahasiswa.xls) using 3
classification algorithms (DT, NB, k-NN)
2. Analyze and compare which classification algorithm
produces the most accurate model (AK)
3. Apply feature selection with Information Gain (filter),
Forward Selection and Backward Elimination (wrapper) to
the most accurate model
4. Analyze and compare which feature selection algorithm
produces the most accurate model
5. Evaluate with 10-fold cross-validation
Exercise: Predicting Student Graduation
392
AK AK+IG AK+FS AK+BE
Accuracy 91.55 92.10 91.82
AUC 0.909 0.920 0.917
1. Train on the student dataset
(datakelulusanmahasiswa.xls) using the DT
classification algorithm
2. Apply feature selection with Forward Selection to
the DT algorithm (DT+FS)
3. Apply feature selection with Backward Elimination
to the DT algorithm (DT+BE)
4. Evaluate with 10-fold cross-validation
5. Run a t-Test to find the best model
(DT vs DT+FS vs DT+BE)
Exercise: Predicting Student Graduation
393
DT DT+FS DT+BE
Accuracy 91.55 92.10 91.82
AUC 0.909 0.920 0.917
394
DT DT+FS DT+BE
Accuracy 91.55 92.10 91.82
AUC 0.909 0.920 0.917
no significant difference
1. Compare algorithms on the election dataset
(datapemilukpu.xls) to obtain the best algorithm
2. Take the best algorithm from step 1, then apply
feature selection with Forward Selection and Backward
Elimination
3. Determine which combination of algorithm and feature
selection performs best
4. Evaluate with 10-fold cross-validation
5. Run a t-Test to find the best model
Exercise: Predicting Election Electability
395
A A + FS A + BE
Accuracy
AUC
DT NB K-NN
Accuracy
AUC
1. Train on the student dataset
(datakelulusanmahasiswa.xls) using
DT, NB, K-NN
2. Apply dimension reduction with Forward
Selection to the three algorithms above
3. Evaluate with 10-fold cross-validation
4. Run a t-Test to find the best
model
Exercise: Predicting Student Graduation
396
DT NB K-NN DT+FS NB+FS K-NN+FS
Accuracy
AUC
There is No Free Lunch for the Data Miner (NFL-DM)
The right model for a given application can only be discovered by
experiment
• Axiom of machine learning: if we knew enough about a problem
space, we could choose or design an algorithm to find optimal
solutions in that problem space with maximal efficiency
• Arguments for the superiority of one algorithm over others in data
mining rest on the idea that data mining problem spaces have one
particular set of properties, or that these properties can be
discovered by analysis and built into the algorithm
• However, these views arise from the erroneous idea that, in data
mining, the data miner formulates the problem and the algorithm
finds the solution
• In fact, the data miner both formulates the problem and finds the
solution – the algorithm is merely a tool which the data miner uses
to assist with certain steps in this process
No Free Lunch Theory (Data Mining Law 4)
397
Reduce data volume by choosing alternative, smaller forms of
data representation
1. Parametric methods (e.g., regression)
• Assume the data fits some model, estimate model
parameters, store only the parameters, and discard
the data (except possible outliers)
• Ex.: Log-linear models—obtain the value at a point in m-D
space as the product of the appropriate marginal
subspaces
2. Non-parametric methods
• Do not assume models
• Major families: histograms, clustering, sampling, …
2. Numerosity Reduction
398
Numerosity Reduction
399
• Linear regression
• Data modeled to fit a straight line
• Often uses the least-square method to fit the
line
• Multiple regression
• Allows a response variable Y to be modeled as a
linear function of multidimensional feature
vector
• Log-linear model
• Approximates discrete multidimensional
probability distributions
Parametric Data Reduction: Regression and
Log-Linear Models
400
• Regression analysis: A collective name for
techniques for the modeling and analysis of
numerical data consisting of values of a
dependent variable (also called response
variable or measurement) and of one or more
independent variables (aka. explanatory
variables or predictors)
• The parameters are estimated so as to give a
"best fit" of the data
• Most commonly the best fit is evaluated by
using the least squares method, but other
criteria have also been used
• Used for prediction (including forecasting of
time-series data), inference, hypothesis
testing, and modeling of causal relationships
Regression Analysis
401
[Plot: data points such as (X1, Y1) fitted by the regression line y = x + 1; Y1’ is the value the line predicts for X1]
• Linear regression: Y = w X + b
• Two regression coefficients, w and b, specify the line and are to be
estimated by using the data at hand
• Using the least squares criterion to the known values of Y1, Y2, …, X1,
X2, ….
• Multiple regression: Y = b0 + b1 X1 + b2 X2
• Many nonlinear functions can be transformed into the above
• Log-linear models:
• Approximate discrete multidimensional probability distributions
• Estimate the probability of each point (tuple) in a multi-dimensional
space for a set of discretized attributes, based on a smaller subset of
dimensional combinations
• Useful for dimensionality reduction and data smoothing
Regression Analysis and Log-Linear Models
402
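The least-squares fit of Y = wX + b can be sketched in a few lines of Python (the five data points are invented to lie near y = x + 1):

```python
# Closed-form least-squares fit of Y = w*X + b (simple linear regression)
X = [1, 2, 3, 4, 5]
Y = [2.1, 2.9, 4.2, 4.8, 6.1]   # made-up points near y = x + 1

n = len(X)
mx, my = sum(X) / n, sum(Y) / n

# w = sum((x - mx)(y - my)) / sum((x - mx)^2), b = my - w*mx
w = sum((x - mx) * (y - my) for x, y in zip(X, Y)) / sum((x - mx) ** 2 for x in X)
b = my - w * mx

print(round(w, 3), round(b, 3))   # 0.99 1.05
```

The two estimated coefficients w and b are all that needs to be stored; the original data points can then be discarded, which is exactly the parametric numerosity reduction described above.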
• Divide data into buckets and
store average (sum) for each
bucket
• Partitioning rules:
• Equal-width: equal bucket
range
• Equal-frequency (or equal-
depth)
Histogram Analysis
403
[Histogram: price buckets from 10,000 to 90,000 on the x-axis, counts from 0 to 40 on the y-axis]
• Partition data set into clusters based on
similarity, and store cluster representation (e.g.,
centroid and diameter) only
• Can be very effective if data is clustered but not
if data is “smeared”
• Can have hierarchical clustering and be stored in
multi-dimensional index tree structures
• There are many choices of clustering definitions
and clustering algorithms
Clustering
404
• Sampling: obtaining a small sample s to represent
the whole data set N
• Allow a mining algorithm to run in complexity that
is potentially sub-linear to the size of the data
• Key principle: Choose a representative subset of
the data
• Simple random sampling may have very poor performance in the
presence of skew
• Develop adaptive sampling methods, e.g., stratified sampling
• Note: Sampling may not reduce database I/Os
(page at a time)
Sampling
405
• Simple random sampling
• There is an equal probability of selecting any particular
item
• Sampling without replacement
• Once an object is selected, it is removed from the
population
• Sampling with replacement
• A selected object is not removed from the population
• Stratified sampling
• Partition the data set, and draw samples from each
partition (proportionally, i.e., approximately the same
percentage of the data)
• Used in conjunction with skewed data
Types of Sampling
406
Sampling: With or without Replacement
407
Raw Data
Sampling: Cluster or Stratified Sampling
408
Raw Data Cluster/Stratified Sample
• Stratification is the process of dividing members of the
population into homogeneous subgroups before sampling
• Suppose that in a company there are the following staff:
• Male, full-time: 90
• Male, part-time: 18
• Female, full-time: 9
• Female, part-time: 63
• Total: 180
• We are asked to take a sample of 40 staff, stratified
according to the above categories
• An easy way to calculate the percentage is to multiply each
group size by the sample size and divide by the total
population:
• Male, full-time = 90 × (40 ÷ 180) = 20
• Male, part-time = 18 × (40 ÷ 180) = 4
• Female, full-time = 9 × (40 ÷ 180) = 2
• Female, part-time = 63 × (40 ÷ 180) = 14
Stratified Sampling
409
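The proportional allocation above can be computed directly (a small sketch using the staff counts from the example):

```python
# Proportional stratified sample sizes: group_size * sample_size / population
strata = {"male_full_time": 90, "male_part_time": 18,
          "female_full_time": 9, "female_part_time": 63}
sample_size = 40
population = sum(strata.values())          # 180

allocation = {name: size * sample_size // population
              for name, size in strata.items()}
print(allocation)
# {'male_full_time': 20, 'male_part_time': 4, 'female_full_time': 2, 'female_part_time': 14}
```

In general the products do not divide evenly, so a rounding rule (e.g., largest remainder) is needed to make the allocations sum exactly to the sample size; here they happen to divide exactly.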
• Follow the experiment in Matthew
North’s book, Data Mining for the Masses
2nd Edition, 2016, Chapter 7 Discriminant
Analysis, pp. 125-143
• Datasets:
• SportSkill-Training.csv
• SportSkill-Scoring.csv
• Analyze which preprocessing methods are
used and why they are needed for
these datasets!
Exercise
410
3.3 Data Transformation and Data
Discretization
411
• A function that maps the entire set of values of a given
attribute to a new set of replacement values
• Each old value can be identified with one of the new values
• Methods:
• Smoothing: Remove noise from data
• Attribute/feature construction
• New attributes constructed from the given ones
• Aggregation: Summarization, data cube construction
• Normalization: Scaled to fall within a smaller, specified range
• min-max normalization
• z-score normalization
• normalization by decimal scaling
• Discretization: Concept hierarchy climbing
Data Transformation
412
• Min-max normalization: to [new_minA, new_maxA]
v’ = ((v − minA) / (maxA − minA)) × (new_maxA − new_minA) + new_minA
• Ex. Let income range $12,000 to $98,000 be normalized to [0.0, 1.0].
Then $73,600 is mapped to
((73,600 − 12,000) / (98,000 − 12,000)) × (1.0 − 0) + 0 = 0.716
• Z-score normalization (μ: mean, σ: standard deviation):
v’ = (v − μA) / σA
• Ex. Let μ = 54,000, σ = 16,000. Then (73,600 − 54,000) / 16,000 = 1.225
• Normalization by decimal scaling:
v’ = v / 10^j, where j is the smallest integer such that Max(|v’|) < 1
Normalization
413
• Three types of attributes
• Nominal —values from an unordered set, e.g., color,
profession
• Ordinal —values from an ordered set, e.g., military or
academic rank
• Numeric —real numbers, e.g., integer or real numbers
• Discretization: Divide the range of a continuous
attribute into intervals
• Interval labels can then be used to replace actual data values
• Reduce data size by discretization
• Supervised vs. unsupervised
• Split (top-down) vs. merge (bottom-up)
• Discretization can be performed recursively on an attribute
• Prepare for further analysis, e.g., classification
Discretization
414
Typical methods: All the methods can be
applied recursively
• Binning: Top-down split, unsupervised
• Histogram analysis: Top-down split, unsupervised
• Clustering analysis: Unsupervised, top-down split
or bottom-up merge
• Decision-tree analysis: Supervised, top-down
split
• Correlation (e.g., χ²) analysis: Unsupervised,
bottom-up merge
Data Discretization Methods
415
• Equal-width (distance) partitioning
• Divides the range into N intervals of equal size: uniform
grid
• if A and B are the lowest and highest values of the
attribute, the width of intervals will be: W = (B − A)/N
• The most straightforward, but outliers may dominate
presentation
• Skewed data is not handled well
• Equal-depth (frequency) partitioning
• Divides the range into N intervals, each containing
approximately same number of samples
• Good data scaling
• Managing categorical attributes can be tricky
Simple Discretization: Binning
416
Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24,
25, 26, 28, 29, 34
• Partition into equal-frequency (equi-depth) bins:
• Bin 1: 4, 8, 9, 15
• Bin 2: 21, 21, 24, 25
• Bin 3: 26, 28, 29, 34
• Smoothing by bin means:
• Bin 1: 9, 9, 9, 9
• Bin 2: 23, 23, 23, 23
• Bin 3: 29, 29, 29, 29
• Smoothing by bin boundaries:
• Bin 1: 4, 4, 4, 15
• Bin 2: 21, 21, 25, 25
• Bin 3: 26, 26, 26, 34
Binning Methods for Data Smoothing
417
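The bin partitioning and both smoothing rules can be reproduced on the price data (a minimal sketch; the equal-depth split assumes the data is already sorted):

```python
# Equal-frequency (equi-depth) binning with smoothing, on the slide's price data
prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
n_bins = 3
size = len(prices) // n_bins
bins = [prices[i * size:(i + 1) * size] for i in range(n_bins)]

# Smoothing by bin means: every value becomes its bin's (rounded) mean
by_means = [[round(sum(b) / len(b))] * len(b) for b in bins]

# Smoothing by bin boundaries: every value snaps to the closer boundary
by_boundaries = [[min(b) if x - min(b) <= max(b) - x else max(b) for x in b]
                 for b in bins]

print(bins)            # [[4, 8, 9, 15], [21, 21, 24, 25], [26, 28, 29, 34]]
print(by_means)        # [[9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]]
print(by_boundaries)   # [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]
```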
Discretization Without Using Class Labels
(Binning vs. Clustering)
418
[Figure: the same data discretized three ways — equal interval width (binning), equal frequency (binning), and K-means clustering; clustering leads to better results]
• Classification (e.g., decision tree analysis)
• Supervised: Given class labels, e.g., cancerous vs. benign
• Using entropy to determine split point (discretization point)
• Top-down, recursive split
• Correlation analysis (e.g., Chi-merge: χ2-based
discretization)
• Supervised: use class information
• Bottom-up merge: find the best neighboring intervals (those
having similar distributions of classes, i.e., low χ2 values) to
merge
• Merge performed recursively, until a predefined stopping
condition
Discretization by Classification & Correlation
Analysis
419
• Follow the experiment in Markus
Hofmann’s book (RapidMiner – Data Mining Use Cases)
Chapter 5 (Naïve Bayes Classification I)
• Dataset: crx.data
• Analyze which preprocessing methods are
used and why they are needed for this
dataset!
• Compare the model’s accuracy when the filter
and discretization are not used
• Also compare when feature
selection (wrapper) with Backward
Elimination is used
Exercise
420
421
Results
422
NB NB+
Filter
NB+
Discretization
NB+
Filter+
Discretization
NB+
Filter+
Discretization +
Backward Elimination
Accuracy 85.79 86.26
AUC
3.4 Data Integration
423
• Data integration:
• Combines data from multiple sources into a coherent store
• Schema Integration: e.g., A.cust-id ≡ B.cust-#
• Integrate metadata from different sources
• Entity Identification Problem:
• Identify real world entities from multiple data sources,
e.g., Bill Clinton = William Clinton
• Detecting and Resolving Data Value Conflicts
• For the same real world entity, attribute values from
different sources are different
• Possible reasons: different representations, different
scales, e.g., metric vs. British units
Data Integration
424
• Redundant data occur often when integration of
multiple databases
• Object identification: The same attribute or object may
have different names in different databases
• Derivable data: One attribute may be a “derived”
attribute in another table, e.g., annual revenue
• Redundant attributes may be able to be detected
by correlation analysis and covariance analysis
• Careful integration of the data from multiple
sources may help reduce/avoid redundancies and
inconsistencies and improve mining speed and
quality
Handling Redundancy in Data Integration
425
• Χ2 (chi-square) test
• The larger the Χ2 value, the more likely the variables are
related
• The cells that contribute the most to the Χ2 value are
those whose actual count is very different from the
expected count
• Correlation does not imply causality
• # of hospitals and # of car-theft in a city are correlated
• Both are causally linked to the third variable: population
Correlation Analysis (Nominal Data)
426
χ² = Σ (Observed − Expected)² / Expected
• Χ2 (chi-square) calculation (numbers in parenthesis
are expected counts calculated based on the data
distribution in the two categories)
• It shows that like_science_fiction and play_chess
are correlated in the group
Chi-Square Calculation: An Example
427
Play chess Not play chess Sum (row)
Like science fiction 250(90) 200(360) 450
Not like science fiction 50(210) 1000(840) 1050
Sum(col.) 300 1200 1500
χ² = (250 − 90)²/90 + (50 − 210)²/210 + (200 − 360)²/360 + (1000 − 840)²/840 = 507.93
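The expected counts and the χ² statistic can be recomputed from the contingency table (a minimal sketch):

```python
# Chi-square for the like_science_fiction x play_chess table from the slide
observed = [[250, 200],    # like sci-fi:   plays chess / doesn't
            [50, 1000]]    # doesn't like:  plays chess / doesn't
row = [sum(r) for r in observed]            # [450, 1050]
col = [sum(c) for c in zip(*observed)]      # [300, 1200]
total = sum(row)                            # 1500

# Expected count for each cell: row_total * col_total / grand_total
expected = [[row[i] * col[j] / total for j in range(2)] for i in range(2)]

chi2 = sum((observed[i][j] - expected[i][j]) ** 2 / expected[i][j]
           for i in range(2) for j in range(2))
print(expected)          # [[90.0, 360.0], [210.0, 840.0]]
print(round(chi2, 2))    # 507.94 (the slide's 507.93 rounds each term first)
```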
• Correlation coefficient (also called Pearson’s product
moment coefficient)
where n is the number of tuples, Ā and B̄ are the respective
means of A and B, σA and σB are the respective standard deviations of
A and B, and Σ(aibi) is the sum of the AB cross-product
• If rA,B > 0, A and B are positively correlated (A’s values
increase as B’s). The higher, the stronger correlation
• rA,B = 0: independent; rAB < 0: negatively correlated
Correlation Analysis (Numeric Data)
428
r(A,B) = Σᵢ (aᵢ − Ā)(bᵢ − B̄) / ((n − 1) σA σB) = (Σᵢ aᵢbᵢ − n·Ā·B̄) / ((n − 1) σA σB)
Visually Evaluating Correlation
429
Scatter plots
showing the
similarity
from –1 to 1
• Correlation measures the linear relationship
between objects
• To compute correlation, we standardize data
objects, A and B, and then take their dot product
Correlation
430
a’ₖ = (aₖ − mean(A)) / std(A)
b’ₖ = (bₖ − mean(B)) / std(B)
correlation(A, B) = A’ · B’
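Standardize-then-dot-product can be sketched as follows (using the stock values A and B from the covariance example on the next slide):

```python
import math

def correlation(A, B):
    # Standardize each object (population std), then take the mean dot product
    n = len(A)
    ma, mb = sum(A) / n, sum(B) / n
    sa = math.sqrt(sum((a - ma) ** 2 for a in A) / n)
    sb = math.sqrt(sum((b - mb) ** 2 for b in B) / n)
    Ap = [(a - ma) / sa for a in A]
    Bp = [(b - mb) / sb for b in B]
    return sum(x * y for x, y in zip(Ap, Bp)) / n

A = [2, 3, 5, 4, 6]
B = [5, 8, 10, 11, 14]
print(round(correlation(A, B), 3))   # 0.941 -> strong positive correlation
```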
• Covariance is similar to correlation
where n is the number of tuples, Ā and B̄ are the respective means (expected
values) of A and B, and σA and σB are the respective standard
deviations of A and B
• Positive covariance: If CovA,B > 0, then A and B both tend to be larger than
their expected values
• Negative covariance: If CovA,B < 0 then if A is larger than its expected value, B is
likely to be smaller than its expected value
• Independence: CovA,B = 0 but the converse is not true:
• Some pairs of random variables may have a covariance of 0 but are not
independent. Only under some additional assumptions (e.g., the data follow
multivariate normal distributions) does a covariance of 0 imply independence
Covariance (Numeric Data)
431
Cov(A, B) = E[(A − Ā)(B − B̄)] = Σᵢ (aᵢ − Ā)(bᵢ − B̄) / n
Correlation coefficient: r(A,B) = Cov(A, B) / (σA σB)
• It can be simplified in computation as Cov(A, B) = E(A·B) − Ā·B̄
• Suppose two stocks A and B have the following values in one week: (2, 5), (3, 8), (5,
10), (4, 11), (6, 14).
• Question: If the stocks are affected by the same industry trends, will their prices rise
or fall together?
• E(A) = (2 + 3 + 5 + 4 + 6)/ 5 = 20/5 = 4
• E(B) = (5 + 8 + 10 + 11 + 14) /5 = 48/5 = 9.6
• Cov(A,B) = (2×5+3×8+5×10+4×11+6×14)/5 − 4 × 9.6 = 4
• Thus, A and B rise together since Cov(A, B) > 0
Covariance: An Example
432
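The simplified computation can be verified directly (a minimal sketch of the stock example):

```python
# Covariance of the two stocks via the simplified formula
# Cov(A,B) = E(A*B) - mean(A)*mean(B)
A = [2, 3, 5, 4, 6]
B = [5, 8, 10, 11, 14]
n = len(A)

mean_a, mean_b = sum(A) / n, sum(B) / n                  # 4.0, 9.6
cov = sum(a * b for a, b in zip(A, B)) / n - mean_a * mean_b

print(round(cov, 6))   # 4.0 -> Cov(A,B) > 0, so A and B tend to rise together
```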
1. Data quality: accuracy, completeness,
consistency, timeliness, believability,
interpretability
2. Data cleaning: e.g. missing/noisy values, outliers
3. Data reduction
• Dimensionality reduction
• Numerosity reduction
4. Data transformation and data discretization
• Normalization
5. Data integration from multiple sources:
• Entity identification problem
• Remove redundancies
• Detect inconsistencies
Summary
433
• Write a scientific paper from the slides (ppt) you have created,
using the template at http://journal.ilmukomputer.org
• The paper structure follows the format below:
1. Introduction
• Background of the problem and objectives
2. Related Work
• Other research that does something similar to what we do
3. Research Method
• How we analyze the data; explain that we use CRISP-DM
4. Results and Discussion
• 4.1 Business Understanding
• 4.2 Data Understanding
• 4.3 Data Preparation
• 4.4 Modeling
• 4.5 Evaluation
• 4.6 Deployment
5. Conclusion
• The conclusion must match the objectives
6. References
• List the references that were used
Assignment: Writing a Scientific Paper
434
• Analyze the problems and needs in an organization
around you
• Collect and review the available datasets, and relate those
problems and needs to the available data
(analyze them against the 5 roles of data mining)
• If possible, pick several roles at once to process the data,
e.g. perform association (factor analysis) together with
estimation or clustering
• Run the CRISP-DM process to solve the organization’s problems
with the data obtained
• In data preparation, do data cleaning (replace missing
values, replace, filter attributes) so the data is ready for modeling
• Also compare algorithms and feature selection to choose
the best pattern and model
• Summarize the evaluation of the resulting patterns/models/knowledge and
relate the evaluation results to the deployment performed
• Summarize it in slides, following the example of
Sarah’s case study supporting marketing
Assignment: Solving an Organizational Problem
435
4. Data Mining Algorithms
4.1 Classification Algorithms
4.2 Clustering Algorithms
4.3 Association Algorithms
4.4 Estimation and Forecasting Algorithms
436
4.1 Classification Algorithms
437
4.1.1 Decision Tree
438
• Basic algorithm (a greedy algorithm)
1. Tree is constructed in a top-down recursive divide-and-
conquer manner
2. At start, all the training examples are at the root
3. Attributes are categorical (if continuous-valued, they are
discretized in advance)
4. Examples are partitioned recursively based on selected
attributes
5. Test attributes are selected on the basis of a heuristic or
statistical measure (e.g., information gain, gain ratio, gini
index)
• Conditions for stopping partitioning
• All samples for a given node belong to the same class
• There are no remaining attributes for further partitioning –
majority voting is employed for classifying the leaf
• There are no samples left
Algorithm for Decision Tree Induction
439
Brief Review of Entropy
440
[Plot: entropy of a two-class (m = 2) distribution as a function of p, maximal at p = 0.5]
• Select the attribute with the highest information gain
• Let pi be the probability that an arbitrary tuple in D belongs to
class Ci, estimated by | Ci, D|/|D|
• Expected information (entropy) needed to classify a tuple in D:
• Information needed (after using A to split D into v partitions) to
classify D:
• Information gained by branching on attribute A
Attribute Selection Measure:
Information Gain (ID3)
441
Info(D) = − Σ (i=1..m) pi log2(pi)
InfoA(D) = Σ (j=1..v) (|Dj| / |D|) × Info(Dj)
Gain(A) = Info(D) − InfoA(D)
Attribute Selection: Information Gain
442
• Class P: buys_computer = “yes”
• Class N: buys_computer = “no”
The term (5/14) I(2,3) means “age <=30” has 5 out of
14 samples, with 2 yes’es and 3
no’s. Hence
Similarly,
age pi ni I(pi, ni)
<=30 2 3 0.971
31…40 4 0 0
>40 3 2 0.971
Infoage(D) = (5/14) I(2,3) + (4/14) I(4,0) + (5/14) I(3,2) = 0.694
Gain(age) = Info(D) − Infoage(D) = 0.246
Gain(income) = 0.029
Gain(student) = 0.151
Gain(credit_rating) = 0.048
age income student credit_rating buys_computer
<=30 high no fair no
<=30 high no excellent no
31…40 high no fair yes
>40 medium no fair yes
>40 low yes fair yes
>40 low yes excellent no
31…40 low yes excellent yes
<=30 medium no fair no
<=30 low yes fair yes
>40 medium yes fair yes
<=30 medium yes excellent yes
31…40 medium no excellent yes
31…40 high yes fair yes
>40 medium no excellent no
Info(D) = I(9,5) = −(9/14) log2(9/14) − (5/14) log2(5/14) = 0.940
• Let attribute A be a continuous-valued attribute
• Must determine the best split point for A
• Sort the value A in increasing order
• Typically, the midpoint between each pair of adjacent
values is considered as a possible split point
• (ai+ai+1)/2 is the midpoint between the values of ai and ai+1
• The point with the minimum expected information
requirement for A is selected as the split-point for A
• Split:
• D1 is the set of tuples in D satisfying A ≤ split-point, and
D2 is the set of tuples in D satisfying A > split-point
Computing Information-Gain for Continuous-
Valued Attributes
443
1. Prepare the training data
2. Choose an attribute as the root
3. Create a branch for each value
4. Repeat the process for each branch until
all cases in the branch have the same
class
Steps of the Decision Tree (ID3) Algorithm
444
Entropy(S) = Σ (i=1..n) −pi × log2(pi)
Gain(S, A) = Entropy(S) − Σ (i=1..n) (|Si| / |S|) × Entropy(Si)
1. Prepare the training data
445
• To choose the root attribute, use the attribute with the highest Gain
value. To obtain the Gain values, the Entropy values
must be computed first
• Entropy formula:
• S = set of cases
• n = number of partitions of S
• pi = proportion of Si to S
• Gain formula:
• S = set of cases
• A = attribute
• n = number of partitions of attribute A
• |Si| = number of cases in partition i
• |S| = number of cases in S
2. Choose an attribute as the root
446
Entropy(S) = Σ (i=1..n) −pi × log2(pi)
Gain(S, A) = Entropy(S) − Σ (i=1..n) (|Si| / |S|) × Entropy(Si)
Computing the Root Entropy and Gain
447
• Total Entropy
• Entropy (Outlook)
• Entropy (Temperature)
• Entropy (Humidity)
• Entropy (Windy)
Computing the Root Entropy
448
Computing the Root Entropy
449
NODE  ATTRIBUTE    VALUE    CASES (S)  YES (Si)  NO (Si)  ENTROPY  GAIN
1     TOTAL                 14         10        4        0.86312
      OUTLOOK      CLOUDY   4          4         0        0
                   RAINY    5          4         1        0.72193
                   SUNNY    5          2         3        0.97095
      TEMPERATURE  COOL     4          0         4        0
                   HOT      4          2         2        1
                   MILD     6          2         4        0.91830
      HUMIDITY     HIGH     7          4         3        0.98523
                   NORMAL   7          7         0        0
      WINDY        FALSE    8          2         6        0.81128
                   TRUE     6          4         2        0.91830
Computing the Root Gain
450
Computing the Root Gain
451
NODE  ATTRIBUTE    VALUE    CASES (S)  YES (Si)  NO (Si)  ENTROPY  GAIN
1     TOTAL                 14         10        4        0.86312
      OUTLOOK                                                      0.25852
                   CLOUDY   4          4         0        0
                   RAINY    5          4         1        0.72193
                   SUNNY    5          2         3        0.97095
      TEMPERATURE                                                  0.18385
                   COOL     4          0         4        0
                   HOT      4          2         2        1
                   MILD     6          2         4        0.91830
      HUMIDITY                                                     0.37051
                   HIGH     7          4         3        0.98523
                   NORMAL   7          7         0        0
      WINDY                                                        0.00598
                   FALSE    8          2         6        0.81128
                   TRUE     6          4         2        0.91830
• From Node 1, the attribute with the highest
Gain is HUMIDITY, at 0.37051
• HUMIDITY therefore becomes the root node
• HUMIDITY has 2 attribute values, HIGH and
NORMAL. Of these, NORMAL already
classifies its cases into a single
decision (Yes), so no further
computation is needed there
• For HIGH, further computation is still
required
Highest Gain as the Root
452
[Decision tree so far: HUMIDITY at the root; NORMAL → Yes, HIGH → node 1.1 (still to be determined)]
• For convenience, filter the dataset to the rows with
HUMIDITY = HIGH to build the Node 1.1 table
3. Create a branch for each value
453
OUTLOOK TEMPERATURE HUMIDITY WINDY PLAY
Sunny Hot High FALSE No
Sunny Hot High TRUE No
Cloudy Hot High FALSE Yes
Rainy Mild High FALSE Yes
Sunny Mild High FALSE No
Cloudy Mild High TRUE Yes
Rainy Mild High TRUE No
Computing the Branch Entropy and Gain
454
NODE  ATTRIBUTE    VALUE    CASES (S)  YES (Si)  NO (Si)  ENTROPY  GAIN
1.1   HUMIDITY              7          3         4        0.98523
      OUTLOOK                                                      0.69951
                   CLOUDY   2          2         0        0
                   RAINY    2          1         1        1
                   SUNNY    3          0         3        0
      TEMPERATURE                                                  0.02024
                   COOL     0          0         0        0
                   HOT      3          1         2        0.91830
                   MILD     4          2         2        1
      WINDY                                                        0.02024
                   FALSE    4          2         2        1
                   TRUE     3          1         2        0.91830
• From the Node 1.1 table, the attribute with the
highest Gain is OUTLOOK, at
0.69951
• OUTLOOK therefore becomes the second node
• The values CLOUDY = YES and SUNNY = NO
already classify their cases into a single
decision, so no further computation is
needed there
• For RAINY, further computation is still
required
Highest Gain as Node 1.1
455
[Decision tree so far: HUMIDITY at the root; NORMAL → Yes; HIGH → OUTLOOK: CLOUDY → Yes, SUNNY → No, RAINY → node 1.1.2 (still to be determined)]
4. Repeat the process for each branch until
all cases in the branch have the same class
456
OUTLOOK TEMPERATURE HUMIDITY WINDY PLAY
Rainy Mild High FALSE Yes
Rainy Mild High TRUE No
NODE  ATTRIBUTE                       VALUE  CASES (S)  YES (Si)  NO (Si)  ENTROPY  GAIN
1.2   HUMIDITY HIGH & OUTLOOK RAINY          2          1         1        1
      TEMPERATURE                                                                   0
                                      COOL   0          0         0        0
                                      HOT    0          0         0        0
                                      MILD   2          1         1        1
      WINDY                                                                         1
                                      FALSE  1          1         0        0
                                      TRUE   1          0         1        0
• From the table, the highest Gain
is WINDY, which becomes the
child node of the RAINY
branch
• All cases now fall into a
single class per leaf
• So the decision tree in
the figure is the final
decision tree that is
formed
Highest Gain as Node 1.1.2
457
[Final decision tree: HUMIDITY at the root; NORMAL → Yes; HIGH → OUTLOOK: CLOUDY → Yes, SUNNY → No; RAINY → WINDY: FALSE → Yes, TRUE → No]
• Training data set:
Buys_computer
Decision Tree Induction: An Example
458
age income student credit_rating buys_computer
<=30 high no fair no
<=30 high no excellent no
31…40 high no fair yes
>40 medium no fair yes
>40 low yes fair yes
>40 low yes excellent no
31…40 low yes excellent yes
<=30 medium no fair no
<=30 low yes fair yes
>40 medium yes fair yes
<=30 medium yes excellent yes
31…40 medium no excellent yes
31…40 high yes fair yes
>40 medium no excellent no
• Information gain measure is biased towards attributes with
a large number of values
• C4.5 (a successor of ID3) uses gain ratio to overcome the
problem (normalization to information gain)
• GainRatio(A) = Gain(A)/SplitInfo(A)
• Ex.
• gain_ratio(income) = 0.029/1.557 = 0.019
• The attribute with the maximum gain ratio is selected as the
splitting attribute
Gain Ratio for Attribute Selection (C4.5)
459
$$SplitInfo_A(D) = -\sum_{j=1}^{v} \frac{|D_j|}{|D|} \times \log_2\left(\frac{|D_j|}{|D|}\right)$$
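Plugging the buys_computer table shown earlier into this formula reproduces the 1.557 and 0.019 quoted above (a minimal sketch; gain(income) = 0.029 is taken from the slide rather than recomputed):

```python
import math

def split_info(partition_sizes):
    """SplitInfo_A(D) = -sum(|Dj|/|D| * log2(|Dj|/|D|))."""
    n = sum(partition_sizes)
    return -sum(s / n * math.log2(s / n) for s in partition_sizes if s)

# income splits the 14 buys_computer tuples into high=4, medium=6, low=4
si = split_info([4, 6, 4])
gain_income = 0.029              # taken from the slide, not recomputed here
gain_ratio = gain_income / si
print(round(si, 3), round(gain_ratio, 3))  # 1.557 0.019
```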
• If a data set D contains examples from n classes, gini index, gini(D) is
defined as
where pj is the relative frequency of class j in D
• If a data set D is split on A into two subsets D1 and D2, the gini index
gini(D) is defined as
• Reduction in Impurity:
• The attribute provides the smallest ginisplit(D) (or the largest reduction
in impurity) is chosen to split the node (need to enumerate all the
possible splitting points for each attribute)
Gini Index (CART)
460
$$gini(D) = 1 - \sum_{j=1}^{n} p_j^2$$

$$gini_A(D) = \frac{|D_1|}{|D|}\, gini(D_1) + \frac{|D_2|}{|D|}\, gini(D_2)$$

$$\Delta gini(A) = gini(D) - gini_A(D)$$
• Ex. D has 9 tuples in buys_computer = “yes” and 5 in “no”
• Suppose the attribute income partitions D into 10 in D1: {low, medium}
and 4 in D2
Gini{low,high} is 0.458; Gini{medium,high} is 0.450. Thus, split on the
{low,medium} (and {high}) since it has the lowest Gini index
• All attributes are assumed continuous-valued
• May need other tools, e.g., clustering, to get the possible split values
• Can be modified for categorical attributes
Computation of Gini Index
461
$$gini(D) = 1 - \left(\frac{9}{14}\right)^2 - \left(\frac{5}{14}\right)^2 = 0.459$$

$$gini_{income \in \{low,\,medium\}}(D) = \frac{10}{14}\, Gini(D_1) + \frac{4}{14}\, Gini(D_2)$$
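The same numbers can be checked in a few lines (class counts read off the buys_computer table shown earlier: D1 = {low, medium} holds 7 yes / 3 no, D2 = {high} holds 2 yes / 2 no):

```python
def gini(counts):
    """gini(D) = 1 - sum(pj^2) over the class counts."""
    n = sum(counts)
    return 1 - sum((c / n) ** 2 for c in counts)

# D: 9 tuples with buys_computer = yes, 5 with no
g_d = gini([9, 5])

# income splits D into D1 = {low, medium} (7 yes, 3 no) and D2 = {high} (2 yes, 2 no)
g_split = 10 / 14 * gini([7, 3]) + 4 / 14 * gini([2, 2])
print(round(g_d, 3), round(g_split, 3))  # 0.459 0.443
```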
The three measures, in general, return good results but
• Information gain:
• biased towards multivalued attributes
• Gain ratio:
• tends to prefer unbalanced splits in which one partition is much
smaller than the others
• Gini index:
• biased to multivalued attributes
• has difficulty when # of classes is large
• tends to favor tests that result in equal-sized partitions and purity
in both partitions
Comparing Attribute Selection Measures
462
• CHAID: a popular decision tree algorithm, measure based on χ2 test for
independence
• C-SEP: performs better than info. gain and gini index in certain cases
• G-statistic: has a close approximation to χ2 distribution
• MDL (Minimal Description Length) principle (i.e., the simplest solution is
preferred):
• The best tree as the one that requires the fewest # of bits to both (1)
encode the tree, and (2) encode the exceptions to the tree
• Multivariate splits (partition based on multiple variable combinations)
• CART: finds multivariate splits based on a linear comb. of attrs.
• Which attribute selection measure is the best?
• Most give good results; none is significantly superior to the others
Other Attribute Selection Measures
463
• Overfitting: An induced tree may overfit the training data
• Too many branches, some may reflect anomalies due to
noise or outliers
• Poor accuracy for unseen samples
• Two approaches to avoid overfitting
1. Prepruning: Halt tree construction early, i.e., do not split a
node if this would result in the goodness measure falling
below a threshold
• Difficult to choose an appropriate threshold
2. Postpruning: Remove branches from a “fully grown” tree
to get a sequence of progressively pruned trees
• Use a set of data different from the training data to decide which
is the “best pruned tree”
Overfitting and Tree Pruning
464
465
Pruning
• Relatively faster learning speed (than other
classification methods)
• Convertible to simple and easy to understand
classification rules
• Can use SQL queries for accessing databases
• Comparable classification accuracy with
other methods
Why is decision tree induction popular?
466
• Run the experiment following the book Matthew
North, Data Mining for the Masses 2nd Edition,
2016, Chapter 10 (Decision Tree), pp. 195-217
• Datasets:
• eReaderAdoption-Training.csv
• eReaderAdoption-Scoring.csv
• Analyze the role of pruning methods in decision
trees and their relationship to the confidence value
• Analyze which kinds of decision trees are
used and why they need to be applied to
this dataset
Exercise
467
Motivation:
• Richard works for a large online retailer
• His company is launching a tablet soon, and he wants to maximize the
effectiveness of his marketing
• They have a large number of customers, many of whom have
purchased digital devices and other services previously
• Richard has noticed that certain types of people were anxious to get
new devices as soon as they became available, while other folks
seemed content to wait to buy their electronic gadgets later
• He’s wondering what makes some people motivated to buy
something as soon as it comes out, while others are less driven to
have the product right away
Objectives:
• To mine the customers’ consumer behaviors on the web site, in
order to figure out which customers will buy the new tablet early,
which ones will buy next, and which ones will buy later on
1 Business Understanding
468
• Train on the eReader Adoption data
(eReader-Training.csv) using a decision tree
with 3 alternative criteria (Gain Ratio,
Information Gain, and Gini Index)
• Test each split criterion both with and
without pruning
• Evaluate using 10-fold cross validation
• From the best model, determine which factors (attributes)
influence the eReader adoption rate
Exercise
469
DTGR DTIG DTGI DTGR+Pr DTIG+Pr DTGI+Pr
Accuracy 58.39 51.01 31.01
• Perform feature selection with Forward Selection
for the three algorithms above
• Evaluate using 10-fold cross validation
• From the best model, determine which factors (attributes)
influence the eReader adoption rate
Exercise
470
DTGR DTIG DTGI DTGR+FS DTIG+FS DTGI+FS
Accuracy 58.39 51.01 31.01 61.41 56.73 31.01
471
4.1.2 Bayesian Classification
472
• A statistical classifier: performs probabilistic prediction,
i.e., predicts class membership probabilities
• Foundation: Based on Bayes’ Theorem.
• Performance: A simple Bayesian classifier, the naïve
Bayesian classifier, has performance comparable with
decision tree and selected neural network classifiers
• Incremental: Each training example can incrementally
increase/decrease the probability that a hypothesis is
correct — prior knowledge can be combined with
observed data
• Standard: Even when Bayesian methods are
computationally intractable, they can provide a
standard of optimal decision making against which
other methods can be measured
Bayesian Classification: Why?
473
• Total probability Theorem:
• Bayes’ Theorem:
• Let X be a data sample (“evidence”): class label is unknown
• Let H be a hypothesis that X belongs to class C
• Classification is to determine P(H|X), (i.e., posteriori probability):
the probability that the hypothesis holds given the observed data
sample X
• P(H) (prior probability): the initial probability
• E.g., X will buy computer, regardless of age, income, …
• P(X): probability that sample data is observed
• P(X|H) (likelihood): the probability of observing the sample X, given
that the hypothesis holds
• E.g., Given that X will buy computer, the prob. that X is 31..40,
medium income
Bayes’ Theorem: Basics
474
$$P(B) = \sum_{i=1}^{M} P(B|A_i)\, P(A_i)$$

$$P(H|\mathbf{X}) = \frac{P(\mathbf{X}|H)\, P(H)}{P(\mathbf{X})}$$
• Given training data X, posteriori probability of a hypothesis H,
P(H|X), follows the Bayes’ theorem
• Informally, this can be viewed as
posteriori = likelihood x prior/evidence
• Predicts X belongs to Ci iff the probability P(Ci|X) is the highest
among all the P(Ck|X) for all the k classes
• Practical difficulty: It requires initial knowledge of many
probabilities, involving significant computational cost
Prediction Based on Bayes’ Theorem
475
$$P(H|\mathbf{X}) = \frac{P(\mathbf{X}|H)\, P(H)}{P(\mathbf{X})}$$
• Let D be a training set of tuples and their associated class
labels, and each tuple is represented by an n-D attribute
vector X = (x1, x2, …, xn)
• Suppose there are m classes C1, C2, …, Cm.
• Classification is to derive the maximum posteriori, i.e., the
maximal P(Ci|X)
• This can be derived from Bayes’ theorem
• Since P(X) is constant for all classes, only
needs to be maximized
Classification is to Derive the Maximum
Posteriori
476
$$P(C_i|\mathbf{X}) = \frac{P(\mathbf{X}|C_i)\, P(C_i)}{P(\mathbf{X})}$$

$$P(\mathbf{X}|C_i)\, P(C_i)$$
• A simplified assumption: attributes are conditionally
independent (i.e., no dependence relation between
attributes):
• This greatly reduces the computation cost: Only counts the
class distribution
• If Ak is categorical, P(xk|Ci) is the # of tuples in Ci having
value xk for Ak divided by |Ci, D| (# of tuples of Ci in D)
• If Ak is continuous-valued, P(xk|Ci) is usually computed based
on Gaussian distribution with a mean μ and standard
deviation σ
and P(xk|Ci) is
Naïve Bayes Classifier
477
$$P(\mathbf{X}|C_i) = \prod_{k=1}^{n} P(x_k|C_i) = P(x_1|C_i) \times P(x_2|C_i) \times \cdots \times P(x_n|C_i)$$

$$g(x, \mu, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$

$$P(x_k|C_i) = g(x_k, \mu_{C_i}, \sigma_{C_i})$$
Naïve Bayes Classifier: Training Dataset
478
Class:
C1:buys_computer = ‘yes’
C2:buys_computer = ‘no’
Data to be classified:
X = (age <=30,
income = medium,
student = yes,
credit_rating = fair)
X  buy computer?
age income student credit_rating buys_computer
<=30 high no fair no
<=30 high no excellent no
31…40 high no fair yes
>40 medium no fair yes
>40 low yes fair yes
>40 low yes excellent no
31…40 low yes excellent yes
<=30 medium no fair no
<=30 low yes fair yes
>40 medium yes fair yes
<=30 medium yes excellent yes
31…40 medium No excellent yes
31…40 high Yes fair yes
>40 medium No excellent no
• P(Ci): P(buys_computer = “yes”) = 9/14 = 0.643
P(buys_computer = “no”) = 5/14= 0.357
• Compute P(X|Ci) for each class
P(age = “<=30” | buys_computer = “yes”) = 2/9 = 0.222
P(age = “<= 30” | buys_computer = “no”) = 3/5 = 0.6
P(income = “medium” | buys_computer = “yes”) = 4/9 = 0.444
P(income = “medium” | buys_computer = “no”) = 2/5 = 0.4
P(student = “yes” | buys_computer = “yes) = 6/9 = 0.667
P(student = “yes” | buys_computer = “no”) = 1/5 = 0.2
P(credit_rating = “fair” | buys_computer = “yes”) = 6/9 = 0.667
P(credit_rating = “fair” | buys_computer = “no”) = 2/5 = 0.4
• X = (age <= 30 , income = medium, student = yes, credit_rating = fair)
P(X|Ci) : P(X|buys_computer = “yes”) = 0.222 x 0.444 x 0.667 x 0.667 = 0.044
P(X|buys_computer = “no”) = 0.6 x 0.4 x 0.2 x 0.4 = 0.019
P(X|Ci)*P(Ci) : P(X|buys_computer = “yes”) * P(buys_computer = “yes”) = 0.028
P(X|buys_computer = “no”) * P(buys_computer = “no”) = 0.007
Therefore, X belongs to class (“buys_computer = yes”)
Naïve Bayes Classifier:
An Example
479
age income student credit_rating buys_computer
<=30 high no fair no
<=30 high no excellent no
31…40 high no fair yes
>40 medium no fair yes
>40 low yes fair yes
>40 low yes excellent no
31…40 low yes excellent yes
<=30 medium no fair no
<=30 low yes fair yes
>40 medium yes fair yes
<=30 medium yes excellent yes
31…40 medium no excellent yes
31…40 high yes fair yes
>40 medium no excellent no
1. Read the training data
2. Count the number of classes
3. Count the number of cases that match each class
4. Multiply all the resulting values according to the data X
whose class is sought
Steps of the Naïve Bayes Algorithm
480
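The four steps can be sketched for the buys_computer table above (a minimal counts-based sketch for categorical attributes; function and variable names are mine, not the slides'):

```python
from collections import Counter

# (age, income, student, credit_rating) -> buys_computer, from the slide's table
data = [
    ("<=30", "high", "no", "fair", "no"), ("<=30", "high", "no", "excellent", "no"),
    ("31..40", "high", "no", "fair", "yes"), (">40", "medium", "no", "fair", "yes"),
    (">40", "low", "yes", "fair", "yes"), (">40", "low", "yes", "excellent", "no"),
    ("31..40", "low", "yes", "excellent", "yes"), ("<=30", "medium", "no", "fair", "no"),
    ("<=30", "low", "yes", "fair", "yes"), (">40", "medium", "yes", "fair", "yes"),
    ("<=30", "medium", "yes", "excellent", "yes"), ("31..40", "medium", "no", "excellent", "yes"),
    ("31..40", "high", "yes", "fair", "yes"), (">40", "medium", "no", "excellent", "no"),
]

def naive_bayes(data, x):
    class_counts = Counter(row[-1] for row in data)            # step 2
    n = len(data)
    scores = {}
    for c, cc in class_counts.items():
        score = cc / n                                          # P(Ci)
        for k, value in enumerate(x):                           # step 3
            match = sum(1 for row in data if row[-1] == c and row[k] == value)
            score *= match / cc                                 # P(xk|Ci)
        scores[c] = score                                       # step 4
    return max(scores, key=scores.get), scores

x = ("<=30", "medium", "yes", "fair")
label, scores = naive_bayes(data, x)
print(label, round(scores["yes"], 3), round(scores["no"], 3))  # yes 0.028 0.007
```

The two scores match the P(X|Ci)*P(Ci) values worked out on the example slide.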
1. Read the Training Data
481
• X  Data dengan class yang belum diketahui
• H  Hipotesis data X yang merupakan suatu class
yang lebih spesifik
• P (H|X)  Probabilitas hipotesis H berdasarkan kondisi X
(posteriori probability)
• P (H)  Probabilitas hipotesis H (prior probability)
• P (X|H)  Probabilitas X berdasarkan kondisi pada hipotesis H
• P (X)  Probabilitas X
Teorema Bayes
482
$$P(H|\mathbf{X}) = \frac{P(\mathbf{X}|H)\, P(H)}{P(\mathbf{X})}$$
• There are 2 classes in the training data:
• C1 (Class 1): Play = yes → 9 records
• C2 (Class 2): Play = no → 5 records
• Total = 14 records
• Therefore:
• P(C1) = 9/14 = 0.642857143
• P(C2) = 5/14 = 0.357142857
• Question:
• Data X = (outlook=rainy, temperature=cool, humidity=high, windy=true)
• Play golf or not?
2. Count the Number of Classes/Labels
483
• P(Ci), i.e. P(C1) and P(C2), was already computed
in the previous step
• Next, compute P(X|Ci) for i = 1 and 2:
• P(outlook=“sunny”|play=“yes”)=2/9=0.222222222
• P(outlook=“sunny”|play=“no”)=3/5=0.6
• P(outlook=“overcast”|play=“yes”)=4/9=0.444444444
• P(outlook=“overcast”|play=“no”)=0/5=0
• P(outlook=“rainy”|play=“yes”)=3/9=0.333333333
• P(outlook=“rainy”|play=“no”)=2/5=0.4
3. Count the Cases that Match Each Class
484
• If all attributes are computed, the final results
are as follows:
3. Count the Cases that Match Each Class
485
Attribute Parameter No Yes
Outlook value=sunny 0.6 0.2222222222222222
Outlook value=cloudy 0.0 0.4444444444444444
Outlook value=rainy 0.4 0.3333333333333333
Temperature value=hot 0.4 0.2222222222222222
Temperature value=mild 0.4 0.4444444444444444
Temperature value=cool 0.2 0.3333333333333333
Humidity value=high 0.8 0.3333333333333333
Humidity value=normal 0.2 0.6666666666666666
Windy value=false 0.4 0.6666666666666666
Windy value=true 0.6 0.3333333333333333
• Question:
• Data X = (outlook=rainy, temperature=cool, humidity=high,
windy=true)
• Play golf or not?
• Multiply all the values for data X:
• P(X|play=“yes”) = 0.333333333 * 0.333333333 *
0.333333333 * 0.333333333 = 0.012345679
• P(X|play=“no”) = 0.4 * 0.2 * 0.8 * 0.6 = 0.0384
• P(X|play=“yes”) * P(C1) = 0.012345679 * 0.642857143
= 0.007936508
• P(X|play=“no”) * P(C2) = 0.0384 * 0.357142857
= 0.013714286
• The “no” value is larger than the “yes” value, so the class of
data X is “No”
4. Multiply All Values According to the Data X Whose Class Is Sought
486
• Naïve Bayesian prediction requires each conditional prob.
be non-zero. Otherwise, the predicted prob. will be zero
• Ex. Suppose a dataset with 1000 tuples, income=low (0),
income= medium (990), and income = high (10)
• Use Laplacian correction (or Laplacian estimator)
• Adding 1 to each case
Prob(income = low) = 1/1003
Prob(income = medium) = 991/1003
Prob(income = high) = 11/1003
• The “corrected” prob. estimates are close to their
“uncorrected” counterparts
Avoiding the Zero-Probability Problem
487
$$P(\mathbf{X}|C_i) = \prod_{k=1}^{n} P(x_k|C_i)$$
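The Laplacian correction itself is a one-liner (a sketch; `laplace_probs` is a hypothetical helper name):

```python
def laplace_probs(counts, k=1):
    """Laplacian correction: add k to each count before normalizing."""
    total = sum(counts.values()) + k * len(counts)
    return {value: (c + k) / total for value, c in counts.items()}

# the slide's example: 1000 tuples, income = low/medium/high
income = {"low": 0, "medium": 990, "high": 10}
probs = laplace_probs(income)  # low: 1/1003, medium: 991/1003, high: 11/1003
print(probs)
```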
• Advantages
• Easy to implement
• Good results obtained in most of the cases
• Disadvantages
• Assumption: class conditional independence, therefore
loss of accuracy
• Practically, dependencies exist among variables, e.g.:
• Hospitals Patients Profile: age, family history, etc.
• Symptoms: fever, cough etc.,
• Disease: lung cancer, diabetes, etc.
• Dependencies among these cannot be modeled by
Naïve Bayes Classifier
• How to deal with these dependencies? Bayesian Belief
Networks
Naïve Bayes Classifier: Comments
488
4.1.3 Neural Network
489
• A Neural Network is a model built to imitate the
learning function of the human brain: a network of
small processing units modeled on the human
nervous system
Neural Network
490
• The Perceptron model is a network consisting of
several input units (plus a bias) with a single
output unit
• The activation function is not just binary (0,1)
but bipolar (1, 0, -1)
• For a given threshold θ:
f(net) = 1 if net > θ; 0 if -θ ≤ net ≤ θ; -1 if net < -θ
Neural Network
491
Activation functions used in various kinds of
neural networks:
1. Linear activation: y = sign(v) = v
2. Step activation
3. Binary sigmoid activation
4. Bipolar sigmoid activation
Activation Functions
492
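The slide's formulas for these activations did not survive extraction; the standard textbook definitions can be written as follows (a sketch, not the slide's own notation):

```python
import math

def linear(v):
    return v                           # identity: y = v

def step(v, theta=0.0):
    return 1 if v > theta else 0       # binary step

def sigmoid_binary(v):
    return 1 / (1 + math.exp(-v))      # range (0, 1)

def sigmoid_bipolar(v):
    return 2 / (1 + math.exp(-v)) - 1  # range (-1, 1); equals tanh(v/2)

print(round(sigmoid_binary(0), 2), round(sigmoid_bipolar(0), 2))  # 0.5 0.0
```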
1. Initialize all weights and the bias (commonly wi = b = 0)
2. While any input vector produces an output unit response that
does not match its target:
2.1 Set the input unit activations xi = si (i = 1,...,n)
2.2 Compute the output unit response: net = Σi xi wi + b
f(net) = 1 if net > θ; 0 if -θ ≤ net ≤ θ; -1 if net < -θ
2.3 Update the weights of every pattern that contains an error:
wi(new) = wi(old) + Δw (i = 1,...,n) with Δw = α t xi
b(new) = b(old) + Δb with Δb = α t
where: α = the chosen learning rate, θ = the chosen threshold,
t = target
2.4 Repeat the iterations until there is no more weight change (Δwn = 0)
Steps of the Perceptron Algorithm
493
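Steps 1-2.4 can be sketched as follows (α = 1, θ = 0; note that some intermediate values in the slides' iteration tables look garbled, so this follows the update rule itself and may not reproduce every tabulated number):

```python
def train_perceptron(data, theta=0.0, alpha=1.0, max_epochs=100):
    """data: list of ((x1, x2), target) with target in {1, -1}."""
    w, b = [0.0, 0.0], 0.0                   # step 1: zero weights and bias
    for _ in range(max_epochs):
        changed = False
        for x, t in data:
            net = sum(wi * xi for wi, xi in zip(w, x)) + b   # step 2.2
            f = 1 if net > theta else (-1 if net < -theta else 0)
            if f != t:                       # step 2.3: update on error
                w = [wi + alpha * t * xi for wi, xi in zip(w, x)]
                b += alpha * t
                changed = True
        if not changed:                      # step 2.4: stop when stable
            break
    return w, b

# graduation data: (GPA, semester) -> 1 = graduated, -1 = not graduated
data = [((2.9, 1), 1), ((2.8, 3), -1), ((2.3, 5), -1), ((2.7, 6), -1)]
w, b = train_perceptron(data)
net = w[0] * 2.85 + w[1] * 1 + b             # the GPA-2.85, semester-1 student
print(1 if net > 0 else -1)                  # 1 (graduated), as on the slide
```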
• Given a graduation dataset based on GPA (IPK)
for an undergraduate (S1) program:
• If a student has a GPA of 2.85 and is still in semester 1,
which status does this student fall into?
Case Study
494
Status         GPA  Semester
Graduated      2.9  1
Not Graduated  2.8  3
Not Graduated  2.3  5
Not Graduated  2.7  6
• Initialize the weights and bias: w = b = 0, with a constant bias input of 1
1: Initialize the Weights
495
t   X1   X2
1   2.9  1
-1  2.8  3
-1  2.3  5
-1  2.7  6
• Threshold θ = 0, which means:
f(net) = 1 if net > 0; 0 if net = 0; -1 if net < 0
2.1: Set the Input Unit Activations
496
• Compute the output response for iteration 1
• Update the weights of the patterns that contain errors
2.2-2.3 Compute the Response and Update the Weights
497
INPUT           TARGET  OUTPUT          WEIGHT CHANGE    NEW WEIGHTS
X1   X2   1     t       NET     f(NET)  ΔW1   ΔW2   Δb   W1    W2   b
INITIALIZATION                                           0     0    0
2.9  1    1     1       0       0       2.9   1     1    2.9   7    1
2.8  3    1     -1      8.12    1       -2.8  -3    -1   0.1   4    0
2.3  5    1     -1      0.23    1       -2.3  -5    -1   -2.2  -1   -1
2.7  6    1     -1      -5.94   -1      0     0     0    -2.2  -1   -1
• Compute the output response for iteration 2
• Update the weights of the patterns that contain errors
2.4 Repeat the iterations until there is no more weight
change (Δwn = 0) (Iteration 2)
498
INPUT           TARGET  OUTPUT          WEIGHT CHANGE    NEW WEIGHTS
X1   X2   1     t       NET     f(NET)  ΔW1   ΔW2   Δb   W1    W2   b
INITIALIZATION                                           -2.2  -1   -1
2.9  1    1     1       -8.38   -1      2.9   1     1    0.7   0    0
2.8  3    1     -1      1.96    1       -2.8  -3    -1   -2.1  -3   -1
2.3  5    1     -1      -20.83  -1      0     0     0    -2.1  -3   -1
2.7  6    1     -1      -24.67  -1      0     0     0    -2.1  -3   -1
• Compute the output response for iteration 3
• Update the weights of the patterns that contain errors
• For the GPA data the pattern 0.8x - 2y = 0 is obtained; the prediction can be
computed with the last weights found:
V = X1*W1 + X2*W2 = 0.8 * 2.85 - 2 * 1 = 2.28 - 2 = 0.28
Y = sign(V) = sign(0.28) = 1 (Graduated)
2.4 Repeat the iterations until there is no more weight
change (Δwn = 0) (Iteration 3)
499
INPUT           TARGET  OUTPUT          WEIGHT CHANGE    NEW WEIGHTS
X1   X2   1     t       NET     f(NET)  ΔW1   ΔW2   Δb   W1    W2   b
INITIALIZATION                                           -2.1  -3   -1
2.9  1    1     1       -10.09  -1      2.9   1     1    0.8   -2   0
2.8  3    1     -1      -3.76   -1      0     0     0    0.8   -2   0
2.3  5    1     -1      -8.16   -1      0     0     0    0.8   -2   0
2.7  6    1     -1      -9.84   -1      0     0     0    0.8   -2   0
• Run the experiment following the book
Matthew North, Data Mining for the Masses
2nd Edition, 2016, Chapter 11 (Neural
Network), pp. 219-228
• Dataset:
• TeamValue-Training.csv
• TeamValue-Scoring.csv
• Study the resulting neural network model;
note the thickness of the lines
connecting the nodes
Exercise
500
Motivation:
• Juan is a performance analyst for a major professional athletic team
• His team has been steadily improving over recent seasons, and heading
into the coming season management believes that by adding between
two and four excellent players, the team will have an outstanding shot
at achieving the league championship
• They have tasked Juan with identifying their best options from among a
list of 59 players that may be available to them
• All of these players have experience; some have played professionally
before and some have years of experience as amateurs
• None are to be ruled out without being assessed for their potential
ability to add star power and productivity to the existing team
• The executives Juan works for are anxious to get going on contacting the
most promising prospects, so Juan needs to quickly evaluate these
athletes’ past performance and make recommendations based on his
analysis
Objectives:
• To evaluate each of the 59 prospects’ past statistical performance in
order to help him formulate recommendations based on his analysis
1. Business Understanding
501
• Train a neural network on the
TeamValue-Training.csv dataset
• Use 10-fold cross validation
• Adjust the hidden layer and
neuron size, e.g., 3 hidden layers with
a neuron size of 5 each
• What happens? Is there an improvement
in accuracy?
Exercise
502
          NN  NN (HL 2, NS 3)  NN (HL 2, NS 5)  NN (HL 3, NS 3)  NN (HL 3, NS 5)  NN (HL 4, NS 3)  NN (HL 4, NS 5)
Accuracy
Hidden Layers  Capabilities
0              Only capable of representing linearly separable functions
               or decisions
1              Can approximate any function that contains a continuous
               mapping from one finite space to another
2              Can represent an arbitrary decision boundary to
               arbitrary accuracy with rational activation functions and
               can approximate any smooth mapping to any accuracy
Determining the Number of Hidden Layers
503
1. Trial and Error
2. Rule of Thumb:
• Between the size of the input layer and the size of the
output layer
• 2/3 the size of the input layer, plus the size of the output
layer
• Less than twice the size of the input layer
3. Search Algorithm:
• Greedy
• Genetic Algorithm
• Particle Swarm Optimization
• etc
Determining the Neuron Size
504
Techniques to Improve Classification
Accuracy: Ensemble Methods
505
• Ensemble methods
• Use a combination of models to increase accuracy
• Combine a series of k learned models, M1, M2, …, Mk, with
the aim of creating an improved model M*
• Popular ensemble methods
• Bagging: averaging the prediction over a collection of
classifiers
• Boosting: weighted vote with a collection of classifiers
• Ensemble: combining a set of heterogeneous classifiers
Ensemble Methods: Increasing the Accuracy
506
• Analogy: Diagnosis based on multiple doctors’ majority vote
• Training
• Given a set D of d tuples, at each iteration i, a training set Di of d
tuples is sampled with replacement from D (i.e., bootstrap)
• A classifier model Mi is learned for each training set Di
• Classification: classify an unknown sample X
• Each classifier Mi returns its class prediction
• The bagged classifier M* counts the votes and assigns the class
with the most votes to X
• Prediction: can be applied to the prediction of continuous values by
taking the average value of each prediction for a given test tuple
• Accuracy
• Often significantly better than a single classifier derived from D
• For noise data: not considerably worse, more robust
• Proved improved accuracy in prediction
Bagging: Bootstrap Aggregation
507
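A bare-bones sketch of bagging, using decision stumps on 1-D toy data as the base learner (illustrative only; names are hypothetical):

```python
import random

def train_stump(data):
    """Pick the threshold/orientation with the best training accuracy."""
    best, best_acc = None, -1.0
    for t in sorted({x for x, _ in data}):
        for side in (0, 1):                      # which class sits above t
            acc = sum((side if x >= t else 1 - side) == y
                      for x, y in data) / len(data)
            if acc > best_acc:
                best, best_acc = (t, side), acc
    t, side = best
    return lambda x: side if x >= t else 1 - side

def bagging(data, k=15, seed=0):
    rng = random.Random(seed)
    models = []
    for _ in range(k):
        boot = [rng.choice(data) for _ in data]  # bootstrap: sample with replacement
        models.append(train_stump(boot))
    def classify(x):                             # majority vote of the k models
        votes = [m(x) for m in models]
        return max(set(votes), key=votes.count)
    return classify

data = [(x, 0) for x in range(1, 11)] + [(x, 1) for x in range(11, 21)]
model = bagging(data)
print(model(2), model(13))  # 0 1
```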
• Analogy: Consult several doctors, based on a combination of
weighted diagnoses—weight assigned based on the previous
diagnosis accuracy
• How boosting works?
1. Weights are assigned to each training tuple
2. A series of k classifiers is iteratively learned
3. After a classifier Mi is learned, the weights are updated to
allow the subsequent classifier, Mi+1, to pay more attention
to the training tuples that were misclassified by Mi
4. The final M* combines the votes of each individual
classifier, where the weight of each classifier's vote is a
function of its accuracy
• Boosting algorithm can be extended for numeric prediction
• Comparing with bagging: Boosting tends to have greater accuracy,
but it also risks overfitting the model to misclassified data
Boosting
508
1. Given a set of d class-labeled tuples, (X1, y1), …, (Xd, yd)
2. Initially, all the weights of tuples are set the same (1/d)
3. Generate k classifiers in k rounds. At round i,
1. Tuples from D are sampled (with replacement) to form a training
set Di of the same size
2. Each tuple’s chance of being selected is based on its weight
3. A classification model Mi is derived from Di
4. Its error rate is calculated using Di as a test set
5. If a tuple is misclassified, its weight is increased; otherwise
it is decreased
4. Error rate: err(Xj) is the misclassification error of tuple Xj. Classifier Mi
error rate is the sum of the weights of the misclassified tuples:
5. The weight of classifier Mi’s vote is
Adaboost (Freund and Schapire, 1997)
509
$$error(M_i) = \sum_{j=1}^{d} w_j \times err(\mathbf{X}_j)$$

$$\log \frac{1 - error(M_i)}{error(M_i)}$$
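The error-rate and vote-weight formulas can be checked numerically (toy weights; a minimal sketch):

```python
import math

def classifier_error(weights, miss):
    """error(Mi) = sum of the weights of misclassified tuples (miss[j] in {0,1})."""
    return sum(w * m for w, m in zip(weights, miss))

def vote_weight(error):
    """Weight of classifier Mi's vote: log((1 - error) / error)."""
    return math.log((1 - error) / error)

weights = [1 / 5] * 5       # initially all tuple weights are 1/d
miss = [0, 1, 0, 0, 1]      # say tuples 2 and 5 were misclassified
err = classifier_error(weights, miss)
print(round(err, 1), round(vote_weight(err), 3))  # 0.4 0.405
```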
• Random Forest:
• Each classifier in the ensemble is a decision tree classifier and is
generated using a random selection of attributes at each node to
determine the split
• During classification, each tree votes and the most popular class is
returned
• Two Methods to construct Random Forest:
1. Forest-RI (random input selection): Randomly select, at each
node, F attributes as candidates for the split at the node. The
CART methodology is used to grow the trees to maximum size
2. Forest-RC (random linear combinations): Creates new attributes
(or features) that are a linear combination of the existing
attributes (reduces the correlation between individual classifiers)
• Comparable in accuracy to Adaboost, but more robust to
errors and outliers
• Insensitive to the number of attributes selected for
consideration at each split, and faster than bagging or
boosting
Random Forest (Breiman 2001)
510
• Class-imbalance problem: Rare positive example but
numerous negative ones, e.g., medical diagnosis, fraud, oil-
spill, fault, etc.
• Traditional methods assume a balanced distribution of
classes and equal error costs: not suitable for class-
imbalanced data
• Typical methods for imbalance data in 2-class classification:
1. Oversampling: re-sampling of data from positive class
2. Under-sampling: randomly eliminate tuples from
negative class
3. Threshold-moving: moves the decision threshold, t, so
that the rare class tuples are easier to classify, and
hence, less chance of costly false negative errors
4. Ensemble techniques: Ensemble multiple classifiers
introduced above
• Still difficult for class imbalance problem on multiclass tasks
Classification of Class-Imbalanced Data Sets
511
4.2 Clustering Algorithms
4.2.1 Partitioning Methods
4.2.2 Hierarchical Methods
4.2.3 Density-Based Methods
4.2.4 Grid-Based Methods
512
• Cluster: A collection of data objects
• similar (or related) to one another within the same
group
• dissimilar (or unrelated) to the objects in other groups
• Cluster analysis (or clustering, data segmentation, …)
• Finding similarities between data according to the
characteristics found in the data and grouping similar
data objects into clusters
• Unsupervised learning: no predefined classes (i.e., learning
by observations vs. learning by examples: supervised)
• Typical applications
• As a stand-alone tool to get insight into data distribution
• As a preprocessing step for other algorithms
What is Cluster Analysis?
513
• Data reduction
• Summarization: Preprocessing for regression, PCA,
classification, and association analysis
• Compression: Image processing: vector quantization
• Hypothesis generation and testing
• Prediction based on groups
• Cluster & find characteristics/patterns for each group
• Finding K-nearest Neighbors
• Localizing search to one or a small number of clusters
• Outlier detection: Outliers are often viewed as those “far
away” from any cluster
Applications of Cluster Analysis
514
• Biology: taxonomy of living things: kingdom, phylum,
class, order, family, genus and species
• Information retrieval: document clustering
• Land use: Identification of areas of similar land use in an
earth observation database
• Marketing: Help marketers discover distinct groups in
their customer bases, and then use this knowledge to
develop targeted marketing programs
• City-planning: Identifying groups of houses according to
their house type, value, and geographical location
• Earthquake studies: observed earthquake epicenters
should be clustered along continent faults
• Climate: understanding earth climate, find patterns of
atmospheric and ocean
• Economic Science: market research
Clustering: Application Examples
515
• Feature selection
• Select info concerning the task of interest
• Minimal information redundancy
• Proximity measure
• Similarity of two feature vectors
• Clustering criterion
• Expressed via a cost function or some rules
• Clustering algorithms
• Choice of algorithms
• Validation of the results
• Validation test (also, clustering tendency test)
• Interpretation of the results
• Integration with applications
Basic Steps to Develop a Clustering Task
516
• A good clustering method will produce high quality
clusters
• high intra-class similarity: cohesive within clusters
• low inter-class similarity: distinctive between clusters
• The quality of a clustering method depends on
• the similarity measure used by the method
• its implementation, and
• Its ability to discover some or all of the hidden patterns
Quality: What Is Good Clustering?
517
• Dissimilarity/Similarity metric
• Similarity is expressed in terms of a distance function,
typically metric: d(i, j)
• The definitions of distance functions are usually rather
different for interval-scaled, boolean, categorical,
ordinal, ratio, and vector variables
• Weights should be associated with different variables
based on applications and data semantics
• Quality of clustering:
• There is usually a separate “quality” function that
measures the “goodness” of a cluster.
• It is hard to define “similar enough” or “good enough”
• The answer is typically highly subjective
Measure the Quality of Clustering
518
• Partitioning criteria
• Single level vs. hierarchical partitioning (often, multi-level
hierarchical partitioning is desirable)
• Separation of clusters
• Exclusive (e.g., one customer belongs to only one region) vs. non-
exclusive (e.g., one document may belong to more than one class)
• Similarity measure
• Distance-based (e.g., Euclidian, road network, vector) vs.
connectivity-based (e.g., density or contiguity)
• Clustering space
• Full space (often when low dimensional) vs. subspaces (often in
high-dimensional clustering)
Considerations for Cluster Analysis
519
• Scalability
• Clustering all the data instead of only on samples
• Ability to deal with different types of attributes
• Numerical, binary, categorical, ordinal, linked, and mixture of
these
• Constraint-based clustering
• User may give inputs on constraints
• Use domain knowledge to determine input parameters
• Interpretability and usability
• Others
• Discovery of clusters with arbitrary shape
• Ability to deal with noisy data
• Incremental clustering and insensitivity to input order
• High dimensionality
Requirements and Challenges
520
• Partitioning approach:
• Construct various partitions and then evaluate them by some
criterion, e.g., minimizing the sum of square errors
• Typical methods: k-means, k-medoids, CLARANS
• Hierarchical approach:
• Create a hierarchical decomposition of the set of data (or
objects) using some criterion
• Typical methods: Diana, Agnes, BIRCH, CAMELEON
• Density-based approach:
• Based on connectivity and density functions
• Typical methods: DBSCAN, OPTICS, DenClue
• Grid-based approach:
• based on a multiple-level granularity structure
• Typical methods: STING, WaveCluster, CLIQUE
Major Clustering Approaches 1
521
• Model-based:
• A model is hypothesized for each of the clusters and tries to
find the best fit of that model to each other
• Typical methods: EM, SOM, COBWEB
• Frequent pattern-based:
• Based on the analysis of frequent patterns
• Typical methods: p-Cluster
• User-guided or constraint-based:
• Clustering by considering user-specified or application-
specific constraints
• Typical methods: COD (obstacles), constrained clustering
• Link-based clustering:
• Objects are often linked together in various ways
• Massive links can be used to cluster objects: SimRank,
LinkClus
Major Clustering Approaches 2
522
4.2.1 Partitioning Methods
523
• Partitioning method: Partitioning a database D of n objects
into a set of k clusters, such that the sum of squared
distances is minimized (where ci is the centroid or medoid of
cluster Ci)
• Given k, find a partition of k clusters that optimizes the
chosen partitioning criterion
• Global optimal: exhaustively enumerate all partitions
• Heuristic methods: k-means and k-medoids algorithms
• k-means (MacQueen’67, Lloyd’57/’82): Each cluster is represented by
the center of the cluster
• k-medoids or PAM (Partition around medoids) (Kaufman &
Rousseeuw’87): Each cluster is represented by one of the objects in the
cluster
Partitioning Algorithms: Basic Concept
524
$$E = \sum_{i=1}^{k} \sum_{p \in C_i} d(p, c_i)^2$$
• Given k, the k-means algorithm is implemented in four
steps:
1. Partition objects into k nonempty subsets
2. Compute seed points as the centroids of the clusters
of the current partitioning (the centroid is the center,
i.e., mean point, of the cluster)
3. Assign each object to the cluster with the nearest seed
point
4. Go back to Step 2, stop when the assignment does not
change
The K-Means Clustering Method
525
An Example of K-Means Clustering
526
K=2
Arbitrarily
partition
objects into
k groups
Update the
cluster
centroids
Update the
cluster
centroids
Reassign objects
Loop if
needed
The initial data set
 Partition objects into k
nonempty subsets
 Repeat
 Compute centroid (i.e., mean
point) for each partition
 Assign each object to the
cluster of its nearest centroid
 Until no change
1. Choose the desired number of clusters k
2. Initialize the k cluster centers (centroids) randomly
3. Assign each data object to the nearest cluster. Closeness
between two objects is determined by their distance; the distance
used in the k-Means algorithm is the Euclidean distance (d)
• x = x1, x2, …, xn and y = y1, y2, …, yn, where n is the number of
attributes (columns) shared by the two records
4. Recompute the cluster centers with the current cluster
memberships. The center of a cluster is the mean of all data
objects in that cluster
5. Reassign every object using the new cluster centers. If the
cluster centers no longer change, the clustering process is
finished. Otherwise, return to step 3 until the centers no longer
change (are stable) or there is no significant decrease in the
SSE (Sum of Squared Errors)
Steps of the k-Means Algorithm
527
dEuclidean(x, y) = √( Σ (i=1..n) (xi − yi)² )
1. Set the number of clusters k = 2
2. Pick the initial centroids at random,
e.g., from the data shown: m1 = (1,1),
m2 = (2,1)
3. Assign each object to the cluster
whose centroid is closest in distance.
The result: cluster1 = {A,E,G},
cluster2 = {B,C,D,F,H}
The SSE value is:
Worked Example – Iteration 1
528
SSE = Σ (i=1..k) Σ (p ∈ Ci) d(p, mi)²
4. Compute the new centroid values
5. Reassign every object using the new cluster
centers.
The new SSE value:
Iteration 2
529
m1 = ((1+1+1)/3, (1+2+3)/3) = (1, 2)
m2 = ((3+4+5+4+2)/5, (3+3+3+2+1)/5) = (3.6, 2.4)
4. Cluster membership has changed:
cluster1 = {A,E,G,H}, cluster2 = {B,C,D,F},
so compute the new centroids again:
m1 = (1.25, 1.75) and m2 = (4, 2.75)
5. Reassign every object using the new
cluster centers
The new SSE value:
Iteration 3
530
• As the table shows, there is no
further change in the membership
of either cluster
• The final result:
cluster1 = {A,E,G,H} and
cluster2 = {B,C,D,F},
with SSE = 6.25 after 3 iterations
Final Result
531
• Run the experiment following the book by
Matthew North, Data Mining for the Masses,
2012, Chapter 6 k-Means Clustering, pp. 91-
103 (CoronaryHeartDisease.csv)
• Draw a chart and choose Scatter 3D Color to
visualize the clustering results
• Analyze what Sonia has done, and what
benefit k-Means clustering brings to her work
Exercise
532
• Measure performance using Cluster Distance
Performance to obtain the Davies-Bouldin Index (DBI)
• The lower the DBI value, the better the clusters we
have formed
Exercise
533
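The DBI used in this exercise can also be computed by hand. A minimal sketch with my own helper names, assuming Euclidean distance and the standard Davies-Bouldin definition:

```python
import math

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def davies_bouldin(clusters):
    # clusters: list of point lists; lower DBI = tighter, better-separated clusters
    cents = [tuple(sum(d) / len(c) for d in zip(*c)) for c in clusters]
    # S_i: average distance of cluster members to their own centroid
    scatter = [sum(euclidean(p, cents[i]) for p in c) / len(c)
               for i, c in enumerate(clusters)]
    k = len(clusters)
    # for each cluster, take the worst (largest) similarity ratio to any other
    worst = [max((scatter[i] + scatter[j]) / euclidean(cents[i], cents[j])
                 for j in range(k) if j != i)
             for i in range(k)]
    return sum(worst) / k
```

Shrinking within-cluster spread (or moving clusters apart) lowers the index, matching the "lower is better" rule above.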
• Cluster the IMFdata.csv dataset
(http://romisatriawahono.net/lecture/dm/dataset)
Exercise
534
• Strength:
• Efficient: O(tkn), where n is # objects, k is # clusters, and t is #
iterations. Normally, k, t << n.
• Comparing: PAM: O(k(n−k)²), CLARA: O(ks² + k(n−k))
• Comment: Often terminates at a local optimal
• Weakness
• Applicable only to objects in a continuous n-dimensional space
• Using the k-modes method for categorical data
• In comparison, k-medoids can be applied to a wide range of data
• Need to specify k, the number of clusters, in advance (there are ways to
determine the best k automatically; see Hastie et al., 2009)
• Sensitive to noisy data and outliers
• Not suitable to discover clusters with non-convex shapes
Comments on the K-Means Method
535
• Most of the variants of the k-means which differ in
• Selection of the initial k means
• Dissimilarity calculations
• Strategies to calculate cluster means
• Handling categorical data: k-modes
• Replacing means of clusters with modes
• Using new dissimilarity measures to deal with
categorical objects
• Using a frequency-based method to update modes of
clusters
• A mixture of categorical and numerical data: k-prototype
method
Variations of the K-Means Method
536
• The k-means algorithm is sensitive to outliers!
• Since an object with an extremely large value may substantially
distort the distribution of the data
• K-Medoids:
• Instead of taking the mean value of the object in a cluster as a
reference point, medoids can be used, which is the most centrally
located object in a cluster
What Is the Problem of the K-Means Method?
537
538
PAM: A Typical K-Medoids Algorithm
[Figure] PAM with K = 2: arbitrarily choose k objects as initial medoids;
assign each remaining object to the nearest medoid (total cost = 20);
randomly select a non-medoid object O_random and compute the total cost of
swapping (here 26); swap O and O_random if quality is improved; loop until
no change.
• K-Medoids Clustering: Find representative objects
(medoids) in clusters
• PAM (Partitioning Around Medoids, Kaufmann &
Rousseeuw 1987)
• Starts from an initial set of medoids and iteratively
replaces one of the medoids by one of the non-medoids
if it improves the total distance of the resulting
clustering
• PAM works effectively for small data sets, but does not
scale well for large data sets (due to the computational
complexity)
• Efficiency improvement on PAM
• CLARA (Kaufmann & Rousseeuw, 1990): PAM on samples
• CLARANS (Ng & Han, 1994): Randomized re-sampling
The K-Medoid Clustering Method
539
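PAM's swap loop can be sketched as follows — a naive, illustrative version (the `pam` and `total_cost` names are my own), which makes the O(k(n−k)²) cost per iteration visible:

```python
import math

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def total_cost(points, medoids):
    # cost = sum over all points of the distance to the nearest medoid
    return sum(min(euclidean(p, m) for m in medoids) for p in points)

def pam(points, k):
    medoids = list(points[:k])                 # arbitrary initial medoids
    best = total_cost(points, medoids)
    improved = True
    while improved:                            # repeat until no swap helps
        improved = False
        for i in range(k):
            for o in points:
                if o in medoids:
                    continue
                # try replacing medoid i with the non-medoid object o
                cand = medoids[:i] + [o] + medoids[i + 1:]
                c = total_cost(points, cand)
                if c < best:                   # keep the swap if quality improves
                    medoids, best, improved = cand, c, True
    return medoids, best
```

Because medoids are always actual data objects, an extreme outlier cannot drag a cluster representative away the way it distorts a k-means centroid.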
4.2.2 Hierarchical Methods
540
• Use distance matrix as clustering criteria
• This method does not require the number of clusters k as an
input, but needs a termination condition
Hierarchical Clustering
541
[Figure] Steps 0–4 on objects a, b, c, d, e: agglomerative clustering
(AGNES) merges a+b and d+e, then c+d+e, then all five into one cluster;
divisive clustering (DIANA) runs the same sequence in reverse.
• Introduced in Kaufmann and Rousseeuw (1990)
• Implemented in statistical packages, e.g., Splus
• Use the single-link method and the dissimilarity matrix
• Merge nodes that have the least dissimilarity
• Go on in a non-descending fashion
• Eventually all nodes belong to the same cluster
AGNES (Agglomerative Nesting)
542
Dendrogram: Shows How Clusters are Merged
543
Decompose data objects into several levels of nested partitionings (a tree of
clusters), called a dendrogram
A clustering of the data objects is obtained by cutting the dendrogram at the
desired level; each connected component then forms a cluster
• Introduced in Kaufmann and Rousseeuw (1990)
• Implemented in statistical analysis packages, e.g.,
Splus
• Inverse order of AGNES
• Eventually each node forms a cluster on its own
DIANA (Divisive Analysis)
544
• Single link: smallest distance between an element in one cluster and an
element in the other, i.e., dist(Ki, Kj) = min(tip, tjq)
• Complete link: largest distance between an element in one cluster and an
element in the other, i.e., dist(Ki, Kj) = max(tip, tjq)
• Average: avg distance between an element in one cluster and an element
in the other, i.e., dist(Ki, Kj) = avg(tip, tjq)
• Centroid: distance between the centroids of two clusters, i.e., dist(Ki, Kj) =
dist(Ci, Cj)
• Medoid: distance between the medoids of two clusters, i.e., dist(Ki, Kj) =
dist(Mi, Mj)
• Medoid: a chosen, centrally located object in the cluster
Distance between Clusters
545
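The first four inter-cluster distances above are straightforward to express in code. These are illustrative helpers (my own names), assuming Euclidean distance between points:

```python
import math

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def single_link(K1, K2):      # smallest pairwise distance between the clusters
    return min(euclidean(p, q) for p in K1 for q in K2)

def complete_link(K1, K2):    # largest pairwise distance between the clusters
    return max(euclidean(p, q) for p in K1 for q in K2)

def average_link(K1, K2):     # mean of all pairwise distances
    return sum(euclidean(p, q) for p in K1 for q in K2) / (len(K1) * len(K2))

def centroid_link(K1, K2):    # distance between the two cluster centroids
    c = lambda K: tuple(sum(d) / len(K) for d in zip(*K))
    return euclidean(c(K1), c(K2))
```

The medoid variant is the same as `centroid_link` with each centroid replaced by the cluster's chosen central object.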
X X
• Centroid: the “middle” of a
cluster
• Radius: square root of average
distance from any point of the
cluster to its centroid
• Diameter: square root of
average mean squared distance
between all pairs of points in the
cluster
Centroid, Radius and Diameter of a Cluster
(for numerical data sets)
546
Centroid: Cm = ( Σ (i=1..N) t_ip ) / N
Radius: Rm = √( Σ (i=1..N) (t_ip − Cm)² / N )
Diameter: Dm = √( Σ (i=1..N) Σ (j=1..N, j≠i) (t_ip − t_jq)² / (N(N−1)) )
4.2.3 Density-Based Methods
547
• Clustering based on density (local cluster criterion), such as
density-connected points
• Major features:
• Discover clusters of arbitrary shape
• Handle noise
• One scan
• Need density parameters as termination condition
• Several interesting studies:
• DBSCAN: Ester, et al. (KDD’96)
• OPTICS: Ankerst, et al (SIGMOD’99).
• DENCLUE: Hinneburg & D. Keim (KDD’98)
• CLIQUE: Agrawal, et al. (SIGMOD’98) (more grid-based)
Density-Based Clustering Methods
548
• Two parameters:
• Eps: Maximum radius of the neighbourhood
• MinPts: Minimum number of points in an Eps-
neighbourhood of that point
• NEps(q): {p belongs to D | dist(p,q) ≤ Eps}
• Directly density-reachable: A point p is directly density-
reachable from a point q w.r.t. Eps, MinPts if
• p belongs to NEps(q)
• core point condition:
|NEps (q)| ≥ MinPts
Density-Based Clustering: Basic Concepts
549
MinPts = 5
Eps = 1 cm
p
q
• Density-reachable:
• A point p is density-reachable from a
point q w.r.t. Eps, MinPts if there is a
chain of points p1, …, pn, p1 = q, pn =
p such that pi+1 is directly density-
reachable from pi
• Density-connected
• A point p is density-connected to a
point q w.r.t. Eps, MinPts if there is a
point o such that both, p and q are
density-reachable from o w.r.t. Eps
and MinPts
Density-Reachable and Density-Connected
550
p
q
p1
p q
o
• Relies on a density-based notion of cluster: A
cluster is defined as a maximal set of density-
connected points
• Discovers clusters of arbitrary shape in spatial
databases with noise
DBSCAN: Density-Based Spatial Clustering of
Applications with Noise
551
Core
Border
Outlier
Eps = 1cm
MinPts = 5
1. Arbitrary select a point p
2. Retrieve all points density-reachable from p w.r.t. Eps and
MinPts
3. If p is a core point, a cluster is formed
4. If p is a border point, no points are density-reachable
from p and DBSCAN visits the next point of the database
5. Continue the process until all of the points have been
processed
If a spatial index is used, the computational complexity of DBSCAN is
O(n log n), where n is the number of database objects. Otherwise, the
complexity is O(n²)
DBSCAN: The Algorithm
552
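The five steps above translate directly to code. This is a compact O(n²) sketch without a spatial index; the function and variable names are my own:

```python
import math

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def dbscan(points, eps, min_pts):
    # label: None = unvisited, -1 = noise, >= 0 = cluster id
    labels = {p: None for p in points}

    def neighbors(q):                       # Eps-neighbourhood of q
        return [p for p in points if euclidean(p, q) <= eps]

    cid = 0
    for p in points:                        # step 1: pick an arbitrary point
        if labels[p] is not None:
            continue
        N = neighbors(p)
        if len(N) < min_pts:                # step 4: not core -> tentatively noise
            labels[p] = -1
            continue
        labels[p] = cid                     # step 3: core point starts a cluster
        seeds = list(N)
        while seeds:                        # step 2: collect density-reachable points
            q = seeds.pop()
            if labels[q] == -1:             # border point previously marked noise
                labels[q] = cid
            if labels[q] is not None:
                continue
            labels[q] = cid
            Nq = neighbors(q)
            if len(Nq) >= min_pts:          # q is also a core point: keep expanding
                seeds.extend(Nq)
        cid += 1                            # step 5: continue with the next point
    return labels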
DBSCAN: Sensitive to Parameters
553
http://webdocs.cs.ualberta.ca/~yaling/Cluster/Applet/Code/Cluster.html
• OPTICS: Ordering Points To Identify the Clustering Structure
• Ankerst, Breunig, Kriegel, and Sander (SIGMOD’99)
• Produces a special order of the database wrt its density-
based clustering structure
• This cluster-ordering contains info equiv to the density-
based clusterings corresponding to a broad range of
parameter settings
• Good for both automatic and interactive cluster analysis,
including finding intrinsic clustering structure
• Can be represented graphically or using visualization
techniques
OPTICS: A Cluster-Ordering Method (1999)
554
• Index-based: k = # of dimensions, N = # of points
• Complexity: O(N log N)
• Core Distance of an object p: the smallest value ε′ such that
the ε′-neighborhood of p has at least MinPts objects.
Let Nε(p) be the ε-neighborhood of p, where ε is a distance value:
Core-distance(ε, MinPts)(p) = Undefined if |Nε(p)| < MinPts,
otherwise MinPts-distance(p)
• Reachability Distance of object p from core object q: the
minimum radius value that makes p density-reachable from q:
Reachability-distance(ε, MinPts)(p, q) = Undefined if q is not a core object,
otherwise max(core-distance(q), distance(q, p))
OPTICS: Some Extension from DBSCAN
555
Core Distance & Reachability Distance
556
557
[Figure] Reachability-distance plotted against the cluster order of the
objects; valleys correspond to clusters, "undefined" values to noise
Density-Based Clustering: OPTICS & Applications:
http://www.dbs.informatik.uni-muenchen.de/Forschung/KDD/Clustering/OPTICS/Demo
558
4.2.4 Grid-Based Methods
559
• Using multi-resolution grid data structure
• Several interesting methods
• STING (a STatistical INformation Grid approach) by
Wang, Yang and Muntz (1997)
• WaveCluster by Sheikholeslami, Chatterjee, and Zhang
(VLDB’98)
• A multi-resolution clustering approach using
wavelet method
• CLIQUE: Agrawal, et al. (SIGMOD’98)
• Both grid-based and subspace clustering
Grid-Based Clustering Method
560
• Wang, Yang and Muntz (VLDB’97)
• The spatial area is divided into rectangular cells
• There are several levels of cells corresponding to
different levels of resolution
STING: A Statistical Information Grid
Approach
561
• Each cell at a high level is partitioned into a number of
smaller cells in the next lower level
• Statistical info of each cell is calculated and stored
beforehand and is used to answer queries
• Parameters of higher level cells can be easily calculated
from parameters of lower level cell
• count, mean, standard deviation (s), min, max
• type of distribution—normal, uniform, etc.
• Use a top-down approach to answer spatial data queries
• Start from a pre-selected layer—typically with a small
number of cells
• For each cell in the current level compute the confidence
interval
The STING Clustering Method
562
• Remove the irrelevant cells from further consideration
• When finish examining the current layer, proceed to the
next lower level
• Repeat this process until the bottom layer is reached
• Advantages:
• Query-independent, easy to parallelize, incremental
update
• O(K), where K is the number of grid cells at the lowest
level
• Disadvantages:
• All the cluster boundaries are either horizontal or
vertical, and no diagonal boundary is detected
STING Algorithm and Its Analysis
563
• Agrawal, Gehrke, Gunopulos, Raghavan (SIGMOD’98)
• Automatically identifying subspaces of a high
dimensional data space that allow better clustering
than original space
• CLIQUE can be considered as both density-based and
grid-based
• It partitions each dimension into the same number of equal
length interval
• It partitions an m-dimensional data space into non-
overlapping rectangular units
• A unit is dense if the fraction of total data points contained in
the unit exceeds the input model parameter
• A cluster is a maximal set of connected dense units within a
subspace
CLIQUE (Clustering In QUEst)
564
1. Partition the data space and find the number of
points that lie inside each cell of the partition.
2. Identify the subspaces that contain clusters using
the Apriori principle
3. Identify clusters
1. Determine dense units in all subspaces of interests
2. Determine connected dense units in all subspaces of
interests.
4. Generate minimal description for the clusters
1. Determine maximal regions that cover a cluster of
connected dense units for each cluster
2. Determination of minimal cover for each cluster
CLIQUE: The Major Steps
565
566
[Figure] Dense units (τ = 3) found in the (age, salary (10,000)) subspace
and in the (age, vacation (week)) subspace; intersecting them identifies a
candidate cluster in the combined (age, salary, vacation) space
• Strength
• automatically finds subspaces of the highest
dimensionality such that high density clusters exist in
those subspaces
• insensitive to the order of records in input and does not
presume some canonical data distribution
• scales linearly with the size of input and has good
scalability as the number of dimensions in the data
increases
• Weakness
• The accuracy of the clustering result may be degraded at
the expense of simplicity of the method
Strength and Weakness of CLIQUE
567
4.3 Algoritma Asosiasi
4.3.1 Frequent Itemset Mining Methods
4.3.2 Pattern Evaluation Methods
568
• Frequent pattern: a pattern (a set of items,
subsequences, substructures, etc.) that occurs
frequently in a data set
• First proposed by Agrawal, Imielinski, and Swami
[AIS93] in the context of frequent itemsets and
association rule mining
• Motivation: Finding inherent regularities in data
• What products were often purchased together?— Beer and
diapers?!
• What are the subsequent purchases after buying a PC?
• What kinds of DNA are sensitive to this new drug?
• Can we automatically classify web documents?
• Applications
• Basket data analysis, cross-marketing, catalog design, sale
campaign analysis, Web log (click stream) analysis, and DNA
sequence analysis.
What Is Frequent Pattern Analysis?
569
• Freq. pattern: An intrinsic and important property of
datasets
• Foundation for many essential data mining tasks
• Association, correlation, and causality analysis
• Sequential, structural (e.g., sub-graph) patterns
• Pattern analysis in spatiotemporal, multimedia, time-
series, and stream data
• Classification: discriminative, frequent pattern analysis
• Cluster analysis: frequent pattern-based clustering
• Data warehousing: iceberg cube and cube-gradient
• Semantic data compression: fascicles
• Broad applications
Why Is Freq. Pattern Mining Important?
570
• itemset: A set of one or more
items
• k-itemset X = {x1, …, xk}
• (absolute) support, or, support
count of X: Frequency or
occurrence of an itemset X
• (relative) support, s, is the fraction
of transactions that contains X
(i.e., the probability that a
transaction contains X)
• An itemset X is frequent if X’s
support is no less than a minsup
threshold
Basic Concepts: Frequent Patterns
571
Customer
buys diaper
Customer
buys both
Customer
buys beer
Tid Items bought
10 Beer, Nuts, Diaper
20 Beer, Coffee, Diaper
30 Beer, Diaper, Eggs
40 Nuts, Eggs, Milk
50 Nuts, Coffee, Diaper, Eggs, Milk
• Find all the rules X ⇒ Y with
minimum support and confidence
• support, s, probability that a
transaction contains X ∪ Y
• confidence, c, conditional
probability that a transaction
having X also contains Y
Let minsup = 50%, minconf = 50%
Freq. Pat.: Beer:3, Nuts:3, Diaper:4, Eggs:3, {Beer,
Diaper}:3
Basic Concepts: Association Rules
572
Tid Items bought
10 Beer, Nuts, Diaper
20 Beer, Coffee, Diaper
30 Beer, Diaper, Eggs
40 Nuts, Eggs, Milk
50 Nuts, Coffee, Diaper, Eggs, Milk
• Association rules: (many more!)
• Beer ⇒ Diaper (60%, 100%)
• Diaper ⇒ Beer (60%, 75%)
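The support and confidence figures above can be verified directly. A small sketch over the slide's five transactions (helper names are mine):

```python
transactions = [
    {"Beer", "Nuts", "Diaper"},                     # Tid 10
    {"Beer", "Coffee", "Diaper"},                   # Tid 20
    {"Beer", "Diaper", "Eggs"},                     # Tid 30
    {"Nuts", "Eggs", "Milk"},                       # Tid 40
    {"Nuts", "Coffee", "Diaper", "Eggs", "Milk"},   # Tid 50
]

def support(itemset):
    # fraction of transactions containing every item of the itemset
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(X, Y):
    # conditional probability that a transaction with X also contains Y
    return support(X | Y) / support(X)
```

Beer ⇒ Diaper comes out at (60%, 100%) and Diaper ⇒ Beer at (60%, 75%), matching the slide.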
• A long pattern contains a combinatorial number of sub-
patterns, e.g., {a1, …, a100} contains C(100,1) + C(100,2) + … +
C(100,100) = 2^100 − 1 ≈ 1.27×10^30 sub-patterns!
• Solution: Mine closed patterns and max-patterns instead
• An itemset X is closed if X is frequent and there exists no
super-pattern Y ⊃ X with the same support as X (proposed
by Pasquier, et al. @ ICDT’99)
• An itemset X is a max-pattern if X is frequent and there
exists no frequent super-pattern Y ⊃ X (proposed by Bayardo
@ SIGMOD’98)
• Closed pattern is a lossless compression of freq. patterns
• Reducing the # of patterns and rules
Closed Patterns and Max-Patterns
573
• Exercise. DB = {<a1, …, a100>, < a1, …, a50>}
• Min_sup = 1.
• What is the set of closed itemset?
• <a1, …, a100>: 1
• < a1, …, a50>: 2
• What is the set of max-pattern?
• <a1, …, a100>: 1
• What is the set of all patterns?
• Every non-empty sub-itemset of <a1, …, a100>: 2^100 − 1 patterns — far too many to enumerate!
Closed Patterns and Max-Patterns
574
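The exercise generalizes. On a small analogue of the DB above (two transactions, Min_sup = 1; variable names are my own), the closed and max patterns can be enumerated by brute force:

```python
from itertools import combinations

db = [frozenset("abcd"), frozenset("ab")]   # small analogue of the slide's DB
min_sup = 1
items = sorted(set().union(*db))

def sup(s):
    # absolute support count of itemset s
    return sum(s <= t for t in db)

# all frequent itemsets (every non-empty subset of "abcd" here)
frequent = [frozenset(c) for n in range(1, len(items) + 1)
            for c in combinations(items, n) if sup(frozenset(c)) >= min_sup]

# closed: no proper superset with the same support
closed = [x for x in frequent
          if not any(x < y and sup(y) == sup(x) for y in frequent)]
# maximal: no frequent proper superset at all
maximal = [x for x in frequent if not any(x < y for y in frequent)]
```

As on the slide, the closed itemsets are {a,b} (support 2) and {a,b,c,d} (support 1), while only {a,b,c,d} is maximal — closed patterns keep the support information that max-patterns discard.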
• How many itemsets are potentially to be generated in
the worst case?
• The number of frequent itemsets to be generated is sensitive
to the minsup threshold
• When minsup is low, there exist potentially an exponential
number of frequent itemsets
• The worst case: M^N, where M: # distinct items, and N: max
length of transactions
• The worst case complexity vs. the expected probability
Ex. Suppose Walmart has 10^4 kinds of products
• The chance to pick up one product: 10^-4
• The chance to pick up a particular set of 10 products: ~10^-40
• What is the chance this particular set of 10 products to be
frequent 10^3 times in 10^9 transactions?
Computational Complexity of Frequent
Itemset Mining
575
4.3.1 Frequent Itemset Mining Methods
576
• Apriori: A Candidate Generation-and-Test Approach
• Improving the Efficiency of Apriori
• FPGrowth: A Frequent Pattern-Growth Approach
• ECLAT: Frequent Pattern Mining with Vertical Data
Format
Scalable Frequent Itemset Mining Methods
577
• The downward closure property of frequent
patterns
• Any subset of a frequent itemset must be frequent
• If {beer, diaper, nuts} is frequent, so is {beer, diaper}
• i.e., every transaction having {beer, diaper, nuts} also
contains {beer, diaper}
• Scalable mining methods: Three major approaches
• Apriori (Agrawal & Srikant@VLDB’94)
• Freq. pattern growth (FPgrowth—Han, Pei & Yin
@SIGMOD’00)
• Vertical data format approach (Charm—Zaki & Hsiao
@SDM’02)
The Downward Closure Property and
Scalable Mining Methods
578
• Apriori pruning principle: If there is any itemset
which is infrequent, its superset should not be
generated/tested! (Agrawal & Srikant @VLDB’94,
Mannila, et al. @ KDD’ 94)
• Method:
1. Initially, scan DB once to get frequent 1-itemset
2. Generate length (k+1) candidate itemsets from length
k frequent itemsets
3. Test the candidates against DB
4. Terminate when no frequent or candidate set can be
generated
Apriori: A Candidate Generation & Test
Approach
579
The Apriori Algorithm—An Example
580
Database TDB (Supmin = 2):
Tid Items
10 A, C, D
20 B, C, E
30 A, B, C, E
40 B, E

1st scan → C1: {A}:2, {B}:3, {C}:3, {D}:1, {E}:3
L1: {A}:2, {B}:3, {C}:3, {E}:3
C2: {A,B}, {A,C}, {A,E}, {B,C}, {B,E}, {C,E}
2nd scan → C2 counts: {A,B}:1, {A,C}:2, {A,E}:1, {B,C}:2, {B,E}:3, {C,E}:2
L2: {A,C}:2, {B,C}:2, {B,E}:3, {C,E}:2
C3: {B,C,E}
3rd scan → L3: {B,C,E}:2
Ck: Candidate itemset of size k
Lk : frequent itemset of size k
L1 = {frequent items};
for (k = 1; Lk != ∅; k++) do begin
    Ck+1 = candidates generated from Lk;
    for each transaction t in database do
        increment the count of all candidates in Ck+1 that are
        contained in t;
    Lk+1 = candidates in Ck+1 with min_support;
end
return ∪k Lk;
The Apriori Algorithm (Pseudo-Code)
581
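A runnable version of this pseudo-code, using absolute support counts (the `apriori` function name and internal structure are my own, illustrative choices):

```python
from itertools import combinations

def apriori(db, min_sup):
    # db: list of transactions (sets of items); min_sup: absolute support count
    db = [frozenset(t) for t in db]

    def count(cands):
        # test candidates against the DB, keep those meeting min_sup
        freq = {}
        for t in db:
            for c in cands:
                if c <= t:
                    freq[c] = freq.get(c, 0) + 1
        return {c: n for c, n in freq.items() if n >= min_sup}

    # L1: frequent 1-itemsets from one DB scan
    items = {frozenset([i]) for t in db for i in t}
    L = count(items)
    result = dict(L)
    k = 1
    while L:
        # generate length-(k+1) candidates by joining Lk with itself,
        # then Apriori-prune those with an infrequent k-subset
        prev = list(L)
        cands = {a | b for a in prev for b in prev if len(a | b) == k + 1}
        cands = {c for c in cands
                 if all(frozenset(s) in L for s in combinations(c, k))}
        L = count(cands)
        result.update(L)
        k += 1
    return result
```

On the TDB example above with min_sup = 2 this yields exactly the nine frequent itemsets L1 ∪ L2 ∪ L3.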
• How to generate candidates?
• Step 1: self-joining Lk
• Step 2: pruning
• Example of Candidate-generation
• L3={abc, abd, acd, ace, bcd}
• Self-joining: L3*L3
• abcd from abc and abd
• acde from acd and ace
• Pruning:
• acde is removed because ade is not in L3
• C4 = {abcd}
Implementation of Apriori
582
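The self-join and prune steps on this slide can be reproduced directly on L3 = {abc, abd, acd, ace, bcd} (sorted-tuple representation; `gen_candidates` is my own helper name):

```python
from itertools import combinations

def gen_candidates(Lk):
    # Lk: list of sorted k-tuples; returns C(k+1) via self-join + pruning
    k = len(Lk[0])
    Ls = set(Lk)
    joined = set()
    for p in Lk:
        for q in Lk:
            # step 1 (self-join): first k-1 items agree, p's last < q's last
            if p[:-1] == q[:-1] and p[-1] < q[-1]:
                joined.add(p + (q[-1],))
    # step 2 (prune): drop candidates having any infrequent k-subset
    return sorted(c for c in joined
                  if all(s in Ls for s in combinations(c, k)))

L3 = [("a", "b", "c"), ("a", "b", "d"), ("a", "c", "d"),
      ("a", "c", "e"), ("b", "c", "d")]
```

The join produces abcd (from abc, abd) and acde (from acd, ace); pruning removes acde because ade is not in L3, leaving C4 = {abcd} as on the slide.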
• Why counting supports of candidates a problem?
• The total number of candidates can be very huge
• One transaction may contain many candidates
• Method:
• Candidate itemsets are stored in a hash-tree
• Leaf node of hash-tree contains a list of itemsets and
counts
• Interior node contains a hash table
• Subset function: finds all the candidates contained in a
transaction
How to Count Supports of Candidates?
583
Counting Supports of Candidates Using Hash Tree
584
[Figure] Hash tree over the candidate 3-itemsets {1 2 4}, {1 2 5}, {1 3 6},
{1 4 5}, {1 5 9}, {2 3 4}, {3 4 5}, {3 5 6}, {3 5 7}, {3 6 7}, {3 6 8},
{4 5 7}, {4 5 8}, {5 6 7}, {6 8 9}: interior nodes hash items into the
branches 1,4,7 / 2,5,8 / 3,6,9, and the subset function matches transaction
1 2 3 5 6 by recursively splitting it as 1 + 2 3 5 6, 1 2 + 3 5 6, and
1 3 + 5 6
• SQL Implementation of candidate generation
• Suppose the items in Lk-1 are listed in an order
• Step 1: self-joining Lk-1
insert into Ck
select p.item1, p.item2, …, p.itemk-1, q.itemk-1
from Lk-1 p, Lk-1 q
where p.item1=q.item1, …, p.itemk-2=q.itemk-2, p.itemk-1 < q.itemk-1
• Step 2: pruning
forall itemsets c in Ck do
forall (k-1)-subsets s of c do
if (s is not in Lk-1) then delete c from Ck
• Use object-relational extensions like UDFs, BLOBs, and Table
functions for efficient implementation
(S. Sarawagi, S. Thomas, and R. Agrawal. Integrating association rule mining with
relational database systems: Alternatives and implications. SIGMOD’98)
Candidate Generation: An SQL Implementation
585
• Bottlenecks of the Apriori approach
• Breadth-first (i.e., level-wise) search
• Candidate generation and test
• Often generates a huge number of candidates
• The FPGrowth Approach (J. Han, J. Pei, and Y. Yin, SIGMOD’ 00)
• Depth-first search
• Avoid explicit candidate generation
• Major philosophy: Grow long patterns from short ones using
local frequent items only
• “abc” is a frequent pattern
• Get all transactions having “abc”, i.e., project DB on abc: DB|abc
• “d” is a local frequent item in DB|abc ⇒ abcd is a frequent pattern
Pattern-Growth Approach: Mining Frequent
Patterns Without Candidate Generation
586
Construct FP-tree from a Transaction
Database
587
[Figure] Resulting FP-tree (min_support = 3): root {} with branches
f:4 → c:3 → a:3 → (m:2 → p:2 and b:1 → m:1), f:4 → b:1, and
c:1 → b:1 → p:1; a header table (f:4, c:4, a:3, b:3, m:3, p:3) links all
nodes of each item
TID Items bought (ordered) frequent items
100 {f, a, c, d, g, i, m, p} {f, c, a, m, p}
200 {a, b, c, f, l, m, o} {f, c, a, b, m}
300 {b, f, h, j, o, w} {f, b}
400 {b, c, k, s, p} {c, b, p}
500 {a, f, c, e, l, p, m, n} {f, c, a, m, p}
1. Scan DB once, find frequent 1-
itemset (single item pattern)
2. Sort frequent items in
frequency descending order, f-
list
3. Scan DB again, construct FP-
tree
F-list = f-c-a-b-m-p
• Frequent patterns can be partitioned into subsets
according to f-list
• F-list = f-c-a-b-m-p
• Patterns containing p
• Patterns having m but no p
• …
• Patterns having c but no a nor b, m, p
• Pattern f
• Completeness and non-redundency
Partition Patterns and Databases
588
• Starting at the frequent item header table in the FP-tree
• Traverse the FP-tree by following the link of each frequent item p
• Accumulate all of transformed prefix paths of item p to form p’s
conditional pattern base
Find Patterns Having P From P-conditional
Database
589
Conditional pattern bases
item cond. pattern base
c f:3
a fc:3
b fca:1, f:1, c:1
m fca:2, fcab:1
p fcam:2, cb:1
• For each pattern-base
• Accumulate the count for each item in the base
• Construct the FP-tree for the frequent items of the
pattern base
From Conditional Pattern-bases to
Conditional FP-trees
590
m-conditional pattern base: fca:2, fcab:1
m-conditional FP-tree: {} → f:3 → c:3 → a:3
All frequent patterns relating to m: m, fm, cm, am, fcm, fam, cam, fcam
Recursion: Mining Each Conditional FP-tree
591
m-conditional FP-tree: {} → f:3 → c:3 → a:3
Cond. pattern base of “am”: (fc:3) → am-conditional FP-tree: {} → f:3 → c:3
Cond. pattern base of “cm”: (f:3) → cm-conditional FP-tree: {} → f:3
Cond. pattern base of “cam”: (f:3) → cam-conditional FP-tree: {} → f:3
• Suppose a (conditional) FP-tree T has a shared single prefix-path P
• Mining can be decomposed into two parts
• Reduction of the single prefix path into one node
• Concatenation of the mining results of the two parts
A Special Case: Single Prefix Path in FP-tree
592
[Figure] An FP-tree whose top is a single prefix path {} → a1:n1 → a2:n2 →
a3:n3 is split into that prefix path plus the remaining multi-branch
subtree r1 (rooted at b1:m1 with children C1:k1, C2:k2, C3:k3); the mining
results of the two parts are then concatenated
• Completeness
• Preserve complete information for frequent pattern
mining
• Never break a long pattern of any transaction
• Compactness
• Reduce irrelevant info—infrequent items are gone
• Items in frequency descending order: the more
frequently occurring, the more likely to be shared
• Never be larger than the original database (not count
node-links and the count field)
Benefits of the FP-tree Structure
593
• Idea: Frequent pattern growth
• Recursively grow frequent patterns by pattern and
database partition
• Method
1. For each frequent item, construct its conditional
pattern-base, and then its conditional FP-tree
2. Repeat the process on each newly created conditional
FP-tree
3. Until the resulting FP-tree is empty, or it contains only
one path—single path will generate all the
combinations of its sub-paths, each of which is a
frequent pattern
The Frequent Pattern Growth Mining
Method
594
• What about if FP-tree cannot fit in memory? DB projection
• First partition a database into a set of projected DBs
• Then construct and mine FP-tree for each projected DB
• Parallel projection vs. partition projection techniques
• Parallel projection
• Project the DB in parallel for each frequent item
• Parallel projection is space costly
• All the partitions can be processed in parallel
• Partition projection
• Partition the DB based on the ordered frequent items
• Passing the unprocessed parts to the subsequent partitions
Scaling FP-growth by Database Projection
595
• Parallel projection needs a lot of disk space
• Partition projection saves it
Partition-Based Projection
596
Tran. DB: fcamp, fcabm, fb, cbp, fcamp
p-proj DB: fcam, cb, fcam
m-proj DB: fcab, fca, fca → am-proj DB: fc, fc, fc; cm-proj DB: f, f, f; …
b-proj DB: f, cb, …
a-proj DB: fc, …
c-proj DB: f, …
f-proj DB: …
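The projection idea can be sketched without the tree: recursively mine the projected database of each frequent item, growing the prefix as you descend. This illustrates the divide-and-conquer (not the FP-tree compression itself); names are mine:

```python
def pattern_growth(db, min_sup, prefix=frozenset()):
    # db: list of itemsets; returns {frequent itemset: absolute support count}
    counts = {}
    for t in db:
        for item in t:
            counts[item] = counts.get(item, 0) + 1
    result = {}
    for item, n in counts.items():
        if n < min_sup:
            continue                      # infrequent locally -> prune branch
        pat = prefix | {item}
        result[pat] = n
        # project the DB on `item`: keep only transactions containing it,
        # restricted to items ordered after `item` (avoids duplicate patterns)
        proj = [frozenset(i for i in t if i > item) for t in db if item in t]
        result.update(pattern_growth(proj, min_sup, pat))
    return result
```

On the earlier TDB example (min_sup = 2) it finds the same nine frequent itemsets as Apriori, but with no candidate generation — only counting local frequent items and recursing on ever-smaller projected DBs.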
FP-Growth vs. Apriori:
Scalability With the Support Threshold
597
[Chart] Run time (sec., 0–100) vs. support threshold (%, 0–3) on data set
T25I20D10K: D1 FP-growth runtime vs. D1 Apriori runtime
FP-Growth vs. Tree-Projection:
Scalability with the Support Threshold
598
[Chart] Runtime (sec., 0–140) vs. support threshold (%, 0–2) on data set
T25I20D100K: D2 FP-growth vs. D2 TreeProjection
• Divide-and-conquer:
• Decompose both the mining task and DB according to
the frequent patterns obtained so far
• Lead to focused search of smaller databases
• Other factors
• No candidate generation, no candidate test
• Compressed database: FP-tree structure
• No repeated scan of entire database
• Basic ops: counting local freq items and building sub FP-
tree, no pattern search and matching
• A good open-source implementation and
refinement of FPGrowth
• FPGrowth+ (Grahne and J. Zhu, FIMI'03)
Advantages of the Pattern Growth Approach
599
• AFOPT (Liu, et al. @ KDD’03)
• A “push-right” method for mining condensed frequent
pattern (CFP) tree
• Carpenter (Pan, et al. @ KDD’03)
• Mine data sets with small rows but numerous columns
• Construct a row-enumeration tree for efficient mining
• FPgrowth+ (Grahne and Zhu, FIMI’03)
• Efficiently Using Prefix-Trees in Mining Frequent Itemsets,
Proc. ICDM'03 Int. Workshop on Frequent Itemset Mining
Implementations (FIMI'03), Melbourne, FL, Nov. 2003
• TD-Close (Liu, et al, SDM’06)
Further Improvements of Mining Methods
600
• Mining closed frequent itemsets and max-patterns
• CLOSET (DMKD’00), FPclose, and FPMax (Grahne & Zhu, Fimi’03)
• Mining sequential patterns
• PrefixSpan (ICDE’01), CloSpan (SDM’03), BIDE (ICDE’04)
• Mining graph patterns
• gSpan (ICDM’02), CloseGraph (KDD’03)
• Constraint-based mining of frequent patterns
• Convertible constraints (ICDE’01), gPrune (PAKDD’03)
• Computing iceberg data cubes with complex measures
• H-tree, H-cubing, and Star-cubing (SIGMOD’01, VLDB’03)
• Pattern-growth-based Clustering
• MaPle (Pei, et al., ICDM’03)
• Pattern-Growth-Based Classification
• Mining frequent and discriminative patterns (Cheng, et al, ICDE’07)
Extension of Pattern Growth Mining Methodology
601
1. Prepare the dataset
2. Find the frequent itemsets (items that occur often)
3. Sort the dataset by priority
4. Build the FP-Tree from the sorted items
5. Generate the conditional pattern base
6. Generate the conditional FP-tree
7. Generate the frequent patterns
8. Compute the support
9. Compute the confidence
Steps of the FP-Growth Algorithm
602
1. Dataset Preparation
603
2. Finding the Frequent Itemsets
604
3. Sorting the Dataset by Priority
605
4. Building the FP-Tree
606
5. Generating the Conditional Pattern Base
607
6. Generating the Conditional FP-tree
608
7. Generating the Frequent Patterns
609
Frequent 2-Itemsets
610
8. Computing the Support of 2-Itemsets
611
9. Computing the Confidence of 2-Itemsets
612
4.3.2 Pattern Evaluation Methods
613
• play basketball  eat cereal [40%, 66.7%] is misleading
• The overall % of students eating cereal is 75% > 66.7%.
• play basketball  not eat cereal [20%, 33.3%] is more accurate,
although with lower support and confidence
• Measure of dependent/correlated events: lift
Interestingness Measure: Correlations (Lift)
614
lift = P(A ∪ B) / (P(A) × P(B))

Basketball Not basketball Sum (row)
Cereal 2000 1750 3750
Not cereal 1000 250 1250
Sum(col.) 3000 2000 5000

lift(B, C) = (2000/5000) / ((3000/5000) × (3750/5000)) = 0.89
lift(B, ¬C) = (1000/5000) / ((3000/5000) × (1250/5000)) = 1.33
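Both lift values on this slide can be reproduced directly from the raw counts in the contingency table:

```python
N = 5000  # total transactions in the contingency table

def lift(n_ab, n_a, n_b, n=N):
    """lift(A, B) = P(A ∪ B) / (P(A) · P(B)), computed from raw counts."""
    return (n_ab / n) / ((n_a / n) * (n_b / n))

print(round(lift(2000, 3000, 3750), 2))  # basketball & cereal      -> 0.89
print(round(lift(1000, 3000, 1250), 2))  # basketball & not-cereal  -> 1.33
```

A lift below 1 (0.89) signals negative correlation, above 1 (1.33) positive correlation, which is exactly why the 66.7%-confidence rule is misleading.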
• “Buy walnuts ⇒ buy milk
[1%, 80%]” is misleading
if 85% of customers buy
milk
• Support and confidence
are not good indicators of
correlation
• Over 20 interestingness
measures have been
proposed (see Tan,
Kumar, Srivastava
@KDD’02)
• Which are good ones?
Are Lift and χ² Good Measures of
Correlation?
615
Null-Invariant Measures
616
• Null-(transaction) invariance is crucial for correlation analysis
• Lift and χ² are not null-invariant
• 5 null-invariant measures
Comparison of Interestingness Measures
617
617
Milk No Milk Sum (row)
Coffee m, c ~m, c c
No Coffee m, ~c ~m, ~c ~c
Sum(col.) m ~m Σ
Null-transactions
w.r.t. m and c Null-invariant
Subtle: They disagree
Kulczynski
measure (1927)
Analysis of DBLP Coauthor Relationships
618
Advisor-advisee relation: Kulc: high,
coherence: low, cosine: middle
Recent DB conferences, removing balanced associations, low sup, etc.
Tianyi Wu, Yuguo Chen and Jiawei Han, “Association Mining in Large Databases: A
Re-Examination of Its Measures”, Proc. 2007 Int. Conf. Principles and Practice of
Knowledge Discovery in Databases (PKDD'07), Sept. 2007
• IR (Imbalance Ratio): measures the imbalance of two
itemsets A and B in rule implications
• Kulczynski and the Imbalance Ratio (IR) together present a
clear picture for all three datasets D4 through D6
• D4 is balanced & neutral
• D5 is imbalanced & neutral
• D6 is very imbalanced & neutral
Which Null-Invariant Measure Is Better?
619
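Kulczynski and IR are both computed from the same kind of 2×2 counts; applied here to the basketball/cereal contingency table used earlier in this section (the pairing of the two measures follows the Wu, Chen and Han re-examination):

```python
def kulc(n_ab, n_a, n_b):
    """Kulczynski: average of the conditional probabilities P(B|A) and P(A|B)."""
    return 0.5 * (n_ab / n_a + n_ab / n_b)

def imbalance_ratio(n_ab, n_a, n_b):
    """IR: how lopsided the two itemsets' supports are (0 = perfectly balanced)."""
    return abs(n_a - n_b) / (n_a + n_b - n_ab)

# basketball vs. cereal counts from the earlier contingency table
print(round(kulc(2000, 3000, 3750), 2))             # 0.5*(0.667 + 0.533) = 0.6
print(round(imbalance_ratio(2000, 3000, 3750), 2))  # 750/4750 ≈ 0.16
```

Note that neither function needs the total transaction count N; that is precisely what makes both measures null-(transaction) invariant, unlike lift.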
• Run the experiment following Matthew North,
Data Mining for the Masses, 2nd Edition, 2016,
Chapter 5 (Association Rules), pp. 85-97
• Analyze how data mining can help Roger, a
City Manager
Exercise
620
• Motivation:
• Roger is a city manager for a medium-sized, but steadily growing
city
• The city has limited resources, and like most municipalities,
there are more needs than there are resources
• He feels like the citizens in the community are fairly active in
various community organizations, and believes that he may be
able to get a number of groups to work together to meet some
of the needs in the community
• He knows there are churches, social clubs, hobby enthusiasts
and other types of groups in the community
• What he doesn’t know is if there are connections between the
groups that might enable natural collaborations between two or
more groups that could work together on projects around town
• Objectives:
• To find out if there are any existing associations between the
different types of groups in the area
1. Business Understanding
621
4.4 Estimation and Forecasting
Algorithms
4.4.1 Linear Regression
4.4.2 Time Series Forecasting
622
4.4.1 Linear Regression
623
1. Prepare the data
2. Identify the attributes and the label
3. Compute X², Y², XY and their totals
4. Compute a and b using the given equations
5. Build the simple linear regression model
Steps of the Linear Regression Algorithm
1. Data Preparation
625
Date
Average Room
Temperature (X)
Number of
Defects (Y)
1 24 10
2 22 5
3 21 6
4 20 3
5 22 6
6 19 4
7 20 5
8 23 9
9 24 11
10 25 13
Y = a + bX
where:
Y = dependent variable
X = independent variable
a = constant (intercept)
b = regression coefficient (slope); the change in the response
caused by the variable X

a = [(Σy)(Σx²) – (Σx)(Σxy)] / [n(Σx²) – (Σx)²]

b = [n(Σxy) – (Σx)(Σy)] / [n(Σx²) – (Σx)²]
2. Identify the Attributes and the Label
626
3. Compute X², Y², XY and Their Totals
627
Date
Average Room
Temperature (X)
Number of
Defects (Y) X² Y² XY
1 24 10 576 100 240
2 22 5 484 25 110
3 21 6 441 36 126
4 20 3 400 9 60
5 22 6 484 36 132
6 19 4 361 16 76
7 20 5 400 25 100
8 23 9 529 81 207
9 24 11 576 121 264
10 25 13 625 169 325
Total 220 72 4876 618 1640
• Computing regression coefficient a:
a = [(Σy)(Σx²) – (Σx)(Σxy)] / [n(Σx²) – (Σx)²]
a = [(72)(4876) – (220)(1640)] / [10(4876) – (220)²]
a = -27.02
• Computing regression coefficient b:
b = [n(Σxy) – (Σx)(Σy)] / [n(Σx²) – (Σx)²]
b = [10(1640) – (220)(72)] / [10(4876) – (220)²]
b = 1.56
4. Compute a and b Using the Given
Equations
628
Y = a + bX
Y = -27.02 + 1.56X
5. Build the Simple Linear Regression
Model
629
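The hand calculation above can be checked with a few lines of code; the data are the ten (temperature, defects) pairs from the table:

```python
x = [24, 22, 21, 20, 22, 19, 20, 23, 24, 25]  # average room temperature
y = [10,  5,  6,  3,  6,  4,  5,  9, 11, 13]  # number of defects

n = len(x)
sx, sy = sum(x), sum(y)
sxx = sum(v * v for v in x)
sxy = sum(p * q for p, q in zip(x, y))

# Same closed-form equations as on the slide
a = (sy * sxx - sx * sxy) / (n * sxx - sx ** 2)
b = (n * sxy - sx * sy) / (n * sxx - sx ** 2)

print(round(a, 2), round(b, 2))   # -27.02 1.56
print(round(a + b * 30, 2))       # prediction at 30°C with full precision: 19.64
```

Using the unrounded coefficients, the 30°C prediction is 19.64 rather than the 19.78 obtained from the rounded values -27.02 and 1.56; the small gap is purely rounding.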
1. Predict the number of production defects when the room
temperature is high (variable X), for example 30°C:
Y = -27.02 + 1.56X
Y = -27.02 + 1.56(30)
= 19.78
2. If the target number of production defects (variable Y) is
only 5 units, what room temperature is needed to hit that
target?
5 = -27.02 + 1.56X
1.56X = 5 + 27.02
X = 32.02/1.56
X = 20.52
So the predicted room temperature needed to reach the
production-defect target is about 20.52°C
Testing
630
7.1.2 CRISP-DM Case Study
Heating Oil Consumption – Estimation
(Matthew North, Data Mining for the Masses, 2012,
Chapter 8 Estimation, pp. 127-140)
Datasets: HeatingOil-Training.csv and HeatingOil-Scoring.csv
631
• Run the experiment on Heating Oil Consumption
following Matthew North, Data Mining for the
Masses, 2012, Chapter 8 Estimation, pp. 127-140
• Datasets: HeatingOil-Training.csv and HeatingOil-
Scoring.csv
Exercise
632
CRISP-DM
633
• Sarah, the regional sales manager, is back for more help
• Business is booming, her sales team is signing up thousands of new
clients, and she wants to be sure the company will be able to meet
this new level of demand, so she is now hoping we can help her do
some prediction as well
• She knows that there is some correlation between the attributes in
her data set (things like temperature, insulation, and occupant ages),
and she’s now wondering if she can use the previous data set to
predict heating oil usage for new customers
• You see, these new customers haven’t begun consuming heating oil
yet, there are a lot of them (42,650 to be exact), and she wants to
know how much oil she needs to expect to keep in stock in order to
meet these new customers’ demand
• Can she use data mining to examine household attributes and
known past consumption quantities to anticipate and meet her new
customers’ needs?
Context and Perspective
634
• Sarah’s new data mining objective is pretty clear: she
wants to anticipate demand for a consumable product
• We will use a linear regression model to help her with
her desired predictions
• She has data, 1,218 observations that give an attribute
profile for each home, along with those homes’ annual
heating oil consumption
• She wants to use this data set as training data to
predict the usage that 42,650 new clients will bring to
her company
• She knows that these new clients’ homes are similar in
nature to her existing client base, so the existing
customers’ usage behavior should serve as a solid
gauge for predicting future usage by new customers.
1. Business Understanding
635
We create a data set comprised of the following attributes:
• Insulation: This is a density rating, ranging from one to ten,
indicating the thickness of each home’s insulation. A home
with a density rating of one is poorly insulated, while a
home with a density of ten has excellent insulation
• Temperature: This is the average outdoor ambient
temperature at each home for the most recent year,
measured in degrees Fahrenheit
• Heating_Oil: This is the total number of units of heating oil
purchased by the owner of each home in the most recent
year
• Num_Occupants: This is the total number of occupants
living in each home
• Avg_Age: This is the average age of those occupants
• Home_Size: This is a rating, on a scale of one to eight, of the
home’s overall size. The higher the number, the larger the
home
2. Data Understanding
636
• A CSV data set for this chapter’s example is available for
download at the book’s companion web site
(https://sites.google.com/site/dataminingforthemasses/)
3. Data Preparation
637
3. Data Preparation
638
3. Data Preparation
639
4. Modeling
640
4. Modeling
641
5. Evaluation
642
5. Evaluation
643
6. Deployment
644
6. Deployment
645
6. Deployment
646
4.4.2 Time Series Forecasting
647
• Time series forecasting is one of the oldest known
predictive analytics techniques
• It has existed and been in widespread use even before the
term “predictive analytics” was ever coined
• Independent or predictor variables are not strictly
necessary for univariate time series forecasting, but are
strongly recommended for multivariate time series
• Time series forecasting methods:
1. Data Driven Method: There is no difference between a
predictor and a target. Techniques such as time series
averaging or smoothing are considered data-driven
approaches to time series forecasting
2. Model Driven Method: Similar to “conventional” predictive
models, which have independent and dependent variables,
but with a twist: the independent variable is now time
Time Series Forecasting
648
• There is no difference between a predictor and a
target
• The predictor is also the target variable
• Data Driven Methods:
• Naïve Forecast
• Simple Average
• Moving Average
• Weighted Moving Average
• Exponential Smoothing
• Holt’s Two-Parameter Exponential Smoothing
Data Driven Methods
649
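Two of the data-driven methods listed above, moving average and simple exponential smoothing, are short enough to sketch directly (the series is a made-up example):

```python
series = [3.0, 4.0, 5.0, 4.0, 6.0, 7.0, 6.0, 8.0]  # assumed toy series

def moving_average_forecast(xs, w):
    """Forecast the next value as the mean of the last w observations."""
    return sum(xs[-w:]) / w

def exponential_smoothing(xs, alpha):
    """Single exponential smoothing: the final smoothed level is also
    the one-step-ahead forecast."""
    level = xs[0]
    for v in xs[1:]:
        level = alpha * v + (1 - alpha) * level
    return level

print(moving_average_forecast(series, 3))             # mean of last 3: 7.0
print(round(exponential_smoothing(series, 0.5), 4))   # 7.0078
```

Note that neither function distinguishes a predictor from a target: the series is both, which is the defining property of the data-driven family described above.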
• In model-driven methods, time is the predictor or
independent variable and the time series value is the
dependent variable
• Model-based methods are generally preferable when
the time series appears to have a “global” pattern
• The idea is that the model parameters will be able to
capture these patterns
• Thus enable us to make predictions for any step ahead in
the future under the assumption that this pattern is going
to repeat
• For a time series with local patterns instead of a
global pattern, using the model-driven approach
requires specifying how and when the patterns
change, which is difficult
Model Driven Methods
650
• Linear Regression
• Polynomial Regression
• Linear Regression with Seasonality
• Autoregression Models and ARIMA
Model Driven Methods
651
• RapidMiner’s approach to time series is
based on two main data transformation
processes
• The first is windowing to transform the time
series data into a generic data set:
• This step will convert the last row of a window
within the time series into a label or target
variable
• We apply any of the “learners” or algorithms
to predict the target variable and thus
predict the next time step in the series
How to Implement
652
• The parameters of the Windowing operator allow
changing the size of the windows, the overlap between
consecutive windows (step size), and the prediction
horizon, which is used for forecasting
• The prediction horizon controls which row in the raw
data series ends up as the label variable in the
transformed series
Windowing Concept
653
Rapidminer Windowing Operator
654
• Window size: Determines how many “attributes”
are created for the cross-sectional data
• Each row of the original time series within the window
width will become a new attribute
• We choose w = 6
• Step size: Determines how to advance the window
• Let us use s = 1
• Horizon: Determines how far out to make the
forecast
• If the window size is 6 and the horizon is 1, then the
seventh row of the original time series becomes the first
sample for the “label” variable
• Let us use h = 1
Windowing Operator Parameters
655
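The windowing transformation described above (window size w, step size s, horizon h) can be sketched outside RapidMiner as follows; each output row has w "attribute" values plus a label taken h steps past the window:

```python
def window(series, w, s, h):
    """Turn a univariate series into (attributes, label) rows:
    attributes = w consecutive values, label = the value h steps
    after the window ends, advancing the window start by s each time."""
    rows = []
    i = 0
    while i + w + h - 1 < len(series):
        rows.append((series[i:i + w], series[i + w + h - 1]))
        i += s
    return rows

prices = [10, 11, 13, 12, 14, 15, 16, 18]   # assumed toy series
for attrs, label in window(prices, w=6, s=1, h=1):
    print(attrs, "->", label)
# [10, 11, 13, 12, 14, 15] -> 16
# [11, 13, 12, 14, 15, 16] -> 18
```

With w = 6 and h = 1, the seventh value becomes the first label, matching the slide; any standard "learner" can then be trained on the resulting cross-sectional rows.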
• Train a linear regression model on the dataset
hargasaham-training-uni.xls
• Use Split Data to split the dataset above: 90% for
training and 10% for testing
• The Windowing process must first be applied to the
dataset
• Plot the label against the predictions using a chart
Exercise
656
657
• Train a linear regression model on the dataset
hargasaham-training.xls
• Apply the resulting model to the data in
hargasaham-testing-kosong.xls
• The Windowing process must first be applied to the
dataset
• Plot the label against the predictions using a chart
Exercise
658
659
5. Text Mining
5.1 Text Mining Concepts
5.2 Text Clustering
5.3 Text Classification
5.4 Data Mining Laws
660
5.1 Text Mining Concepts
661
1. Text Mining:
• Processes unstructured data in the form of text,
web pages, social media, etc.
• Uses text processing methods to convert the
unstructured data into structured data
• The result is then processed with data mining
2. Data Mining:
• Processes structured data in the form of tables with
attributes and classes
• Uses data mining methods, which divide into
estimation, forecasting, classification, clustering
and association
• Its underlying reasoning comes from statistics or
machine-learning-style heuristics
Data Mining vs Text Mining
662
• The fundamental step is to convert text into semi-structured data
• Then apply the data mining methods to classify, cluster, and predict
How Text Mining Works
663
Text
Processing
Text Mining: Traces of Pornography in Indonesia
664
Text Mining: AHY-AHOK-ANIES
665
1. Dataset
(Understand and
Prepare the Data)
2. Data Mining
Method
(Choose a Method
that Fits the Data)
3. Knowledge
(Understand the Resulting
Model and Knowledge)
4. Evaluation
(Analyze the Model and
the Method’s Performance)
The Data Mining Process
666
DATA PREPROCESSING
Data Cleaning
Data Integration
Data Reduction
Data Transformation
Text Processing
MODELING
Estimation
Prediction
Classification
Clustering
Association
MODEL
Formula
Tree
Cluster
Rule
Correlation
PERFORMANCE
Accuracy
Error Rate
Number of Clusters
MODEL
Attribute/Factor
Correlation
Weight
• Words are separated by a special character: a blank space
• Each word is called a token
• The process of discretizing words within a document is
called tokenization
• For our purpose here, each sentence can be considered a
separate document, although what is considered an
individual document may depend upon the context
• For now, a document here is simply a sequential collection
of tokens
Word, Token and Tokenization
667
• We can impose some form of structure on this raw
data by creating a matrix, where:
• the columns consist of all the tokens found in the two
documents
• the cells of the matrix are the counts of the number of
times a token appears
• Each token is now an attribute in standard data
mining parlance and each document is an example
Matrix of Terms
668
• Basically, unstructured raw data is now transformed
into a format that is recognized, not only by the
human users as a data table, but more importantly
by all the machine learning algorithms which
require such tables for training
• This table is called a document vector or term
document matrix (TDM) and is the cornerstone of
the preprocessing required for text mining
Term Document Matrix (TDM)
669
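Building such a term document matrix takes only a few lines; the two example "documents" below are assumed:

```python
from collections import Counter

docs = [
    "this is a simple document about data mining",
    "this document is about text mining",
]

# Tokenization, then per-document token counts
counts = [Counter(d.split()) for d in docs]
vocab = sorted(set().union(*counts))   # columns of the TDM: every token seen

# One row per document, one column per token, cell = occurrence count
tdm = [[c[tok] for tok in vocab] for c in counts]
print(vocab)
print(tdm)   # each row is one document vector
```

Each token is now an attribute and each document an example, exactly the table shape a learning algorithm expects.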
• We could have also chosen to use the TF–IDF
scores for each term to create the document vector
• N is the number of documents that we are trying to
mine
• Nk is the number of documents that contain the
keyword, k
TF–IDF
670
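Using the N and Nk defined above, a common TF–IDF variant weights each term as tf × log(N / Nk); the sketch below applies it to token counts for two tiny assumed documents (the exact weighting formula varies between tools, so treat this as one representative choice):

```python
import math
from collections import Counter

docs = ["data mining is fun", "text mining mining text"]
counts = [Counter(d.split()) for d in docs]
N = len(docs)   # number of documents being mined

def tf_idf(term, doc_counts):
    """tf * log(N / Nk): high for terms frequent here but rare elsewhere."""
    tf = doc_counts[term]
    n_k = sum(1 for c in counts if term in c)   # documents containing the term
    if n_k == 0:
        return 0.0
    return tf * math.log(N / n_k)

print(tf_idf("data", counts[0]))     # only in doc 0 -> 1 * log(2) ≈ 0.693
print(tf_idf("mining", counts[0]))   # in every doc  -> 1 * log(1) = 0.0
```

A term appearing in every document scores zero, which is the point of the IDF factor: ubiquitous terms carry no discriminating power.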
• The two sample text documents contained common words
such as “a,” “this,” “and,” and other similar terms
• Clearly in larger documents we would expect a larger number
of such terms that do not really convey specific meaning
• Most grammatical necessities such as articles, conjunctions,
prepositions, and pronouns may need to be filtered before we
perform additional analysis
• Such terms are called stopwords and usually include most articles,
conjunctions, pronouns, and prepositions
• Stopword filtering is usually the second step that follows immediately
after tokenization
• Notice that our document vector has a significantly reduced
size after applying standard English stopword filtering
Stopwords
671
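Stopword filtering after tokenization is simply a set-membership test; the tiny stopword list here is assumed, and real lists (English or Indonesian) are much longer:

```python
stopwords = {"a", "this", "and", "is", "the", "of"}   # tiny assumed list

def remove_stopwords(tokens):
    """Drop tokens that appear in the stopword set (case-insensitive)."""
    return [t for t in tokens if t.lower() not in stopwords]

tokens = "this is a short example of stopword filtering".split()
print(remove_stopwords(tokens))   # ['short', 'example', 'stopword', 'filtering']
```

As the slide notes, this step typically shrinks the document vector substantially, since articles, conjunctions, and prepositions dominate raw token counts.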
• Google the keyword:
stopwords bahasa Indonesia
• Download an Indonesian stopword list and
use it in RapidMiner
Indonesian Stopwords
672
• Words such as “recognized,” “recognizable,” or
“recognition” in different usages, but contextually they
may all imply the same meaning, for example:
• “Einstein is a well-recognized name in physics”
• “The physicist went by the easily recognizable name of
Einstein”
• “Few other physicists have the kind of name recognition that
Einstein has”
• The so-called root of all these highlighted words is “recognize”
• By reducing terms in a document to their basic stems,
we can simplify the conversion of unstructured text to
structured data because we now only take into account
the occurrence of the root terms
• This process is called stemming. The most common
stemming technique for text mining in English is the
Porter method (Porter, 1980)
Stemming
673
A Typical Sequence of Preprocessing Steps to
Use in Text Mining
674
• There are families of words in the spoken and written
language that typically go together
• The word “Good” is usually followed by either “Morning,”
“Afternoon,” “Evening,” “Night,” or in Australia, “Day”
• Grouping such terms, called n-grams, and analyzing them
statistically can present new insights
• Search engines use word n-gram models for a variety
of applications, such as:
• Automatic translation, identifying speech patterns,
checking misspelling, entity detection, information
extraction, among many different use cases
N-Grams
675
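Word n-grams are just sliding windows over the token list:

```python
def ngrams(tokens, n):
    """All contiguous n-token sequences, in order of appearance."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "good morning and good day".split()
print(ngrams(tokens, 2))
# [('good', 'morning'), ('morning', 'and'), ('and', 'good'), ('good', 'day')]
```

Counting these bigrams instead of single tokens is what lets a model capture pairings like "Good Morning" that single-word frequencies miss.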
Rapidminer Process of Text Mining
676
5.2 Text Clustering
677
• Run the experiment following Matthew North
(Data Mining for the Masses), Chapter 12
(Text Mining), 2012, pp. 189-215
• Datasets: Federalist Papers
• Follow the text mining workflow used there
and relate it to the concepts already
covered
Exercise
678
• Motivation:
• Gillian is a historian, and she has recently curated an exhibit on the Federalist
Papers, the essays that were written and published in the late 1700’s
• The essays were published anonymously under the author name ‘Publius’,
and no one really knew at the time if ‘Publius’ was one individual or many
• After Alexander Hamilton died in 1804, some notes were discovered that
revealed that he (Hamilton), James Madison and John Jay had been the
authors of the papers
• The notes indicated specific authors for some papers, but not for others:
• John Jay was revealed to be the author for papers 3, 4 and 5
• James Madison for paper 14
• Hamilton for paper 17
• Paper 18 had no author named, but there was evidence that Hamilton and
Madison worked on that one together
• Objective:
• Gillian would like to analyze paper 18’s content in the context of the other
papers with known authors, to see if she can generate some evidence that
the suspected collaboration between Hamilton and Madison is in fact real
1. Business Understanding
679
• The Federalist Papers are available through a number
of sources:
• They have been re-published in book form, they are available on
a number of different web sites
• Their text is archived in many libraries throughout the world
• Gillian’s data set is simple (6 files):
• Federalist03_Jay.txt
• Federalist04_Jay.txt
• Federalist05_Jay.txt
• Federalist14_Madison.txt
• Federalist17_Hamilton.txt
• Federalist18_Collaboration.txt (suspected)
2. Data Understanding
680
Text Processing Extension Installation
Modeling
681
Modeling
682
Operator
Read Document
Operator
K-Means
Parameters
K = 2
Text Processing
Modelling with Annotation
683
• Gillian feels confident that paper 18 is a
collaboration that John Jay did not contribute to
• His vocabulary and grammatical structure were quite
different from those of Hamilton and Madison
Evaluation
684
• Run the experiment following Vijay Kotu
(Predictive Analytics and Data Mining) Chapter 9
(Text Mining), Case Study 1: Keyword Clustering,
pp. 284-287
• Datasets (file pages.txt):
1. https://www.cnnindonesia.com/olahraga
2. https://www.cnnindonesia.com/ekonomi
• Use the Indonesian stopword list (available in the
dataset folder) with the Stopword (Dictionary)
operator, selecting the file stopword-indonesia.txt
• To make things easier, copy the file
09_Text_9.3.1_keyword_clustering_webmining.
rmp into the Repository and then open it in
RapidMiner
• In the Read URL operator, select the file pages.txt,
which contains the URLs
Exercise
685
686
Testing Model (Read Document)
687
Testing Model (Get Page)
688
5.3 Text Classification
689
• Using the concepts and techniques you have
mastered, perform text classification on the
polarity data - small dataset
• Use the Decision Tree algorithm to build the
model
• Take one article from the polaritydata-small-
testing folder, for example from the pos folder,
and test whether the article is predicted as
negative or positive sentiment
Exercise
690
691
692
Measure the Accuracy on polaritydata-small-testing
693
• Using the concepts and techniques you have
mastered, perform text classification on the
polarity data dataset
• Apply several feature selection methods,
both filter and wrapper
• Compare several classification algorithms
and choose the best one
Exercise
694
• Run the experiment following Vijay Kotu
(Predictive Analytics and Data Mining) Chapter 9
(Text Mining), Case Study 2: Predicting the
Gender of Blog Authors, pp. 287-301
• Dataset: blog-gender-dataset.xlsx
• Split Data: 50% for training and 50% for testing
• Use the Naïve Bayes algorithm
• Apply the resulting model to the test data
• Measure its performance
Exercise
695
696
• Run the experiment following Vijay Kotu
(Predictive Analytics and Data Mining)
Chapter 9 (Text Mining), Case Study 2:
Predicting the Gender of Blog Authors, pp. 287-
301
• Datasets:
• blog-gender-dataset.xlsx
• blog-gender-dataset-testing.xlsx
• Use 10-fold cross validation and the Write
Model (Read Model) and Store (Retrieve)
operators
Exercise
697
698
699
700
701
1. Explain the difference between data, information and knowledge!
2. Explain what you know about data mining!
3. Name the main roles of data mining!
4. Name uses of data mining in various fields!
5. What knowledge or patterns can we obtain from the data
below?
Post-Test
702
NIM Gender National
Exam Score
High
School
IPS1 IPS2 IPS3 IPS4 ... Graduated
On Time
10001 M 28 SMAN 2 3.3 3.6 2.89 2.9 Yes
10002 F 27 SMAN 7 4.0 3.2 3.8 3.7 No
10003 F 24 SMAN 1 2.7 3.4 4.0 3.5 No
10004 M 26.4 SMAN 3 3.2 2.7 3.6 3.4 Yes
...
11000 M 23.4 SMAN 5 3.3 2.8 3.1 3.2 Yes
5.4 Data Mining Laws
Tom Khabaza, Nine Laws of Data Mining, 2010
(http://khabaza.codimension.net/index_files/9laws.htm)
703
1. Business objectives are the origin of every data mining
solution
2. Business knowledge is central to every step of the data
mining process
3. Data preparation is more than half of every data mining
process
4. There is no free lunch for the data miner
5. There are always patterns
6. Data mining amplifies perception in the business domain
7. Prediction increases information locally by generalisation
8. The value of data mining results is not determined by the
accuracy or stability of predictive models
9. All patterns are subject to change
Data Mining Laws
704
Business objectives are the origin of every data
mining solution
• This defines the field of data mining: data mining is
concerned with solving business problems and
achieving business goals
• Data mining is not primarily a technology; it is a
process, which has one or more business objectives
at its heart
• Without a business objective, there is no data
mining
• The maxim: “Data Mining is a Business Process”
1 Business Goals Law
705
Business knowledge is central to every step of the
data mining process
• A naive reading of CRISP-DM would see business
knowledge used at the start of the process in
defining goals, and at the end of the process in
guiding deployment of results
• This would be to miss a key property of the data
mining process, that business knowledge has a
central role in every step
2 Business Knowledge Law
706
1. Business understanding must be based on business
knowledge, and so must the mapping of business
objectives to data mining goals
2. Data understanding uses business knowledge to
understand which data is related to the business problem,
and how it is related
3. Data preparation means using business knowledge to
shape the data so that the required business questions
can be asked and answered
4. Modelling means using data mining algorithms to create
predictive models and interpreting both the models and
their behaviour in business terms – that is, understanding
their business relevance
5. Evaluation means understanding the business impact of
using the models
6. Deployment means putting the data mining results to
work in a business process
2 Business Knowledge Law
707
Data preparation is more than half of every data
mining process
• Maxim of data mining: most of the effort in a data
mining project is spent in data acquisition and
preparation, and informal estimates vary from 50 to
80 percent
• The purpose of data preparation is:
1. To put the data into a form in which the data mining
question can be asked
2. To make it easier for the analytical techniques (such as
data mining algorithms) to answer it
3 Data Preparation Law
708
There is No Free Lunch for the Data Miner (NFL-DM)
The right model for a given application can only be discovered by
experiment
• Axiom of machine learning: if we knew enough about a problem
space, we could choose or design an algorithm to find optimal
solutions in that problem space with maximal efficiency
• Arguments for the superiority of one algorithm over others in data
mining rest on the idea that data mining problem spaces have one
particular set of properties, or that these properties can be
discovered by analysis and built into the algorithm
• However, these views arise from the erroneous idea that, in data
mining, the data miner formulates the problem and the algorithm
finds the solution
• In fact, the data miner both formulates the problem and finds the
solution – the algorithm is merely a tool which the data miner uses
to assist with certain steps in this process
4 No Free Lunch Theory
709
• If the problem space were well-understood, the data mining
process would not be needed
• Data mining is the process of searching for as yet unknown
connections
• For a given application, there is not only one problem space
• Different models may be used to solve different parts of the
problem
• The way in which the problem is decomposed is itself often the
result of data mining and not known before the process begins
• The data miner manipulates, or “shapes”, the problem space
by data preparation, so that the grounds for evaluating a
model are constantly shifting
• There is no technical measure of value for a predictive
model
• The business objective itself undergoes revision and
development during the data mining process
• so that the appropriate data mining goals may change completely
4 No Free Lunch Theory
710
There are always patterns
• This law was first stated by David Watkins
• There is always something interesting to be found in
a business-relevant dataset, so that even if the
expected patterns were not found, something else
useful would be found
• A data mining project would not be undertaken
unless business experts expected that patterns
would be present, and it should not be surprising
that the experts are usually right
5 Watkins’ Law
711
Data mining amplifies perception in the
business domain
• How does data mining produce insight? This law approaches
the heart of data mining – why it must be a business process
and not a technical one
• Business problems are solved by people, not by algorithms
• The data miner and the business expert “see” the solution to a
problem, that is the patterns in the domain that allow the
business objective to be achieved
• Thus data mining is, or assists as part of, a perceptual process
• Data mining algorithms reveal patterns that are not normally visible to
human perception
• The data mining process integrates these algorithms with the
normal human perceptual process, which is active in nature
• Within the data mining process, the human problem solver
interprets the results of data mining algorithms and integrates
them into their business understanding
6 Insight Law
712
Prediction increases information locally by
generalisation
• “Predictive models” and “predictive analytics” means “predict the
most likely outcome”
• Other kinds of data mining models, such as clustering and
association, are also characterised as “predictive”; this is a much
looser sense of the term:
• A clustering model might be described as “predicting” the group into
which an individual falls
• An association model might be described as “predicting” one or more
attributes on the basis of those that are known
• What is “prediction” in this sense? What do classification,
regression, clustering and association algorithms and their resultant
models have in common?
• The answer lies in “scoring”, that is the application of a predictive model to
a new example
• The available information about the example in question has been
increased, locally, on the basis of the patterns found by the algorithm and
embodied in the model, that is on the basis of generalisation or induction
7 Prediction Law
713
The value of data mining results is not determined by
the accuracy or stability of predictive models
• Accuracy and stability are useful measures of how
well a predictive model makes its predictions
• Accuracy means how often the predictions are correct
• Stability means how much the predictions would change
if the data used to create the model were a different
sample from the same population
• The value of a predictive model arises in two ways:
• The model’s predictions drive improved (more effective)
action
• The model delivers insight (new knowledge) which leads
to improved strategy
8 Value Law
714
All patterns are subject to change
• The patterns discovered by data mining do not last
forever
• In marketing and CRM applications of data mining, it is
well-understood that patterns of customer behaviour
are subject to change over time
• Fashions change, markets and competition change, and the
economy changes as a whole; for all these reasons, predictive
models become out-of-date and should be refreshed
regularly or when they cease to predict accurately
• The same is true in risk and fraud-related applications of data
mining. Patterns of fraud change with a changing
environment and because criminals change their behaviour in
order to stay ahead of crime prevention efforts
9 Law of Change
715
• Analyze the problems and needs in an organization around you
• Collect and review the available datasets, and relate those
problems and needs to the available data (analyze them against
the 5 roles of data mining)
• Where possible, choose several roles at once to process the data,
for example: run association (factor analysis) together with
estimation or clustering
• Run the CRISP-DM process to solve the organization’s problem
with the data obtained
• During data preparation, perform data cleaning (replace missing
values, replace, filter attributes) so the data is ready for modeling
• Also compare algorithms and apply feature selection to pick the
best patterns and models
• Summarize the evaluation of the resulting patterns/models/
knowledge and relate the evaluation results to the deployment
performed
• Summarize everything as slides, following the Sarah marketing
case study as an example
Assignment: Solving an Organization's Problem
Organization | Problem | Objective | Dataset
KPK | Difficulty identifying corruptor profiles; non-compliance of mandatory reporters with asset reporting (LHKPN) | Classify corruption offender profiles; find associations among offender attributes; classify LHKPN compliance; estimate sentencing demands | LHKPN, prosecution data
BSM | Hard to identify which factors affect financing quality | Classify customer financing-quality profiles | Customer financing data
LKPP | Many consultations and questions from various agencies must be answered | Find association patterns in agency questions; classify question types | Consultation data
BPPK | Hard to handle tweets from the public: are they questions, complaints or suggestions? | Text-mining classification and clustering of complaints, questions and suggestions on social media | Public Twitter data
Universitas Siliwangi | On-time graduation rate is not yet optimal (is the study program a factor?) | Classify student graduation data | Student data
Organization Case Studies
717
| Organization | Problem | Objective | Dataset |
|---|---|---|---|
| Kemenkeu (DJPB) | Difficulty determining the factors for refining performance indicators | 1. How strongly the components relate to refinement potential 2. Clustering of organizational performance data | Organizational performance data |
| Kemenkeu (DJPB) | Difficulty determining the direction of ministry audit opinions | 1. Relationship of several data attributes to the opinion 2. Classification of ministry profiles | Ministry profile data |
| Kemenkeu (DJPB) | High volume of regional-office (kanwil) reports with many attributes to analyze | 1. Relationship of kanwil report indicators to accuracy 2. Clustering of kanwil reporting data 3. Classification of kanwil reporting accuracy | Kanwil reporting data |
| Kemenkeu (DJPB) | Difficulty prioritizing kanwil monitoring | 1. Clustering of kanwil profile data 2. Relationship of several attributes to the kanwil profile clusters | Kanwil transaction and profile data |

Organizational Case Studies
718
| Organization | Problem | Objective | Dataset |
|---|---|---|---|
| Kemenkeu (SDM) | Reward-and-punishment policies for employees are often ineffective | Classification of frequently-late vs. disciplined employee profiles, so problems are detected earlier | Employee data |
| Kemenkeu (SDM) | Only 15% of echelon 4/3/2/1 positions are held by women, although the civil-service intake ratio is nearly balanced | Classification and clustering of echelon 4/3/2/1 official profiles; association of position with employee profile attributes | Employee data |
| Bank Indonesia | Growing circulation of counterfeit money in Indonesia | Association of counterfeit circulation volume with regional profiles of Indonesia; clustering of counterfeit circulation regions | Counterfeit money circulation data |
| Adira Finance | Rising ratio of non-performing loans | Classification of performing vs. non-performing creditors; forecasting of non-performing loan volume; association of bad loans with various attributes | Creditor data |

Organizational Case Studies
719
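Several objectives in these tables are association tasks (counterfeit circulation, bad loans). Before reaching for A Priori or FP-Growth, it helps to see what those algorithms actually measure. The sketch below computes support and confidence for one candidate rule, echoing the soap-and-Fanta example used earlier in the deck; the baskets themselves are made up.

```python
def support_confidence(transactions, antecedent, consequent):
    """support(A -> B) = fraction of all transactions containing A and B;
    confidence(A -> B) = fraction of transactions with A that also have B."""
    n = len(transactions)
    has_a = [t for t in transactions if antecedent <= t]       # subset test
    has_both = [t for t in has_a if consequent <= t]
    return len(has_both) / n, len(has_both) / len(has_a)

# Toy baskets: 3 of 4 contain soap, 2 of those also contain Fanta
baskets = [{"soap", "fanta"}, {"soap"}, {"soap", "fanta"}, {"bread"}]
sup, conf = support_confidence(baskets, {"soap"}, {"fanta"})
print(round(sup, 2), round(conf, 2))  # 0.5 0.67
```

Algorithms such as FP-Growth simply enumerate all rules whose support and confidence exceed user-chosen minimums, instead of checking one rule at a time.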
| Organization | Problem | Objective | Dataset |
|---|---|---|---|
| Kemsos | Complex parameters for determining household poverty levels in Indonesia | Classification of poor-household profiles in Brebes regency | Poor households in Brebes regency |
| Kemsos | Difficulty determining which households to prioritize for social assistance | Clustering of poor-household profiles that have not yet received assistance | Poor households in Belitung regency |
| Kemsos | Difficulty determining which chronic diseases to prioritize for the health-insurance contribution assistance program (PBIJK) | Classification of chronic diseases suffered by poor-household members | Household members in Belitung regency |
| Kemsos | Difficulty identifying poor households in Indonesia | | |

Organizational Case Studies
720
| Organization | Problem | Objective | Dataset |
|---|---|---|---|
| Kemsos | Complex parameters for determining household poverty levels in Indonesia | Clustering of poor-household profiles in Belitung regency | Integrated Social Welfare Data (DTKS), Belitung regency |
| Kemsos | Family Hope Program (PKH) assistance not reaching the intended recipients | Classification of the main attributes influencing PKH recipients | Integrated Social Welfare Data (DTKS), Belitung regency |
| Kemsos | Policy-making for the uninhabitable-house social rehabilitation assistance program | Clustering of house specifications in DTKS household data | Integrated Social Welfare Data (DTKS), West Seram regency |
| Kemsos | Many social-assistance recipients are mistargeted | Clustering of poor-household profiles from the Integrated Social Welfare Data (DTKS) | Integrated Social Welfare Data (DTKS) |

Organizational Case Studies
721
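Many of the Kemsos objectives above are clustering tasks over household profiles. A minimal plain-Python K-Means sketch follows; the household features, their values, and the choice of k=2 are illustrative assumptions, not real DTKS data.

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Plain K-Means: assign each point to its nearest centroid,
    then move each centroid to the mean of its assigned points."""
    random.seed(seed)
    centroids = random.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            d = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            clusters[d.index(min(d))].append(p)
        # Empty clusters keep their old centroid
        centroids = [
            [sum(col) / len(col) for col in zip(*cl)] if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
    return centroids, clusters

# Made-up household profiles: [monthly income (Rp million), household size]
households = [[1.2, 6], [1.0, 7], [1.5, 5],   # low income, large family
              [6.0, 3], [5.5, 2], [7.0, 4]]   # higher income, small family
centroids, clusters = kmeans(households, k=2)
print(centroids)
```

On this toy data the two clusters separate low-income large families from higher-income small ones; with real household data the features would first need scaling so that no single attribute dominates the distance.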
Romi Satria Wahono
romi@romisatriawahono.net
http://romisatriawahono.net
08118228331
Thank You
  • 2. Romi Satria Wahono 2 • SMA Taruna Nusantara Magelang (1993) • B.Eng, M.Eng and Ph.D in Software Engineering Saitama University Japan (1994-2004) Universiti Teknikal Malaysia Melaka (2014) • Core Competency in Enterprise Architecture, Software Engineering and Machine Learning • LIPI Researcher (2004-2007) • Founder, CoFounder and CEO: • PT Brainmatics Cipta Informatika (2005) • PT Imani Prima (2007) • PT IlmuKomputerCom Braindevs Sistema (2014) • PT Brainmatics Indonesia Cendekia (2020) • Professional Member of IEEE, ACM and PMI • IT and Research Award Winners from WSIS (United Nations), Kemdikbud, Ristekdikti, LIPI, etc • SCOPUS/ISI Indexed Journal Reviewer: Information and Software Technology, Journal of Systems and Software, Software: Practice and Experience, etc • Industrial IT Certifications: TOGAF, ITIL, CCAI, CCNA, etc • Enterprise Architecture & Digital Transformation Expert: KPK, INSW, BPPT, LIPI, Kemenkeu, RistekDikti, Pertamina EP, PLN, PJB, PJBI, IP, FIF, Kemlu, ESDM, etc.
  • 3. 3
  • 4. Educational Objectives (Benjamin Bloom) Cognitive Affective Psychomotor Criterion Referenced Instruction (Robert Mager) Competencies Performance Evaluation Minimalism (John Carroll) Start Immediately Minimize the Reading Error Recognition Self-Contained Learning Design 4
  • 6. 1. Jelaskan perbedaan antara data, informasi dan pengetahuan! 2. Jelaskan apa yang anda ketahui tentang data mining! 3. Sebutkan peran utama data mining! 4. Sebutkan pemanfaatan dari data mining di berbagai bidang! 5. Pengetahuan apa yang bisa kita dapatkan dari data di bawah? Pre-Test 6 NIM Gender Nilai UN Asal Sekolah IPS1 IPS2 IPS3 IPS 4 ... Lulus Tepat Waktu 10001 L 28 SMAN 2 3.3 3.6 2.89 2.9 Ya 10002 P 27 SMAN 7 4.0 3.2 3.8 3.7 Tidak 10003 P 24 SMAN 1 2.7 3.4 4.0 3.5 Tidak 10004 L 26.4 SMAN 3 3.2 2.7 3.6 3.4 Ya ... 11000 L 23.4 SMAN 5 3.3 2.8 3.1 3.2 Ya
  • 7. Course Outline • 1.1 Apa dan Mengapa Data Mining? • 1.2 Peran Utama dan Metode Data Mining • 1.3 Sejarah dan Penerapan Data Mining 1.1. Pengantar • 2.1 Proses dan Tools Data Mining • 2.2 Penerapan Proses Data Mining • 2.3 Evaluasi Model Data Mining • 2.4 Proses Data Mining berbasis CRISP-DM 2. Proses • 3.1 Data Cleaning • 3.2 Data Reduction • 3.3 Data Transformation • 3.4 Data Integration 3. Persiapan Data • 4.1 Algoritma Klasifikasi • 4.2 Algoritma Klastering • 4.3 Algoritma Asosiasi • 4.4 Algoritma Estimasi dan Forecasting 4. Algoritma • 5.1 Text Mining Concepts • 5.2 Text Clustering • 5.3 Text Classification • 5.4 Data Mining Laws 5. Text Mining 7
  • 8. 1. Pengantar Data Mining 1.1 Apa dan Mengapa Data Mining? 1.2 Peran Utama dan Metode Data Mining 1.3 Sejarah dan Penerapan Data Mining 8
  • 9. 1.1 Apa dan Mengapa Data Mining? 9
  • 10. Manusia memproduksi beragam data yang jumlah dan ukurannya sangat besar • Astronomi • Bisnis • Kedokteran • Ekonomi • Olahraga • Cuaca • Financial • … Manusia Memproduksi Data 10
  • 11. Astronomi • Sloan Digital Sky Survey • New Mexico, 2000 • 140TB over 10 years • Large Synoptic Survey Telescope • Chile, 2016 • Will acquire 140TB every five days Biologi dan Kedokteran • European Bioinformatics Institute (EBI) • 20PB of data (genomic data doubles in size each year) • A single sequenced human genome can be around 140GB in size Pertumbuhan Data 11 kilobyte (kB) 103 megabyte (MB) 106 gigabyte (GB) 109 terabyte (TB) 1012 petabyte (PB) 1015 exabyte (EB) 1018 zettabyte (ZB) 1021 yottabyte (YB) 1024
  • 12. Perubahan Kultur dan Perilaku 12
  • 13. • Mobile Electronics market • 7B smartphone subscriptions in 2015 • Web & Social Networks generates amount of data • Google processes 100 PB per day, 3 million servers • Facebook has 300 PB of user data per day • Youtube has 1000PB video storage Datangnya Tsunami Data 13 kilobyte (kB) 103 megabyte (MB) 106 gigabyte (GB) 109 terabyte (TB) 1012 petabyte (PB) 1015 exabyte (EB) 1018 zettabyte (ZB) 1021 yottabyte (YB) 1024
  • 14. We are drowning in data, but starving for knowledge! (John Naisbitt, Megatrends, 1988) Kebanjiran Data tapi Miskin Pengetahuan 14
  • 15. Mengubah Data Menjadi Pengetahuan 15 • Data harus kita olah menjadi pengetahuan supaya bisa bermanfaat bagi manusia • Dengan pengetahuan tersebut, manusia dapat: • Melakukan estimasi dan prediksi apa yang terjadi di depan • Melakukan analisis tentang asosiasi, korelasi dan pengelompokan antar data dan atribut • Membantu pengambilan keputusan dan pembuatan kebijakan
  • 16. 16
  • 17. Data Kehadiran Pegawai Data - Informasi – Pengetahuan - Kebijakan 17 NIP TGL DATANG PULANG 1103 02/12/2004 07:20 15:40 1142 02/12/2004 07:45 15:33 1156 02/12/2004 07:51 16:00 1173 02/12/2004 08:00 15:15 1180 02/12/2004 07:01 16:31 1183 02/12/2004 07:49 17:00
  • 18. Informasi Akumulasi Bulanan Kehadiran Pegawai Data - Informasi – Pengetahuan - Kebijakan 18 NIP Masuk Alpa Cuti Sakit Telat 1103 22 1142 18 2 2 1156 10 1 11 1173 12 5 5 1180 10 12
  • 19. Pola Kebiasaan Kehadiran Mingguan Pegawai Data - Informasi – Pengetahuan - Kebijakan 19 Senin Selasa Rabu Kamis Jumat Terlambat 7 0 1 0 5 Pulang Cepat 0 1 1 1 8 Izin 3 0 0 1 4 Alpa 1 0 2 0 2
  • 20. • Kebijakan penataan jam kerja karyawan khusus untuk hari senin dan jumat • Peraturan jam kerja: • Hari Senin dimulai jam 10:00 • Hari Jumat diakhiri jam 14:00 • Sisa jam kerja dikompensasi ke hari lain Data - Informasi – Pengetahuan - Kebijakan 20
  • 21. Data - Informasi – Pengetahuan - Kebijakan 21 Kebijakan Pengetahuan Informasi Data Kebijakan Penataan Jam Kerja Pegawai Pola Kebiasaan Datang- Pulang Pegawai Informasi Rekap Kehadiran Pegawai Data Absensi Pegawai
  • 22. Data - Informasi – Pengetahuan - Kebijakan 22
  • 23. Apa itu Data Mining? 23 Disiplin ilmu yang mempelajari metode untuk mengekstrak pengetahuan atau menemukan pola dari suatu data yang besar
  • 24. • Disiplin ilmu yang mempelajari metode untuk mengekstrak pengetahuan atau menemukan pola dari suatu data yang besar • Ekstraksi dari data ke pengetahuan: 1. Data: fakta yang terekam dan tidak membawa arti 2. Informasi: Rekap, rangkuman, penjelasan dan statistik dari data 3. Pengetahuan: pola, rumus, aturan atau model yang muncul dari data • Nama lain data mining: • Knowledge Discovery in Database (KDD) • Big data • Business intelligence • Knowledge extraction • Pattern analysis • Information harvesting 24 Terminologi dan Nama Lain Data Mining
  • 25. Konsep Proses Data Mining 25 Himpunan Data Metode Data Mining Pengetahuan
  • 26. • Melakukan ekstraksi untuk mendapatkan informasi penting yang sifatnya implisit dan sebelumnya tidak diketahui, dari suatu data (Witten et al., 2011) • Kegiatan yang meliputi pengumpulan, pemakaian data historis untuk menemukan keteraturan, pola dan hubungan dalam set data berukuran besar (Santosa, 2007) • Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) patterns or knowledge from huge amount of data (Han et al., 2011) Definisi Data Mining 26
  • 27. • Puluhan ribu data mahasiswa di kampus yang diambil dari sistem informasi akademik • Apakah pernah kita ubah menjadi pengetahuan yang lebih bermanfaat? TIDAK! • Seperti apa pengetahuan itu? Rumus, Pola, Aturan Contoh Data di Kampus 27
  • 29. • Puluhan ribu data calon anggota legislatif di KPU • Apakah pernah kita ubah menjadi pengetahuan yang lebih bermanfaat? TIDAK! Contoh Data di Komisi Pemilihan Umum 29
  • 30. Prediksi Calon Legislatif DKI Jakarta 30
  • 33. Prediksi Kebakaran Hutan 33 FFMC DMC DC ISI temp RH wind rain ln(area+1) 93.5 139.4 594.2 20.3 17.6 52 5.8 0 0 92.4 124.1 680.7 8.5 17.2 58 1.3 0 0 90.9 126.5 686.5 7 15.6 66 3.1 0 0 85.8 48.3 313.4 3.9 18 42 2.7 0 0.307485 91 129.5 692.6 7 21.7 38 2.2 0 0.357674 90.9 126.5 686.5 7 21.9 39 1.8 0 0.385262 95.5 99.9 513.3 13.2 23.3 31 4.5 0 0.438255 SVM SVM+GA C 4.3 1,840 Gamma (𝛾) 5.9 9,648 Epsilon (𝜀) 3.9 5,615 RMSE 1.391 1.379 4.3 5.9 3.9 1.391 1.840 9.648 5.615 1.379 0 2 4 6 8 10 12 C Gamma Epsilon RMSE SVM SVM+GA
  • 34. Prediksi dan klastering calon tersangka koruptor Profiling dan Prediksi Koruptor 34 Data Data Data Data Aktivitas Penindakan Aktivitas Pencegahan Pengetahuan Asosiasi atribut tersangka koruptor Prediksi pencucian uang Estimasi jenis dan jumlah tahun hukuman
  • 35. Pola Profil Tersangka Koruptor 35
  • 36. Profiling dan Deteksi Kasus TKI 36
  • 38. Pola Aturan Asosiasi dari Data Transaksi 38
  • 39. Pola Aturan Asosiasi di Amazon.com 39
  • 40. Stupid Applications • Sistem Informasi Akademik • Sistem Pencatatan Pemilu • Sistem Laporan Kekayaan Pejabat • Sistem Pencatatan Kredit Smart Applications • Sistem Prediksi Kelulusan Mahasiswa • Sistem Prediksi Hasil Pemilu • Sistem Prediksi Koruptor • Sistem Penentu Kelayakan Kredit From Stupid Apps to Smart Apps 40
  • 42. • Uber - the world’s largest taxi company, owns no vehicles • Google - world’s largest media/advertising company, creates no content • Alibaba - the most valuable retailer, has no inventory • Airbnb - the world’s largest accommodation provider, owns no real estate • Gojek - perusahaan angkutan umum, tanpa memiliki kendaraan Perusahaan Pengolah Pengetahuan 42
  • 43. Data Mining Tasks and Roles in General 43 Increasing potential values to support business decisions End User Business Analyst Data Scientist DBA/ DBE Decision Making Data Presentation Visualization Techniques Data Mining Information Discovery and Modeling Data Exploration Statistical Summary, Metadata, and Description Data Preprocessing, Data Integration, Data Warehouses Data Sources Paper, Files, Web documents, Scientific experiments, Database Systems
  • 44. • Menganalisis data untuk dijadikan pola pengetahuan (model/rule/formula/tree) • Pola pengetahuan (model/rule/formula/tree) dimasukkan ke dalam sistem (software) • Sistem (software) menjadi cerdas dan bermanfaat signifikan dalam meningkatkan value dan benefit dari perusahaan/organisasi • Dimana peran data scientist dalam perusahaan pengembang teknologi (startup business atau GAFAM)? • Data Scientist? Software Engineer? Researcher? IT Infrastructure Engineer? Data Mining Tasks and Roles in Product Development 44
  • 45. 45 How COMPUTING SOLUTION WORKS? JUALAN CD MUSIK GA LAKU? BUAT produk aplikasi langganan musik? SOFTWARE ENGINEERING DATA SCIENCE Infrastructure & Security Service operation Computing research Product management It governance
  • 47. 1. Tremendous amount of data • Algorithms must be highly scalable to handle such as tera-bytes of data 2. High-dimensionality of data • Micro-array may have tens of thousands of dimensions 3. High complexity of data • Data streams and sensor data • Time-series data, temporal data, sequence data • Structure data, graphs, social networks and multi-linked data • Heterogeneous databases and legacy databases • Spatial, spatiotemporal, multimedia, text and Web data • Software programs, scientific simulations 4. New and sophisticated applications Masalah-Masalah di Data Mining 47
  • 48. 1. Jelaskan dengan kalimat sendiri apa yang dimaksud dengan data mining? 2. Sebutkan konsep alur proses data mining! Latihan 48
  • 49. 1.2 Peran Utama dan Metode Data Mining 49
  • 50. Peran Utama Data Mining 50 1. Estimasi 2. Forecasting 3. Klasifikasi 4. Klastering 5. Asosiasi Data Mining Roles (Larose, 2005)
  • 53. Tipe Data Deskripsi Contoh Operasi Ratio (Mutlak) • Data yang diperoleh dengan cara pengukuran, dimana jarak dua titik pada skala sudah diketahui • Mempunyai titik nol yang absolut (*, /) • Umur • Berat badan • Tinggi badan • Jumlah uang geometric mean, harmonic mean, percent variation Interval (Jarak) • Data yang diperoleh dengan cara pengukuran, dimana jarak dua titik pada skala sudah diketahui • Tidak mempunyai titik nol yang absolut (+, - ) • Suhu 0°c-100°c, • Umur 20-30 tahun mean, standard deviation, Pearson's correlation, t and F tests Ordinal (Peringkat) • Data yang diperoleh dengan cara kategorisasi atau klasifikasi • Tetapi diantara data tersebut terdapat hubungan atau berurutan (<, >) • Tingkat kepuasan pelanggan (puas, sedang, tidak puas) median, percentiles, rank correlation, run tests, sign tests Nominal (Label) • Data yang diperoleh dengan cara kategorisasi atau klasifikasi • Menunjukkan beberapa object yang berbeda (=, ) • Kode pos • Jenis kelamin • Nomer id karyawan • Nama kota mode, entropy, contingency correlation, 2 test 53
  • 54. Peran Utama Data Mining 54 1. Estimasi 2. Forecasting 3. Klasifikasi 4. Klastering 5. Asosiasi Data Mining Roles (Larose, 2005)
  • 55. Customer Jumlah Pesanan (P) Jumlah Traffic Light (TL) Jarak (J) Waktu Tempuh (T) 1 3 3 3 16 2 1 7 4 20 3 2 4 6 18 4 4 6 8 36 ... 1000 2 4 2 12 1. Estimasi Waktu Pengiriman Pizza 55 Waktu Tempuh (T) = 0.48P + 0.23TL + 0.5J Pengetahuan Pembelajaran dengan Metode Estimasi (Regresi Linier) Label
  • 56. • Example: 209 different computer configurations • Linear regression function PRP = -55.9 + 0.0489 MYCT + 0.0153 MMIN + 0.0056 MMAX + 0.6410 CACH - 0.2700 CHMIN + 1.480 CHMAX Contoh: Estimasi Performansi CPU 56 0 0 32 128 CHMAX 0 0 8 16 CHMIN Channels Performance Cache (Kb) Main memory (Kb) Cycle time (ns) 45 0 4000 1000 480 209 67 32 8000 512 480 208 … 269 32 32000 8000 29 2 198 256 6000 256 125 1 PRP CACH MMAX MMIN MYCT
  • 57. 1. Formula/Function (Rumus atau Fungsi Regresi) • WAKTU TEMPUH = 0.48 + 0.6 JARAK + 0.34 LAMPU + 0.2 PESANAN 2. Decision Tree (Pohon Keputusan) 3. Korelasi dan Asosiasi 4. Rule (Aturan) • IF ips3=2.8 THEN lulustepatwaktu 5. Cluster (Klaster) Output/Pola/Model/Knowledge 57
  • 58. Dataset harga saham dalam bentuk time series (rentet waktu) 2. Forecasting Harga Saham 58 Pembelajaran dengan Metode Forecasting (Neural Network) Label Time Series
  • 59. Pengetahuan berupa Rumus Neural Network 59 Prediction Plot
  • 63. NIM Gender Nilai UN Asal Sekolah IPS1 IPS2 IPS3 IPS 4 ... Lulus Tepat Waktu 10001 L 28 SMAN 2 3.3 3.6 2.89 2.9 Ya 10002 P 27 SMA DK 4.0 3.2 3.8 3.7 Tidak 10003 P 24 SMAN 1 2.7 3.4 4.0 3.5 Tidak 10004 L 26.4 SMAN 3 3.2 2.7 3.6 3.4 Ya ... ... 11000 L 23.4 SMAN 5 3.3 2.8 3.1 3.2 Ya 3. Klasifikasi Kelulusan Mahasiswa 63 Pembelajaran dengan Metode Klasifikasi (C4.5) Label
  • 64. Pengetahuan Berupa Pohon Keputusan 64
  • 65. • Input: • Output (Rules): If outlook = sunny and humidity = high then play = no If outlook = rainy and windy = true then play = no If outlook = overcast then play = yes If humidity = normal then play = yes If none of the above then play = yes Contoh: Rekomendasi Main Golf 65
  • 66. • Output (Tree): Contoh: Rekomendasi Main Golf 66
  • 68. • Output/Model (Tree): Contoh: Rekomendasi Contact Lens 68
  • 71. 4. Klastering Bunga Iris 71 Pembelajaran dengan Metode Klastering (K-Means) Dataset Tanpa Label
  • 76. 5. Aturan Asosiasi Pembelian Barang 76 Pembelajaran dengan Metode Asosiasi (FP-Growth)
  • 78. • Algoritma association rule (aturan asosiasi) adalah algoritma yang menemukan atribut yang “muncul bersamaan” • Contoh, pada hari kamis malam, 1000 pelanggan telah melakukan belanja di supermaket ABC, dimana: • 200 orang membeli Sabun Mandi • dari 200 orang yang membeli sabun mandi, 50 orangnya membeli Fanta • Jadi, association rule menjadi, “Jika membeli sabun mandi, maka membeli Fanta”, dengan nilai support = 200/1000 = 20% dan nilai confidence = 50/200 = 25% • Algoritma association rule diantaranya adalah: A priori algorithm, FP-Growth algorithm, GRI algorithm Contoh Aturan Asosiasi 78
  • 79. Aturan Asosiasi di Amazon.com 79
  • 80. Korelasi antara jumlah konsumsi minyak pemanas dengan faktor-faktor di bawah: 1. Insulation: Ketebalan insulasi rumah 2. Temperatur: Suhu udara sekitar rumah 3. Heating Oil: Jumlah konsumsi minyak pertahun perrumah 4. Number of Occupant: Jumlah penghuni rumah 5. Average Age: Rata-rata umur penghuni rumah 6. Home Size: Ukuran rumah Heating Oil Consumption 80
  • 81. 81
  • 82. 82
  • 83. Korelasi 4 Variable terhadap Konsumsi Minyak 83 Jumlah Penghuni Rumah Rata-Rata Umur Ketebalan Insulasi Rumah Temperatur Konsumsi Minyak 0.848 -0.774 0.736 0.381
  • 84. Data mining amplifies perception in the business domain • How does data mining produce insight? This law approaches the heart of data mining – why it must be a business process and not a technical one • Business problems are solved by people, not by algorithms • The data miner and the business expert “see” the solution to a problem, that is the patterns in the domain that allow the business objective to be achieved • Thus data mining is, or assists as part of, a perceptual process • Data mining algorithms reveal patterns that are not normally visible to human perception • Within the data mining process, the human problem solver interprets the results of data mining algorithms and integrates them into their business understanding Insight Law (Data Mining Law 6) 84
  • 85. 1. Estimation (Estimasi): Linear Regression (LR), Neural Network (NN), Deep Learning (DL), Support Vector Machine (SVM), Generalized Linear Model (GLM), etc 2. Forecasting (Prediksi/Peramalan): Linear Regression (LR), Neural Network (NN), Deep Learning (DL), Support Vector Machine (SVM), Generalized Linear Model (GLM), etc 3. Classification (Klasifikasi): Decision Tree (CART, ID3, C4.5, Credal DT, Credal C4.5, Adaptative Credal C4.5), Naive Bayes (NB), K-Nearest Neighbor (kNN), Linear Discriminant Analysis (LDA), Logistic Regression (LogR), etc 4. Clustering (Klastering): K-Means, K-Medoids, Self-Organizing Map (SOM), Fuzzy C-Means (FCM), etc 5. Association (Asosiasi): FP-Growth, A Priori, Coefficient of Correlation, Chi Square, etc Metode Data Mining 85
  • 86. 1. Formula/Function (Rumus atau Fungsi Regresi) • WAKTU TEMPUH = 0.48 + 0.6 JARAK + 0.34 LAMPU + 0.2 PESANAN 2. Decision Tree (Pohon Keputusan) 3. Tingkat Korelasi 4. Rule (Aturan) • IF ips3=2.8 THEN lulustepatwaktu 5. Cluster (Klaster) Output/Pola/Model/Knowledge 86
  • 87. Kategorisasi Algoritma Data Mining 87 Supervised Learning Unsupervised Learning Semi- Supervised Learning Association based Learning
  • 88. • Pembelajaran dengan guru, data set memiliki target/label/class • Sebagian besar algoritma data mining (estimation, prediction/forecasting, classification) adalah supervised learning • Algoritma melakukan proses belajar berdasarkan nilai dari variabel target yang terasosiasi dengan nilai dari variable prediktor 1. Supervised Learning 88
  • 90. • Algoritma data mining mencari pola dari semua variable (atribut) • Variable (atribut) yang menjadi target/label/class tidak ditentukan (tidak ada) • Algoritma clustering adalah algoritma unsupervised learning 2. Unsupervised Learning 90
  • 92. • Semi-supervised learning adalah metode data mining yang menggunakan data dengan label dan tidak berlabel sekaligus dalam proses pembelajarannya • Data yang memiliki kelas digunakan untuk membentuk model (pengetahuan), data tanpa label digunakan untuk membuat batasan antara kelas 3. Semi-Supervised Learning 92
  • 93. 1. Sebutkan 5 peran utama data mining! 2. Jelaskan perbedaan estimasi dan forecasting! 3. Jelaskan perbedaan forecasting dan klasifikasi! 4. Jelaskan perbedaan klasifikasi dan klastering! 5. Jelaskan perbedaan klastering dan association! 6. Jelaskan perbedaan estimasi dan klasifikasi! 7. Jelaskan perbedaan estimasi dan klastering! 8. Jelaskan perbedaan supervised dan unsupervised learning! 9. Sebutkan tahapan utama proses data mining! Latihan 93
  • 94. 1.3 Sejarah dan Penerapan Data Mining 94
  • 95. • Sebelum 1600: Empirical science • Disebut sains kalau bentuknya kasat mata • 1600-1950: Theoretical science • Disebut sains kalau bisa dibuktikan secara matematis atau eksperimen • 1950s-1990: Computational science • Seluruh disiplin ilmu bergerak ke komputasi • Lahirnya banyak model komputasi • 1990-sekarang: Data science • Kultur manusia menghasilkan data besar • Kemampuan komputer untuk mengolah data besar • Datangnya data mining sebagai arus utama sains (Jim Gray and Alex Szalay, The World Wide Telescope: An Archetype for Online Science, Communication of ACM, 45(11): 50-54, Nov. 2002) Evolution of Sciences 95
  • 96. 96
  • 98. 98
  • 100. Business objectives are the origin of every data mining solution • This defines the field of data mining: data mining is concerned with solving business problems and achieving business goals • Data mining is not primarily a technology; it is a process, which has one or more business objectives at its heart • Without a business objective, there is no data mining • The maxim: “Data Mining is a Business Process” Business Goals Law (Data Mining Law 1) 100
  • 101. Business Knowledge Law (Data Mining Law 2) 101 Business knowledge is central to every step of the data mining process • A naive reading of CRISP-DM would see business knowledge used at the start of the process in defining goals, and at the end of the process in guiding deployment of results • This would be to miss a key property of the data mining process, that business knowledge has a central role in every step
  • 102. • Marketing: product recommendation, market basket analysis, product targeting, customer retention • Finance: investment support, portfolio management, price forecasting • Banking and Insurance: credit and policy approval, money laundry detection • Security: fraud detection, access control, intrusion detection, virus detection • Manufacturing: process modeling, quality control, resource allocation • Web and Internet: smart search engines, web marketing • Software Engineering: effort estimation, fault prediction • Telecommunication: network monitoring, customer churn prediction, user behavior analysis Private and Commercial Sector 102
  • 103. Use Case: Product Recommendation 103 0 500,000 1,000,000 1,500,000 2,000,000 2,500,000 3,000,000 3,500,000 4,000,000 0 5 10 15 20 25 30 35 Tot.Belanja Jml.Pcs Jml.Item Cluster - 2 Cluster - 3 Cluster - 1
  • 104. • The cost of capturing and correcting defects is expensive • $14,102 per defect in post-release phase (Boehm & Basili 2008) • $60 billion per year (NIST 2002) • Industrial methods of manual software reviews activities can find only 60% of defects (Shull et al. 2002) • The probability of detection of software fault prediction models is higher (71%) than software reviews (60%) Use Case: Software Fault Prediction 104
• 105. • Finance: exchange rate forecasting, sentiment analysis • Taxation: adaptive monitoring, fraud detection • Medicine and Health Care: hypothesis discovery, disease prediction and classification, medical diagnosis • Education: student allocation, resource forecasting • Insurance: worker's compensation analysis • Security: bomb, iceberg detection • Transportation: simulation and analysis, load estimation • Law: legal patent analysis, law and rule analysis • Politics: election prediction Public and Government Sector 105
• 106. • Assessing home-loan credit worthiness at a bank • Planning PLN electricity supply for the Jakarta region • Predicting the profile of corruption suspects from court data • Forecasting stock prices and inflation rates • Analyzing customer purchasing patterns • Separating crude oil from natural gas • Identifying loyal-customer patterns at a telephone operator • Detecting money laundering in banking transactions • Detecting intrusions on a network Examples of Data Mining Applications 106
  • 107. • 1989 IJCAI Workshop on Knowledge Discovery in Databases • Knowledge Discovery in Databases (G. Piatetsky-Shapiro and W. Frawley, 1991) • 1991-1994 Workshops on Knowledge Discovery in Databases • Advances in Knowledge Discovery and Data Mining (U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, 1996) • 1995-1998 International Conferences on Knowledge Discovery in Databases and Data Mining (KDD’95-98) • Journal of Data Mining and Knowledge Discovery (1997) • ACM SIGKDD conferences since 1998 and SIGKDD Explorations • More conferences on data mining • PAKDD (1997), PKDD (1997), SIAM-Data Mining (2001), (IEEE) ICDM (2001), WSDM (2008), etc. • ACM Transactions on KDD (2007) Data Mining Society 107
• 108. Conferences • ACM SIGKDD Int. Conf. on Knowledge Discovery in Databases and Data Mining (KDD) • SIAM Data Mining Conf. (SDM) • (IEEE) Int. Conf. on Data Mining (ICDM) • European Conf. on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML-PKDD) • Pacific-Asia Conf. on Knowledge Discovery and Data Mining (PAKDD) • Int. Conf. on Web Search and Data Mining (WSDM) Journals • ACM Transactions on Knowledge Discovery from Data (TKDD) • ACM Transactions on Information Systems (TOIS) • IEEE Transactions on Knowledge and Data Engineering • Springer Data Mining and Knowledge Discovery • International Journal of Business Intelligence and Data Mining (IJBIDM) Data Mining Conferences and Journals 108
• 109. 2. The Data Mining Process 2.1 Data Mining Process and Tools 2.2 Applying the Data Mining Process 2.3 Evaluating Data Mining Models 2.4 The Data Mining Process with CRISP-DM 109
• 110. 2.1 Data Mining Process and Tools 110
• 111. 1. Dataset (Understand and Prepare the Data) 2. Data Mining Method (Choose a Method that Suits the Data) 3. Knowledge (Understand the Resulting Model and Knowledge) 4. Evaluation (Analyze the Model and the Method's Performance) The Data Mining Process 111 DATA PREPROCESSING Data Cleaning Data Integration Data Reduction Data Transformation MODELING Estimation Prediction Classification Clustering Association MODEL Formula Tree Cluster Rule Correlation PERFORMANCE Accuracy Error Rate Number of Clusters MODEL Attribute/Factor Correlation Weight
• 112. • An attribute is a factor or parameter that causes the class/label/target to occur • There are two kinds of dataset: private and public • Private dataset: a data set taken from the organization we are studying • Banks, hospitals, industry, factories, service companies, etc • Public dataset: a data set taken from public repositories agreed upon by data mining researchers • UCI Repository (http://www.ics.uci.edu/~mlearn/MLRepository.html) • ACM KDD Cup (http://www.sigkdd.org/kddcup/) • PredictionIO (http://docs.prediction.io/datacollection/sample/) • The current trend in data mining research is to test newly developed methods on public datasets, so that the research is comparable, repeatable, and verifiable 1. The Dataset 112
  • 113. Public Data Set (UCI Repository) 113
• 115. 1. Estimation: Linear Regression (LR), Neural Network (NN), Deep Learning (DL), Support Vector Machine (SVM), Generalized Linear Model (GLM), etc 2. Forecasting: Linear Regression (LR), Neural Network (NN), Deep Learning (DL), Support Vector Machine (SVM), Generalized Linear Model (GLM), etc 3. Classification: Decision Tree (CART, ID3, C4.5, Credal DT, Credal C4.5, Adaptative Credal C4.5), Naive Bayes (NB), K-Nearest Neighbor (kNN), Linear Discriminant Analysis (LDA), Logistic Regression (LogR), etc 4. Clustering: K-Means, K-Medoids, Self-Organizing Map (SOM), Fuzzy C-Means (FCM), etc 5. Association: FP-Growth, Apriori, Coefficient of Correlation, Chi Square, etc 2. Data Mining Methods 115
• 116. 1. Formula/Function (regression formula) • WAKTU TEMPUH (travel time) = 0.48 + 0.6 JARAK + 0.34 LAMPU + 0.2 PESANAN 2. Decision Tree 3. Correlation Level 4. Rule • IF ips3=2.8 THEN lulustepatwaktu (graduates on time) 5. Cluster 3. Knowledge (Pattern/Model) 116
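The two directly executable knowledge forms above — a regression formula and a rule — can be sketched in code. This is an illustrative sketch only: the coefficients come from the slide's example formula, and reading the rule's 2.8 as a "greater than or equal" threshold is an assumption, not part of any RapidMiner output.

```python
# Applying two kinds of mined "knowledge": a regression formula and a rule.

def travel_time(jarak, lampu, pesanan):
    # WAKTU TEMPUH = 0.48 + 0.6*JARAK + 0.34*LAMPU + 0.2*PESANAN (from the slide)
    return 0.48 + 0.6 * jarak + 0.34 * lampu + 0.2 * pesanan

def lulus_tepat_waktu(ips3, threshold=2.8):
    # Rule form: IF ips3 >= threshold THEN graduates on time
    # (the >= reading of the slide's "ips3=2.8" is an assumption)
    return "Ya" if ips3 >= threshold else "Tidak"

print(travel_time(10, 3, 2))   # 0.48 + 6.0 + 1.02 + 0.4 = 7.9
print(lulus_tepat_waktu(3.1))  # Ya
```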
• 117. 1. Estimation: Error: Root Mean Square Error (RMSE), MSE, MAPE, etc 2. Prediction/Forecasting: Error: Root Mean Square Error (RMSE), MSE, MAPE, etc 3. Classification: Confusion Matrix: Accuracy ROC Curve: Area Under Curve (AUC) 4. Clustering: Internal Evaluation: Davies–Bouldin index, Dunn index External Evaluation: Rand measure, F-measure, Jaccard index, Fowlkes–Mallows index, Confusion matrix 5. Association: Lift Charts: Lift Ratio Precision and Recall (F-measure) 4. Evaluation (Accuracy, Error, etc) 117
• 118. 1. Accuracy • A measure of how well the model correlates outcomes with the attributes in the data provided • There are many accuracy measures, but all of them depend on the data used 2. Reliability • A measure of how the data mining model holds up when applied to different datasets • A data mining model is reliable if it produces the same general patterns regardless of the testing data supplied 3. Usefulness • Covers the various metrics that measure whether the model provides useful information Criteria for Evaluating and Validating Models 118 A balance among the three is needed, because an accurate model is not necessarily reliable, and a reliable or accurate model is not necessarily useful
  • 119. Magic Quadrant for Data Science Platform (Gartner, 2017) 119
  • 120. Magic Quadrant for Data Science Platform (Gartner, 2018) 120
• 121. • KNIME (Konstanz Information Miner) is a free and open-source data mining platform for data analytics, reporting, and integration • KNIME development began in 2004 with a team of software developers at the University of Konstanz, led by Michael Berthold; it was initially used for research in the pharmaceutical industry • It gained wide adoption from 2006 onward and grew rapidly, entering Gartner's Magic Quadrant for Data Science Platforms in 2017 KNIME 121
• 123. • Development began in 2001 by Ralf Klinkenberg, Ingo Mierswa, and Simon Fischer at the Artificial Intelligence Unit of the University of Dortmund; written in Java • Open source under the AGPL (GNU Affero General Public License) version 3 • Recognized as one of the best data mining and data analytics tools by various analyst firms, including IDC, Gartner, and KDnuggets RapidMiner 123
• 125. 1. Attribute: a characteristic or feature of the data that describes a process or situation • ID, regular attribute 2. Target attribute: the attribute that the data mining process aims to fill in • Label, cluster, weight Attribute Roles in RapidMiner 125
• 126. 1. nominal: categorical values 2. binominal: nominal with two values 3. polynominal: nominal with more than two values 4. numeric: numeric values in general 5. integer: whole numbers 6. real: real numbers 7. text: unstructured free text 8. date_time: date and time 9. date: date only 10. time: time only Attribute Value Types in RapidMiner 126
• 127. 1. Welcome perspective 2. Design perspective 3. Result perspective Perspectives and Views 127
• 128. • The perspective where all processes are created and managed • Switch to the Design perspective by clicking: Design Perspective 128
• 129. • Process Control To control the process flow, such as loops or conditional branches • Utility To group subprocesses, plus macros and loggers • Repository Access To read from and write to repositories • Import To read data from various external formats • Export To write data to various external formats • Data Transformation To transform data and metadata • Modelling For data mining tasks such as classification, regression, clustering, association, etc • Evaluation To measure the quality and performance of models Operators View 129
• 131. • A data mining process is essentially an analysis process containing a workflow of data mining components • The components of this process are called operators; each operator has: 1. Input 2. Output 3. The action it performs 4. The parameters it requires • An operator is connected through its input ports (left) and output ports (right) • Operator status indicators: • Status light: red (not connected), yellow (complete but not yet run), green (ran successfully) • Warning triangle: when there is a status message • Breakpoint: when there is a breakpoint before/after it • Comment: when there is a comment Operators and Processes 131
• 132. • Operators sometimes require parameters to work • Once an operator is selected in the Process view, its parameters are shown in this view Parameters View 132
• 133. • The Help view shows the operator's description • The Comment view shows an editable comment on the operator Help View and Comment View 133
• 134. A collection and sequence of functions (operators) that can be composed visually (visual programming) Designing a Process 134
• 135. A process can be run by: • Pressing the Play button • Choosing the menu Process → Run • Pressing F11 Running a Process 135
• 137. Problems View and Log View 137
• 138. • Install RapidMiner version 9 • Register an account at rapidminer.com and obtain an Educational Program license to process data without record limits RapidMiner Installation and License Registration 138
• 140. 2.2 Applying the Data Mining Process 140
• 141. 1. Dataset (Understand and Prepare the Data) 2. Data Mining Method (Choose a Method that Suits the Data) 3. Knowledge (Understand the Resulting Model and Knowledge) 4. Evaluation (Analyze the Model and the Method's Performance) The Data Mining Process 141 DATA PREPROCESSING Data Cleaning Data Integration Data Reduction Data Transformation MODELING Estimation Prediction Classification Clustering Association MODEL Formula Tree Cluster Rule Correlation PERFORMANCE Accuracy Error Rate Number of Clusters MODEL Attribute/Factor Correlation Weight
• 142. 1. Train on the golf data (from the RapidMiner repositories) using the decision tree algorithm 2. Display the dataset and the knowledge (tree model) produced Exercise: Play-Golf Recommendation 142
• 152. Exercise: Classifying Iris Flower Species 152 1. Train on the Iris data (from the RapidMiner repositories) using the decision tree algorithm 2. Display the dataset and the knowledge (tree model) produced
• 153. Exercise: Clustering Iris Flower Species 153 1. Train on the Iris data (from the RapidMiner repositories) with the k-Means algorithm 2. Display the dataset and the knowledge (cluster model) produced 3. Display a chart of the resulting clusters
• 154. Exercise: Mine/Rock Classification 154 1. Train on the Sonar data (from the RapidMiner repositories) using the decision tree (C4.5) algorithm 2. Display the dataset and the knowledge (tree model) produced
• 155. 1. Train on the Contact Lenses data (contact-lenses.xls) using the decision tree algorithm 2. Use the Read Excel operator (on the fly) or the Import Data feature directly (persistent) 3. Display the dataset and the knowledge (tree model) produced Exercise: Contact Lens Recommendation 155
• 158. 1. Train on the CPU data (cpu.xls) using the linear regression algorithm 2. Test the model produced in step 1 against new data (cpu-testing.xls), which contains 10 configurations whose performance is not yet known 3. Examine the estimated performance of those 10 configurations Exercise: Estimating CPU Performance 158
• 159. Estimating the Performance of cpu-testing.xls 159 cpu-testing.xls cpu.xls CPU Performance = 0.038 * MYCT + 0.017 * MMIN + 0.004 * MMAX + 0.603 * CACH + 1.291 * CHMIN + 0.906 * CHMAX - 43.975
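The regression model above can be applied to a new configuration directly. The coefficients are those on the slide; the example configuration below is hypothetical, not a row from cpu-testing.xls.

```python
# Applying the learned linear regression model to a new CPU configuration.
# Coefficients are taken from the slide; the config values are invented.

COEF = {"MYCT": 0.038, "MMIN": 0.017, "MMAX": 0.004,
        "CACH": 0.603, "CHMIN": 1.291, "CHMAX": 0.906}
INTERCEPT = -43.975

def estimate_performance(config):
    # Linear model: weighted sum of the attributes plus the intercept
    return sum(COEF[k] * config[k] for k in COEF) + INTERCEPT

new_config = {"MYCT": 125, "MMIN": 256, "MMAX": 6000,
              "CACH": 16, "CHMIN": 1, "CHMAX": 4}
print(round(estimate_performance(new_config), 3))
```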
• 160. 1. Train on the election data (datapemilukpu.xls) with an appropriate algorithm 2. The data can be pulled in via Import Data or the Read Excel operator 3. Display the dataset and the knowledge (pattern/model) produced 4. Use the resulting model to predict datapemilukpu-testing.xls Exercise: Predicting Legislative Candidate Electability 160
• 162. 1. Train on the oil-consumption data (HeatingOil.csv) • A dataset of annual heating-oil consumption per household for home heating • Attributes: • Insulation: the home's insulation thickness • Temperature: the air temperature around the home • Heating Oil: annual oil consumption per household • Number of Occupants: the number of people living in the home • Average Age: the occupants' average age • Home Size: the size of the home 2. Use the Set Role operator to choose the label (Heating Oil), instead of selecting it during Import Data 3. Choose an appropriate method to produce a model 4. Apply the resulting model to the new-customer data in HeatingOil-Scoring.csv, so we can estimate how much oil they will consume and manage oil sales stock Exercise: Estimating Oil Consumption 162
• 163. Heating Oil = 3.323 * Insulation - 0.869 * Temperature + 1.968 * Avg_Age + 3.173 * Home_Size + 134.511 The Oil-Consumption Estimation Process 163
• 164. 1. Train on the oil-consumption data (HeatingOil.csv) • A dataset of annual heating-oil consumption per household for home heating • Attributes: • Insulation: the home's insulation thickness • Temperature: the air temperature around the home • Heating Oil: annual oil consumption per household • Number of Occupants: the number of people living in the home • Average Age: the occupants' average age • Home Size: the size of the home 2. The goal is to learn which attributes most influence oil consumption Exercise: Correlation Matrix for Oil Consumption 164
• 166. Correlation of 4 Attributes with Oil Consumption 166 Number of Occupants Average Age Insulation Thickness Temperature Oil Consumption 0.848 -0.774 0.736 0.381
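The correlation values above are Pearson coefficients. A minimal sketch of the computation, on toy data (not the HeatingOil.csv values):

```python
# Pearson correlation coefficient, as used in the correlation-matrix exercise.

import math

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Toy data (hypothetical): insulation thickness vs. heating-oil use.
insulation = [2, 4, 6, 8, 10]
heating_oil = [110, 150, 190, 230, 270]  # perfectly linear, so r = 1.0
print(pearson(insulation, heating_oil))
```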
• 167. 1. Train on the transaction data (transaksi.xlsx) 2. Choose an appropriate method to produce patterns Exercise: Association Rules on Transaction Data 167
• 169. 1. Train on the student graduation data (datakelulusanmahasiswa.xls) 2. Use the Split Data operator to automatically split the data in two at a 0.9:0.1 ratio, with 0.9 for training and 0.1 for testing 3. Choose an appropriate method to produce a pattern that can be tested on the 10% testing data Exercise: Classifying Student Graduation Data 169
• 171. 1. Train on the stock price data (hargasaham-training.xls) using an appropriate algorithm 2. Display the dataset and the knowledge (regression model) produced 3. Test the model produced in step 1 against new data (hargasaham-testing.xls) 4. Visualize the resulting data as a chart using Line or Spline Exercise: Stock Price Forecasting 171
• 174. Exercise: Stock Price Forecasting (Univariate) 174
  • 175. • Window size: Determines how many “attributes” are created for the cross-sectional data • Each row of the original time series within the window width will become a new attribute • We choose w = 6 • Step size: Determines how to advance the window • Let us use s = 1 • Horizon: Determines how far out to make the forecast • If the window size is 6 and the horizon is 1, then the seventh row of the original time series becomes the first sample for the “label” variable • Let us use h = 1 Parameter dari Windowing 175
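The windowing parameters above can be sketched as a plain transform. Under the stated settings (w = 6, s = 1, h = 1), the seventh value of the series becomes the first label, as described:

```python
# Sketch of the windowing transform: a univariate series becomes
# cross-sectional rows of w lagged attributes plus a label h steps ahead.

def windowing(series, w=6, s=1, h=1):
    rows = []
    i = 0
    while i + w + h <= len(series):
        attrs = series[i:i + w]        # w past values become attributes
        label = series[i + w + h - 1]  # value 'horizon' steps after the window
        rows.append((attrs, label))
        i += s                         # advance the window by the step size
    return rows

series = [10, 11, 12, 13, 14, 15, 16, 17]
for attrs, label in windowing(series, w=6, s=1, h=1):
    print(attrs, "->", label)
```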
• 176. • Train with linear regression on the dataset hargasaham-training-uni.xls • Use Split Data to separate the dataset: 90% for training and 10% for testing • The Windowing process must be applied to the dataset • Plot a chart of the label against the predictions Exercise 176
• 177. Stock Price Forecasting (Historical Data) 177
• 178. Stock Price Forecasting (Future Data) 178
• 180. 1. Train with an appropriate algorithm on the dataset: creditapproval-training.xls 2. Test the model built from that training on the dataset: creditapproval-testing.xls Exercise: Credit Approval 180
• 181. 1. Train on the breast cancer data (breasttissue.xls) 2. The dataset is in sheet 2, while sheet 1 contains a description of the data 3. Split the dataset with the Split Data operator: 90% for training and 10% for testing 4. Choose an appropriate method to produce a pattern, and analyze the resulting pattern Exercise: Breast Cancer Detection 181
• 182. 1. Train on the network attack data (intrusion-training.xls) 2. Choose an appropriate method to produce a pattern Exercise: Network Intrusion Detection 182
• 183. 1. Train on the credit risk data (CreditRisk.csv) (http://romisatriawahono.net/lecture/dm/dataset/) 2. Choose an appropriate method to produce a pattern Exercise: Credit Risk Classification 183
• 184. 1. Train on the Music Genre data (musicgenre-small.csv) 2. Choose an appropriate method to produce a pattern Exercise: Music Genre Classification 184
• 185. Marketing Staff Profile and Performance Data 185 NIP Gender Universitas Program Studi IPK Usia Hasil Penjualan Status Keluarga Jumlah Anak Kota Tinggal 1001 L UI Komunikasi 3.1 21 100jt Single 0 Jakarta 1002 P UNDIP Informatika 2.9 26 50jt Menikah 1 Bekasi … NIP Gender Hasil Penjualan Produk A Hasil Penjualan Produk B Hasil Penjualan Layanan C Hasil Penjualan Layanan D Total Hasil Penjualan 1001 L 10 20 50 30 100jt 1002 P 10 10 5 25 50jt … Which attributes are suitable as the class, and which are not?
• 186. Which attributes are suitable as the class, and which are not? Marketing Staff Profile and Performance Data 186 Tahun Total Hasil Penjualan Total Pengeluaran Marketing Total Keuntungan 1990 100jt 98jt 2jt 1991 120jt 100 20jt … …
• 187. Lecturer Profile and Performance Data 187 NIP Gender Universitas Program Studi Absensi Usia Jumlah Penelitian Status Keluarga Disiplin Kota Tinggal 1001 L UI Komunikasi 98% 21 3 Single Baik Jakarta 1002 P UNDIP Informatika 50% 26 4 Menikah Buruk Bekasi … NIP Gender Universitas Program Studi Jumlah Publikasi Jurnal Jumlah Publikasi Konferensi Total Publikasi Penelitian 1001 L UI Komunikasi 5 3 8 1002 P UNDIP Informatika 2 1 3 …
• 188. 1. Dataset – Methods – Knowledge 1. Main Golf dataset (classification) 2. Iris dataset (classification) 3. Iris dataset (clustering) 4. CPU dataset (estimation) 5. Election dataset (classification) 6. Heating Oil dataset (association, estimation) 7. Transaction dataset (association) 8. Stock Price dataset (forecasting) (univariate and multivariate) Competency Check 188
• 189. • Study the various datasets in the dataset folder • Use RapidMiner to process those datasets into knowledge • Choose an algorithm that suits the type of data in each dataset Assignment: Finding and Processing Datasets 189
• 190. 1. Study and master one data mining method from the literature: 1. Naïve Bayes 2. k Nearest Neighbor 3. k-Means 4. C4.5 5. Neural Network 6. Logistic Regression 7. FP Growth 8. Fuzzy C-Means 9. Self-Organizing Map 10. Support Vector Machine 2. Summarize it in detail as slides, in this format: 1. Definition (solid and systematic) 2. Algorithm steps (complete with the formulas) 3. Application of the algorithm steps to a case-study dataset such as Main Golf, Iris, Transactions, CPU, etc (computed by hand (use Excel), not with RapidMiner, and consistent with the algorithm steps) 3. Send the slides and the Excel file to [email protected] one day before the next session 4. Present it in class at the next session, clearly and in proper language Assignment: Master One DM Method 190
• 191. 1. Develop Java code for the chosen algorithm 2. Use only one class (file), named after the algorithm; you may create multiple methods within that class 3. Create an account at Trello.Com and register at https://trello.com/b/ZOwroEYg/course-assignment 4. Create a card under your own name and upload all report files (pptx, xlsx, pdf, etc) to that card 5. Deadline: one day before the next session Assignment: Implement Code for a DM Algorithm 191
• 193. • Summary of definitions: • K-means is ..... (John, 2016) • K-Means is …. (Wedus, 2020) • Kmeans is … (Telo, 2017) • Conclusion on the meaning of k-means: • … Definition 193
• 194. 1. Prepare the dataset 2. Compute A with the formula A = x + y 3. Compute B with the formula B = d + e 4. Repeat steps 1-2-3 until nothing changes k-Means Algorithm Steps 194
• 199. • … 4. Iteration 2 ... and so on 199
• 200. 2.3 Evaluating Data Mining Models 200
• 201. 1. Dataset (Understand and Prepare the Data) 2. Data Mining Method (Choose a Method that Suits the Data) 3. Knowledge (Understand the Resulting Model and Knowledge) 4. Evaluation (Analyze the Model and the Method's Performance) The Data Mining Process 201 DATA PREPROCESSING Data Cleaning Data Integration Data Reduction Data Transformation MODELING Estimation Prediction Classification Clustering Association MODEL Formula Tree Cluster Rule Correlation PERFORMANCE Accuracy Error Rate Number of Clusters MODEL Attribute/Factor Correlation Weight
• 202. 1. Estimation: • Error: Root Mean Square Error (RMSE), MSE, MAPE, etc 2. Prediction/Forecasting: • Error: Root Mean Square Error (RMSE), MSE, MAPE, etc 3. Classification: • Confusion Matrix: Accuracy • ROC Curve: Area Under Curve (AUC) 4. Clustering: • Internal Evaluation: Davies–Bouldin index, Dunn index • External Evaluation: Rand measure, F-measure, Jaccard index, Fowlkes–Mallows index, Confusion matrix 5. Association: • Lift Charts: Lift Ratio • Precision and Recall (F-measure) Evaluating Data Mining Model Performance 202
• 203. • Split the dataset, at a 90:10 or 80:20 ratio, into: • Training data • Testing data • The training data builds the model, and the testing data is used to test it • Ways to separate training and testing data: 1. Split the data manually 2. Split the data automatically with the Split Data operator 3. Split the data automatically with X-Validation Evaluating Data Mining Models 203
• 204. 2.3.1 Manual Data Splitting 204
• 205. • Use these datasets: • creditapproval-training.xls: to build the model • creditapproval-testing.xls: to test the model • The data is already split: testing data (10%) and training data (90%) • Use the training data to build the model and the testing data to test it, then measure its performance Exercise: Credit Approval 205
• 206. • pred MACET - true MACET: cases predicted to default that actually defaulted (TP) • pred LANCAR - true LANCAR: cases predicted to repay that actually repaid (TN) • pred MACET - true LANCAR: cases predicted to default that actually repaid (FP) • pred LANCAR - true MACET: cases predicted to repay that actually defaulted (FN) Confusion Matrix → Accuracy 206 Accuracy = (TP + TN) / (TP + TN + FP + FN) = (53 + 37) / (53 + 37 + 4 + 6) = 90 / 100 = 90%
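Using the confusion-matrix counts above (TP = 53, TN = 37, FP = 4, FN = 6), the accuracy calculation is:

```python
# Accuracy from confusion-matrix counts (values from the credit-approval slide).

def accuracy(tp, tn, fp, fn):
    # Fraction of all cases that were classified correctly
    return (tp + tn) / (tp + tn + fp + fn)

print(accuracy(53, 37, 4, 6))  # 90/100 = 0.9
```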
• 207. • Precision: exactness – what % of tuples that the classifier labeled as positive are actually positive • Recall: completeness – what % of positive tuples did the classifier label as positive? • A perfect score is 1.0 • There is an inverse relationship between precision and recall • F measure (F1 or F-score): the harmonic mean of precision and recall • Fß: a weighted measure of precision and recall • assigns ß times as much weight to recall as to precision Precision and Recall, and F-measures 207
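A sketch of these definitions, reusing the confusion-matrix counts from the earlier credit-approval slide (TP = 53, FP = 4, FN = 6):

```python
# Precision, recall, and the weighted F-measure from the definitions above.

def precision(tp, fp):
    return tp / (tp + fp)

def recall(tp, fn):
    return tp / (tp + fn)

def f_beta(p, r, beta=1.0):
    # beta > 1 weights recall more heavily; beta = 1 is the harmonic mean (F1)
    return (1 + beta ** 2) * p * r / (beta ** 2 * p + r)

p, r = precision(53, 4), recall(53, 6)
print(round(p, 3), round(r, 3), round(f_beta(p, r), 3))
```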
• 208. Binary classification should be both as sensitive and as specific as possible: 1. Sensitivity measures the proportion of true 'positives' that are correctly identified (True Positive Rate (TP Rate), or Recall) 2. Specificity measures the proportion of true 'negatives' that are correctly identified (True Negative Rate (TN Rate)) Sensitivity and Specificity 208
  • 209. We need to know the probability that the classifier will give the correct diagnosis, but the sensitivity and specificity do not give us this information • Positive Predictive Value (PPV) is the proportion of cases with ’positive’ test results that are correctly diagnosed • Negative Predictive Value (NPV) is the proportion of cases with ’negative’ test results that are correctly diagnosed PPV and NPV 209
• 210. • ROC (Receiver Operating Characteristics) curves: for visual comparison of classification models • Originated from signal detection theory • ROC curves are two-dimensional graphs in which the TP rate is plotted on the Y-axis and the FP rate is plotted on the X-axis • An ROC curve depicts the relative trade-offs between benefits ('true positives') and costs ('false positives') • Two types of ROC curves: discrete and continuous ROC Curve - AUC (Area Under Curve) 210
• 211. ROC Curve - AUC (Area Under Curve) 211
  • 212. 1. 0.90 - 1.00 = excellent classification 2. 0.80 - 0.90 = good classification 3. 0.70 - 0.80 = fair classification 4. 0.60 - 0.70 = poor classification 5. 0.50 - 0.60 = failure (Gorunescu, 2011) Guide for Classifying the AUC 212
• 213. • Use the dataset: breasttissue.xls • Split the data: testing data (10%) and training data (90%) • Measure performance (Accuracy and Kappa) Exercise: Breast Cancer Prediction 213
• 214. • The (Cohen's) Kappa statistic is a more robust measure than the 'percentage correct prediction' calculation, because Kappa accounts for correct predictions that occur by chance • Kappa is essentially a measure of how well the classifier performed compared to how well it would have performed simply by chance • A model has a high Kappa score if there is a big difference between the accuracy and the null error rate (Markham, K., 2014) • Kappa is an important measure of classifier performance, especially on imbalanced data sets Kappa Statistics 214
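A sketch of Cohen's kappa for a 2×2 confusion matrix, correcting observed accuracy for chance agreement; the counts reuse the earlier credit-approval example:

```python
# Cohen's kappa: observed accuracy corrected for the agreement expected by chance.

def cohen_kappa(tp, tn, fp, fn):
    n = tp + tn + fp + fn
    observed = (tp + tn) / n
    # chance agreement from the marginal totals of predictions and truths
    p_yes = ((tp + fp) / n) * ((tp + fn) / n)
    p_no = ((fn + tn) / n) * ((fp + tn) / n)
    expected = p_yes + p_no
    return (observed - expected) / (1 - expected)

print(round(cohen_kappa(53, 37, 4, 6), 3))
```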
• 215. • Use these datasets: • hargasaham-training.xls: to build the model • hargasaham-testing.xls: to test the model • The data is already split: testing data (10%) and training data (90%) • Use the training data to build the model/pattern/knowledge and the testing data to test the model • Measure its performance Exercise: Stock Price Prediction 215
• 217. Root Mean Square Error 217 • The square root of the mean of the squares of all of the errors • RMSE is very commonly used and makes an excellent general-purpose error metric for numerical predictions • To construct the RMSE, we first need to determine the residuals • Residuals are the differences between the actual values and the predicted values • We denote them by e_i = y_i − ŷ_i, where y_i is the observed value for the ith observation and ŷ_i is the predicted value • They can be positive or negative, as the predicted value may under- or overestimate the actual value • The RMSE then serves as a measure of the spread of the y values about the predicted y value
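The definition above in code — a minimal RMSE on toy actual/predicted values:

```python
# RMSE: the square root of the mean squared residual.

import math

def rmse(actual, predicted):
    residuals = [a - p for a, p in zip(actual, predicted)]
    return math.sqrt(sum(r * r for r in residuals) / len(residuals))

# Toy values, invented for illustration
actual = [100, 120, 130, 150]
predicted = [110, 115, 135, 145]
print(round(rmse(actual, predicted), 4))
```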
• 218. Exercise: Clustering Iris Flower Species 218 k DBI 3 0.666 4 0.764 5 0.806 6 0.910 7 0.999 1. Train on the iris data (from the RapidMiner repositories) using the k-means clustering algorithm 2. Measure its performance with Cluster Distance Performance, then check and analyze the resulting Davies–Bouldin Index (DBI) 3. Vary the value of k in the k-means parameters, entering the values 3, 4, 5, 6, 7
• 220. • The Davies–Bouldin index (DBI), introduced by David L. Davies and Donald W. Bouldin in 1979, is a metric for evaluating clustering algorithms • It is an internal evaluation scheme, where how well the clustering has been done is validated using quantities and features inherent to the dataset • Because it is a function of the ratio of within-cluster scatter to between-cluster separation, a lower value means the clustering is better • This reflects the idea that no cluster should be similar to another, so the best clustering scheme essentially minimizes the Davies–Bouldin index • The index is an average over all i clusters, so a good way to decide how many clusters actually exist in the data is to plot it against the number of clusters it is calculated over • The number i for which this value is lowest is a good estimate of the number of clusters the data could ideally be classified into Davies–Bouldin index (DBI) 220
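A simplified sketch of the index for small 1-D clusters (scatter here is the mean absolute distance to the centroid; the cluster values are invented for illustration):

```python
# Davies–Bouldin index: the average, over clusters, of the worst-case ratio of
# within-cluster scatter to between-centroid separation (lower is better).

def dbi(clusters):
    centroids = [sum(c) / len(c) for c in clusters]
    scatter = [sum(abs(x - m) for x in c) / len(c)
               for c, m in zip(clusters, centroids)]
    k = len(clusters)
    total = 0.0
    for i in range(k):
        # worst (largest) similarity ratio of cluster i to any other cluster
        ratios = [(scatter[i] + scatter[j]) / abs(centroids[i] - centroids[j])
                  for j in range(k) if j != i]
        total += max(ratios)
    return total / k

clusters = [[1.0, 2.0, 3.0], [8.0, 9.0, 10.0], [20.0, 21.0]]
print(round(dbi(clusters), 3))
```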
• 221. 2.3.2 Automatic Data Splitting with the Split Data Operator 221
  • 222. • The Split Data operator takes a dataset as its input and delivers the subsets of that dataset through its output ports • The sampling type parameter decides how the examples should be shuffled in the resultant partitions: 1. Linear sampling: Divides the dataset into partitions without changing the order of the examples 2. Shuffled sampling: Builds random subsets of the dataset 3. Stratified sampling: Builds random subsets and ensures that the class distribution in the subsets is the same as in the whole dataset Split Data Otomatis 222
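A sketch of the three sampling types above, assuming rows are dicts and stratification splits each class group separately; the function and parameter names here are hypothetical, not RapidMiner's API:

```python
# Linear keeps order, shuffled randomizes, stratified preserves the label
# distribution in each partition (per-class split, then recombined).

import random
from collections import defaultdict

def split_data(rows, ratio, sampling="linear", label=None, seed=42):
    rows = list(rows)
    if sampling == "shuffled":
        random.Random(seed).shuffle(rows)
    elif sampling == "stratified":
        by_class = defaultdict(list)
        for row in rows:
            by_class[row[label]].append(row)
        first, second = [], []
        for group in by_class.values():
            random.Random(seed).shuffle(group)
            cut = int(len(group) * ratio)
            first += group[:cut]
            second += group[cut:]
        return first, second
    cut = int(len(rows) * ratio)
    return rows[:cut], rows[cut:]

data = [{"y": "Ya"}] * 8 + [{"y": "Tidak"}] * 2
train, test = split_data(data, 0.9, sampling="stratified", label="y")
print(len(train), len(test))
```

Note that with stratification and integer rounding, a 0.9 split of this 10-row set yields 8 training rows and 2 testing rows, with the minority class represented in both partitions.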
• 224. 1. Dataset: datakelulusanmahasiswa.xls 2. Automatically split the data in two (Split Data): testing data (10%) and training data (90%) 3. Experiment with the split parameters: Linear Sampling, Shuffled Sampling, and Stratified Sampling 4. Use the training data to build the model/pattern/knowledge and the testing data to test the model 5. Apply an appropriate algorithm and measure the performance of the resulting model Exercise: Predicting Student Graduation 224
• 225. The Student Graduation Prediction Process 225
• 226. 1. Dataset: HeatingOil.csv 2. Automatically split the data in two (Split Data): testing data (10%) and training data (90%) 3. Use the training data to build the model/pattern/knowledge and the testing data to test the model 4. Apply an appropriate algorithm and measure the performance of the resulting model Exercise: Estimating Oil Consumption 226
• 227. 2.3.3 Automatic Data Splitting and Model Evaluation with Cross-Validation 227
• 228. • Cross-validation is used to avoid overlap in the testing data • Cross-validation steps: 1. Split the data into k equally sized subsets 2. Use each subset in turn as testing data and the rest as training data • Also called k-fold cross-validation • The subsets are often stratified before cross-validation is performed, because stratification reduces the variance of the estimate The Cross-Validation Method 228
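The two steps above can be sketched as plain k-fold cross-validation; the "model" here is a trivial majority-class baseline, used only to make the loop runnable:

```python
# k-fold cross-validation: each of the k subsets serves once as the test set
# while the rest form the training set; the scores are averaged.

def k_fold_indices(n, k):
    folds, start = [], 0
    for i in range(k):
        size = n // k + (1 if i < n % k else 0)  # spread the remainder
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def cross_validate(data, k, train_and_score):
    folds = k_fold_indices(len(data), k)
    scores = []
    for fold in folds:
        fold_set = set(fold)
        test = [data[i] for i in fold]
        train = [data[i] for i in range(len(data)) if i not in fold_set]
        scores.append(train_and_score(train, test))
    return sum(scores) / k

def majority_accuracy(train, test):
    # Baseline "model": always predict the training set's majority label
    labels = [y for _, y in train]
    majority = max(set(labels), key=labels.count)
    return sum(1 for _, y in test if y == majority) / len(test)

data = [(i, "Ya" if i % 3 else "Tidak") for i in range(12)]
print(cross_validate(data, k=4, train_and_score=majority_accuracy))
```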
• 229. Orange: the k-th subset (testing data) 10-Fold Cross-Validation 229 Experiment Dataset Accuracy 1 93% 2 91% 3 90% 4 93% 5 93% 6 91% 7 94% 8 93% 9 91% 10 90% Average Accuracy 92%
• 230. • The standard evaluation method: stratified 10-fold cross-validation • Why 10? Extensive experiments and theoretical evidence show that 10-fold cross-validation is the best choice for obtaining accurate validation results • 10-fold cross-validation repeats the test 10 times, and the reported measure is the average of the 10 runs 10-Fold Cross-Validation 230
• 231. 1. Train on the election data (datapemilukpu.xls) 2. Test using 10-fold X-Validation 3. Measure performance with a confusion matrix and ROC curve 4. Experiment by switching the algorithm to Naive Bayes, k-NN, Random Forest (RF), and Logistic Regression (LogR), and analyze which algorithm produces the better model (higher accuracy) Exercise: Predicting Legislative Candidate Electability 231 C4.5 NB k-NN LogR Accuracy 92.87% 79.34% 88.7% AUC 0.934 0.849 0.5
• 232. Exercise: Comparing Stock Price Predictions 232 • Use the stock price dataset (hargasaham-training.xls) • Test with 10-fold X-Validation • Experiment by switching the algorithm (GLM, LR, NN, DL, SVM), and record the resulting RMSE GLM LR NN DL SVM RMSE
• 233. 2.3.4 Comparing Data Mining Algorithms 233
• 234. 1. Estimation: Linear Regression (LR), Neural Network (NN), Deep Learning (DL), Support Vector Machine (SVM), Generalized Linear Model (GLM), etc 2. Forecasting: Linear Regression (LR), Neural Network (NN), Deep Learning (DL), Support Vector Machine (SVM), Generalized Linear Model (GLM), etc 3. Classification: Decision Tree (CART, ID3, C4.5, Credal DT, Credal C4.5, Adaptative Credal C4.5), Naive Bayes (NB), K-Nearest Neighbor (kNN), Linear Discriminant Analysis (LDA), Logistic Regression (LogR), etc 4. Clustering: K-Means, K-Medoids, Self-Organizing Map (SOM), Fuzzy C-Means (FCM), etc 5. Association: FP-Growth, Apriori, Coefficient of Correlation, Chi Square, etc Data Mining Methods 234
• 235. 1. Train on the election data (datapemilukpu.xls) using the algorithms 1. Decision Tree (C4.5) 2. Naïve Bayes (NB) 3. K-Nearest Neighbor (K-NN) 2. Test using 10-fold cross-validation Exercise: Predicting Candidate Electability 235 DT NB K-NN Accuracy 92.45% 77.46% 88.72% AUC 0.851 0.840 0.5
• 237. 1. Train on the election data (datapemilukpu.xls) using the C4.5, NB and K-NN algorithms 2. Test using 10-fold cross-validation 3. Measure performance with a confusion matrix and an ROC curve 4. Run a t-Test to determine the best model Exercise: Predicting Candidate Electability 237
• 239. • Comparison of Accuracy and AUC • Significance test (t-Test): values with a colored background are smaller than alpha=0.050, which indicates a probably significant difference between the mean values • Model ranking: 1. C4.5 2. k-NN 3. NB Candidate Electability Prediction Results 239 C4.5 NB K-NN Accuracy 92.45% 77.46% 88.72% AUC 0.851 0.840 0.5 C4.5 NB kNN C4.5 NB kNN
• 240. • Comparison of Accuracy and AUC • Significance test (t-Test): values with a white background are higher than alpha=0.050, which indicates probably NO significant difference between the mean values • Model ranking: 1. C4.5 1. kNN 2. NB Candidate Electability Prediction Results 240 C4.5 NB K-NN Accuracy 93.41% 79.72% 91.76% AUC 0.921 0.826 0.885 C4.5 NB kNN C4.5 NB kNN
• 241. Exercise: Comparing Stock Price Predictions 241 1. GLM 2. LR 3. NN 4. DL and SVM • Use the stock price dataset (hargasaham-training.xls) • Test with 10-fold cross-validation • Repeat the experiment, switching among the algorithms (GLM, LR, NN, DL, SVM), and record the resulting RMSE • Run a t-Test for significance
• 242. 1. Descriptive Statistics • Mean, standard deviation, variance, maximum, minimum, etc 2. Inferential Statistics • Approximation and estimation • Hypothesis testing Statistical Analysis 242
• 243. Use / Parametric / Non-parametric: Two dependent samples: t-Test, Z-Test / Sign test, Wilcoxon signed-rank, McNemar change test • Two independent samples: t-Test, Z-Test / Mann-Whitney U test, Moses extreme reactions, Chi-square test, Kolmogorov-Smirnov test, Wald-Wolfowitz runs • Several dependent samples: – / Friedman test, Kendall W test, Cochran's Q • Several independent samples: ANOVA (F test) / Kruskal-Wallis test, Chi-square test, Median test Inferential Statistics (Hypothesis Testing) 243
• 244. • Parametric methods can be applied only if several requirements are met: • The samples analyzed must come from a normally distributed population • The amount of data is sufficiently large • The data analyzed is typically interval or ratio data Parametric Methods 244
• 245. • Non-parametric methods can be used more widely, because they do not require the data to be normally distributed • They can be applied to nominal and ordinal data, which makes them very useful for social researchers studying consumer behavior, human attitudes, etc • They tend to be simpler than parametric methods • Alongside these advantages, non-parametric methods have weaknesses: • They lack the clear systematic structure of parametric methods • They are so simple that their results are sometimes questioned • They rely on a wider variety of tables than the standard tables of parametric methods Non-Parametric Methods 245
• 246. • Ho = there is no significant difference • Ha = there is a significant difference • alpha=0.05 • If p < 0.05, Ho is rejected • Example: given p=0.03, what conclusion can be drawn? Interpreting Statistics 246
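The decision rule above can be illustrated with a hand-rolled paired t-test over per-fold accuracies (a sketch: the two accuracy lists are invented, and 2.262 is the standard two-tailed critical value for df = 9 at alpha = 0.05):

```python
import math

def paired_t(a, b):
    """t statistic for paired samples (e.g. per-fold accuracies of two models)."""
    d = [x - y for x, y in zip(a, b)]
    n = len(d)
    mean = sum(d) / n
    var = sum((x - mean) ** 2 for x in d) / (n - 1)   # sample variance of differences
    return mean / math.sqrt(var / n)

# Hypothetical per-fold accuracies of two models from 10-fold cross-validation
acc_c45 = [0.93, 0.91, 0.90, 0.93, 0.93, 0.91, 0.94, 0.93, 0.91, 0.90]
acc_nb  = [0.80, 0.78, 0.79, 0.81, 0.77, 0.79, 0.80, 0.78, 0.79, 0.80]

t = paired_t(acc_c45, acc_nb)
T_CRIT = 2.262   # two-tailed critical value, df = 9, alpha = 0.05
print("significant difference" if abs(t) > T_CRIT else "no significant difference")
```

If |t| exceeds the critical value, Ho (no significant difference) is rejected, which is the same conclusion the p < 0.05 rule gives.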
• 247. 1. Train on the student data (datakelulusanmahasiswa.xls) using C4.5, ID3, NB, K-NN, RF and LogR 2. Test using 10-fold cross-validation 3. Run a t-Test to determine the best model Exercise: Predicting Student Graduation 247
• 248. • Comparison of Accuracy and AUC • Significance test (t-Test) • Model ranking: 1. C4.5 2. NB 2. k-NN 2. LogR Student Graduation Prediction Results 248 C4.5 NB K-NN LogR Accuracy 91.55% 82.58% 83.63% 77.47% AUC 0.909 0.894 0.5 0.721 C4.5 NB kNN LogR C4.5 NB kNN LogR
• 249. 1. Train on the CPU data (cpu.xls) using the linear regression, neural network and support vector machine algorithms 2. Test with cross-validation (numerical) 3. Measure performance using RMSE (Root Mean Square Error) 4. Model ranking: 1. LR 2. NN 3. SVM Exercise: Estimating CPU Performance 249 LR NN SVM RMSE 54.676 55.192 94.676
• 251. 1. Train on the heating oil data (HeatingOil.csv) using the linear regression, neural network, support vector machine and Deep Learning algorithms 2. Test with cross-validation (numerical) and run a t-Test for significance 3. Measure performance using RMSE (Root Mean Square Error) Exercise: Estimating Heating Oil Consumption 251 LR NN SVM DL RMSE
• 252. 252 LR NN DL SVM LR NN DL SVM Model ranking: 1. NN and DL 2. LR and SVM
• 253. 1. Train on the election data (datapemilukpu.xls) using the Decision Tree, Naive Bayes, K-Nearest Neighbor, RandomForest and Logistic Regression algorithms 2. Test using cross-validation 3. Measure performance with a confusion matrix and an ROC curve 4. Record the result of each experiment in an Excel file Exercise: Predicting Candidate Electability 253 DT NB K-NN RandFor LogReg Accuracy 92.21% 76.89% 89.63% AUC 0.851 0.826 0.5
• 254. 1. Train on the stock price data (hargasaham-training.xls) using neural network, linear regression and support vector machine 2. Test using cross-validation Exercise: Predicting Stock Prices 254 LR NN SVM RMSE
• 255. 1. Train on the iris data (taken from the RapidMiner repositories) using the k-means clustering algorithm 2. Try several values of k: 3, 4, 5, 6, 7 3. Measure performance with Cluster Distance Performance; from the Davies-Bouldin Index (DBI) analysis, determine the most optimal value of k Exercise: Clustering Iris Flower Species 255 k=3 k=4 k=5 k=6 k=7 DBI 0.666 0.764 0.806 0.910 0.99
• 256. • The Davies–Bouldin index (DBI), introduced by David L. Davies and Donald W. Bouldin in 1979, is a metric for evaluating clustering algorithms • It is an internal evaluation scheme, where the validation of how well the clustering has been done is made using quantities and features inherent to the dataset • As a function of the ratio of within-cluster scatter to between-cluster separation, a lower value means that the clustering is better • This affirms the idea that no cluster has to be similar to another, and hence the best clustering scheme essentially minimizes the Davies–Bouldin index • The index is an average over all i clusters, so a good way of deciding how many clusters actually exist in the data is to plot it against the number of clusters it is calculated over • The number i for which this value is lowest is a good measure of the number of clusters the data could ideally be classified into Davies–Bouldin index (DBI) 256
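The definition above can be computed directly (a minimal sketch with tiny made-up clusters; points are tuples so the same code works in any dimension):

```python
import math

def dbi(clusters):
    """Davies-Bouldin index for a list of clusters (each a list of point tuples)."""
    centroids = [[sum(c) / len(c) for c in zip(*cl)] for cl in clusters]

    def dist(p, q):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

    # s_i: average distance of points in cluster i to its centroid (within-cluster scatter)
    s = [sum(dist(p, centroids[i]) for p in cl) / len(cl)
         for i, cl in enumerate(clusters)]
    k = len(clusters)
    total = 0.0
    for i in range(k):
        # worst-case ratio of combined scatter to between-cluster separation
        total += max((s[i] + s[j]) / dist(centroids[i], centroids[j])
                     for j in range(k) if j != i)
    return total / k

tight = [[(0.0,), (0.2,)], [(5.0,), (5.2,)]]   # compact, well separated -> low DBI
loose = [[(0.0,), (2.0,)], [(3.0,), (5.0,)]]   # spread out, close together -> higher DBI
print(dbi(tight) < dbi(loose))   # True
```

Running this over clusterings with k = 3..7 and picking the k with the lowest value is exactly the procedure the iris exercise above asks for.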
• 257. 1. Estimation: • Error: Root Mean Square Error (RMSE), MSE, MAPE, etc 2. Prediction/Forecasting: • Error: Root Mean Square Error (RMSE), MSE, MAPE, etc 3. Classification: • Confusion Matrix: Accuracy • ROC Curve: Area Under Curve (AUC) 4. Clustering: • Internal Evaluation: Davies–Bouldin index, Dunn index • External Evaluation: Rand measure, F-measure, Jaccard index, Fowlkes–Mallows index, Confusion matrix 5. Association: • Lift Charts: Lift Ratio • Precision and Recall (F-measure) Evaluating Data Mining Models 257
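The two most frequently used measures in the list, RMSE for estimation/forecasting and accuracy for classification, take only a few lines (the sample values are illustrative only):

```python
import math

def rmse(actual, predicted):
    """Root Mean Square Error for estimation/forecasting models."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

def accuracy(actual, predicted):
    """Share of correct predictions, i.e. the confusion-matrix diagonal."""
    return sum(a == p for a, p in zip(actual, predicted)) / len(actual)

print(rmse([3.0, 4.0, 5.0], [2.5, 4.0, 5.5]))              # ≈ 0.408
print(accuracy(["Ya", "Tidak", "Ya"], ["Ya", "Ya", "Ya"]))  # ≈ 0.667
```

Lower RMSE is better (it is an error), while higher accuracy is better, which is why the model rankings in the preceding exercises read in opposite directions for the two measures.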
• 258. 1. Experiment with every dataset in the datasets folder, using the various data mining methods as appropriate (estimation, forecasting, classification, clustering, association) 2. Combine testing via a training-testing data split with testing via cross-validation 3. Measure the performance of the resulting models using the measurement method that matches the chosen data mining method 4. Describe the experimental steps in detail, then perform analysis and synthesis, and write up the report as slides 5. Present it in front of the class Assignment: Processing All Datasets 258
• 259. • Technical Paper: • Title: Application and Comparison of Classification Techniques in Controlling Credit Risk • Author: Lan Yu, Guoqing Chen, Andy Koronios, Shiwu Zhu, and Xunhua Guo • Download: http://romisatriawahono.net/lecture/dm/paper/ • Read and understand the paper above and explain what the researchers did in it: 1. Research Object 2. Research Problem 3. Research Objective 4. Research Method 5. Research Results Assignment: Reviewing a Paper 259
• 260. • Technical Paper: • Title: A Comparison Framework of Classification Models for Software Defect Prediction • Author: Romi Satria Wahono, Nanna Suryana Herman, Sabrina Ahmad • Publications: Adv. Sci. Lett. Vol. 20, No. 10-12, 2014 • Download: http://romisatriawahono.net/lecture/dm/paper • Read and understand the paper above and explain what the researchers did in it: 1. Research Object 2. Research Problem 3. Research Objective 4. Research Method 5. Research Results Assignment: Reviewing a Paper 260
• 261. • Technical Paper: • Title: An experimental comparison of classification algorithms for imbalanced credit scoring data sets • Author: Iain Brown and Christophe Mues • Publications: Expert Systems with Applications 39 (2012) 3446–3453 • Download: http://romisatriawahono.net/lecture/dm/paper • Read and understand the paper above and explain what the researchers did in it: 1. Research Object 2. Research Problem 3. Research Objective 4. Research Method 5. Research Results Assignment: Reviewing a Paper 261
• 262. • Find a dataset from our own surroundings • Conduct a study comparing (at least) 5 machine learning algorithms for mining knowledge from that dataset • Use significance tests (both parametric and non-parametric) to analyze and rank the machine learning algorithms • Write a paper about the study • Example comparison papers are available at: http://romisatriawahono.net/lecture/dm/paper/method%20comparison/ • Upload all report files to a Card on Trello.Com • Deadline: one day before the next lecture Assignment: Writing a Research Paper 262
• 263. • Follow the template and sample papers from: http://journal.ilmukomputer.org • Paper contents: • Abstract: Must contain object-problem-method-results • Introduction: Background of the research problem and structure of the paper • Related Work: Related research • Theoretical Foundation: Foundations of the theories used • Proposed Method: The proposed method • Experimental Results: Results of the experiments • Conclusion: Conclusions and future works Paper Formatting 263
• 264. 1. Dataset – Methods – Knowledge 1. Golf Dataset (Classification) 2. Iris Dataset (Classification) 3. Iris Dataset (Clustering) 4. CPU Dataset (Estimation) 5. Election Dataset (Classification) 6. Heating Oil Dataset (Association) 7. Transaction Dataset (Association) 8. Stock Price Dataset (Forecasting) 2. Dataset – Methods – Knowledge – Evaluation 1. Manual 2. Data Split Operator 3. Cross Validation 3. Methods Comparison • t-Test 4. Paper Reading 1. Lan Yu (DeLong Pearson Test) 2. Wahono (Friedman Test) Competency Check 264
• 265. 2.4 The Data Mining Process with the CRISP-DM Methodology 265
• 266. • Industry, with its many different fields, needs a standard process that supports the use of data mining to solve business problems • The process must be usable across industries (cross-industry), be neutral with respect to business, tools and applications, and be able to handle business problem-solving strategies using data mining • In 1996, one of the standard processes of the data mining world was born, later called the Cross-Industry Standard Process for Data Mining (CRISP–DM) (Chapman, 2000) Data Mining Standard Process 266
  • 268. • Enunciate the project objectives and requirements clearly in terms of the business or research unit as a whole • Translate these goals and restrictions into the formulation of a data mining problem definition • Prepare a preliminary strategy for achieving these objectives • Designing what you are going to build 1. Business Understanding 268
  • 269. • Collect the data • Use exploratory data analysis to familiarize yourself with the data and discover initial insights • Evaluate the quality of the data • If desired, select interesting subsets that may contain actionable patterns 2. Data Understanding 269
  • 270. • Prepare from the initial raw data the final data set that is to be used for all subsequent phases • Select the cases and variables you want to analyze and that are appropriate for your analysis • Perform data cleaning, integration, reduction and transformation, so it is ready for the modeling tools 3. Data Preparation 270
  • 271. • Select and apply appropriate modeling techniques • Calibrate model settings to optimize results • Remember that often, several different techniques may be used for the same data mining problem • If necessary, loop back to the data preparation phase to bring the form of the data into line with the specific requirements of a particular data mining technique 4. Modeling 271
  • 272. • Evaluate the one or more models delivered in the modeling phase for quality and effectiveness before deploying them for use in the field • Determine whether the model in fact achieves the objectives set for it in the first phase • Establish whether some important facet of the business or research problem has not been accounted for sufficiently • Come to a decision regarding use of the data mining results 5. Evaluation 272
  • 273. • Make use of the models created: • model creation does not signify the completion of a project • Example of a simple deployment: • Generate a report • Example of a more complex deployment: • Implement a parallel data mining process in another department • For businesses, the customer often carries out the deployment based on your model 6. Deployment 273
• 275. CRISP-DM Case Study Heating Oil Consumption – Correlational Methods (Matthew North, Data Mining for the Masses 2nd Edition, 2016, Chapter 4 Correlational Methods, pp. 69-76) Dataset: HeatingOil.csv 275
  • 277. • Problems: • Sarah is a regional sales manager for a nationwide supplier of fossil fuels for home heating • Marketing performance is very poor and decreasing, while marketing spending is increasing • She feels a need to understand the types of behaviors and other factors that may influence the demand for heating oil in the domestic market • She recognizes that there are many factors that influence heating oil consumption, and believes that by investigating the relationship between a number of those factors, she will be able to better monitor and respond to heating oil demand, and also help her to design marketing strategy in the future • Objective: • To investigate the relationship between a number of factors that influence heating oil consumption 1. Business Understanding 277
• 278. • In order to investigate her question, Sarah has enlisted our help in creating a correlation matrix of six attributes • Using her employer's data resources, which are primarily drawn from the company's billing database, we create a data set comprised of the following attributes: 1. Insulation: This is a density rating, ranging from one to ten, indicating the thickness of each home's insulation. A home with a density rating of one is poorly insulated, while a home with a density of ten has excellent insulation 2. Temperature: This is the average outdoor ambient temperature at each home for the most recent year, measured in degrees Fahrenheit 3. Heating_Oil: This is the total number of units of heating oil purchased by the owner of each home in the most recent year 4. Num_Occupants: This is the total number of occupants living in each home 5. Avg_Age: This is the average age of those occupants 6. Home_Size: This is a rating, on a scale of one to eight, of the home's overall size. The higher the number, the larger the home 2. Data Understanding 278
  • 279. Data set: HeatingOil.csv 3. Data Preparation 279
  • 280. • Data set appears to be very clean with: • No missing values in any of the six attributes • No inconsistent data apparent in our ranges (Min-Max) or other descriptive statistics 3. Data Preparation 280
• 282. • The correlation matrix is produced as a table • The higher the value (the darker the purple shading), the stronger the correlation 4. Modeling 282
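Each cell of such a correlation matrix is a pairwise Pearson coefficient, which can be sketched as follows (the mini-columns are invented numbers in the spirit of HeatingOil.csv, not real data):

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length columns."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical mini-sample: heating oil use, average occupant age, temperature
heating_oil = [250, 180, 310, 205, 270]
avg_age     = [65,  42,  78,  50,  70]
temperature = [38,  60,  25,  55,  33]

print(round(pearson(heating_oil, avg_age), 2))      # strong positive correlation
print(round(pearson(heating_oil, temperature), 2))  # strong negative correlation
```

A full matrix is just this function applied to every pair of attribute columns; values near +1 or -1 correspond to the darkly shaded cells in the table.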
• 284. • The attribute (factor) most significantly associated (positive relationship) with heating oil consumption (Heating_Oil) is the Average Age of the home's occupants • The second most influential attribute is Temperature (negative relationship) • The third most influential attribute is Insulation (positive relationship) • The Home_Size attribute has only a very small influence, while Num_Occupants can be said to have no influence on heating oil consumption 5. Evaluation 284
• 285. • The chart shows that oil consumption has a positive correlation with average age • There are, however, some anomalies: 1. Some households with a high average age have low oil demand (light blue in the upper part of the left column) 2. Some households with a low average age have high oil demand (red in the lower part of the right column) 5. Evaluation 285 1 2
• 286. 1. The chart shows the relationship between temperature and insulation, with color indicating oil consumption (the redder, the higher the demand) 2. In general, the relationship of temperature with insulation and with oil consumption is negative: the lower the temperature, the higher the oil demand (upper part of the left column), shown by the many yellow and red points 3. Insulation is also negatively related to temperature, so the lower the temperature, the more insulation is needed 4. Some anomalies appear at low insulation values, where several homes still require a lot of oil 5. Evaluation 286 2 and 3 2 and 3 4
• 287. 1. The three-dimensional chart shows the relationship between temperature, average age and insulation 2. Color indicates oil demand; the redder, the higher 3. The higher the temperature, the lower the oil demand (dark blue) 4. The higher the average age and insulation, the higher the oil demand 5. Evaluation 287 4 2
  • 288. Dropping the Num_Occupants attribute • While the number of people living in a home might logically seem like a variable that would influence energy usage, in our model it did not correlate in any significant way with anything else • Sometimes there are attributes that don’t turn out to be very interesting 6. Deployment 288
  • 289. Adding additional attributes to the data set • It turned out that the number of occupants in the home didn’t correlate much with other attributes, but that doesn’t mean that other attributes would be equally uninteresting • For example, what if Sarah had access to the number of furnaces and/or boilers in each home? • Home_size was slightly correlated with Heating_Oil usage, so perhaps the number of instruments that consume heating oil in each home would tell an interesting story, or at least add to her insight 6. Deployment 289
  • 290. Investigating the role of home insulation • The Insulation rating attribute was fairly strongly correlated with a number of other attributes • There may be some opportunity there to partner with a company that specializes in adding insulation to existing homes 6. Deployment 290
  • 291. Focusing the marketing efforts to the city with low temperature and high average age of citizen • The temperature attribute was fairly strongly negative correlated with a heating oil consumption • The average age attribute was strongest positive correlated with a heating oil consumption 6. Deployment 291
  • 292. Adding greater granularity in the data set • This data set has yielded some interesting results, but it’s pretty general • We have used average yearly temperatures and total annual number of heating oil units in this model • But we also know that temperatures fluctuate throughout the year in most areas of the world, and thus monthly, or even weekly measures would not only be likely to show more detailed results of demand and usage over time, but the correlations between attributes would probably be more interesting • From our model, Sarah now knows how certain attributes interact with one another, but in the day-to-day business of doing her job, she’ll probably want to know about usage over time periods shorter than one year 6. Deployment 292
• 293. CRISP-DM Case Study Heating Oil Consumption – Linear Regression (Matthew North, Data Mining for the Masses 2nd Edition, 2016, Chapter 8 Linear Regression, pp. 159-171) Dataset: HeatingOil.csv Dataset: HeatingOil-scoring.csv http://romisatriawahono.net/lecture/dm/dataset/ 293
  • 296. • Problems: • Business is booming, her sales team is signing up thousands of new clients, and she wants to be sure the company will be able to meet this new level of demand • Sarah’s new data mining objective is pretty clear: she wants to anticipate demand for a consumable product • We will use a linear regression model to help her with her desired predictions. She has data, 1,218 observations that give an attribute profile for each home, along with those homes’ annual heating oil consumption • She wants to use this data set as training data to predict the usage that 42,650 new clients will bring to her company • She knows that these new clients’ homes are similar in nature to her existing client base, so the existing customers’ usage behavior should serve as a solid gauge for predicting future usage by new customers • Objective: • to predict the usage that 42,650 new clients will bring to her company 1. Business Understanding 296
• 297. • Sarah has assembled a separate Comma Separated Values file containing all of these same attributes, for her 42,650 new clients • She has provided this data set to us to use as the scoring data set in our model • Data set comprised of the following attributes: • Insulation: This is a density rating, ranging from one to ten, indicating the thickness of each home's insulation. A home with a density rating of one is poorly insulated, while a home with a density of ten has excellent insulation • Temperature: This is the average outdoor ambient temperature at each home for the most recent year, measured in degrees Fahrenheit • Heating_Oil: This is the total number of units of heating oil purchased by the owner of each home in the most recent year • Num_Occupants: This is the total number of occupants living in each home • Avg_Age: This is the average age of those occupants • Home_Size: This is a rating, on a scale of one to eight, of the home's overall size. The higher the number, the larger the home 2. Data Understanding 297
• 298. • Filter Examples: attribute value filter or custom filter • Avg_Age>=15.1 • Avg_Age<=72.2 • Deleted records = 42,650 − 42,042 = 608 3. Data Preparation 298
• 301. 4. Evaluation – The Regression Model 301
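A univariate least-squares fit illustrates the kind of model produced here (the real model regresses on all attributes at once; this sketch uses one predictor and invented training pairs):

```python
def fit_line(x, y):
    """Least-squares fit y ≈ intercept + slope * x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    return my - slope * mx, slope

# Hypothetical training pairs: (Avg_Age, Heating_Oil units) -- illustrative only
ages = [40.0, 50.0, 60.0, 70.0]
oil  = [150.0, 200.0, 250.0, 300.0]

intercept, slope = fit_line(ages, oil)
print(intercept + slope * 55.0)   # predicted usage for Avg_Age = 55 -> 225.0
```

Scoring then means applying the fitted coefficients to each new client's attribute values, which is what the scoring data set of 42,650 records is used for in this case study.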
• 302. 4. Evaluation – Prediction Results 302
• 305. • Thanks to the earlier data mining work, Sarah has been promoted to VP of marketing, managing hundreds of marketers • Sarah wants each marketer to be able to predict their own potential customers independently. The problem is that HeatingOil.csv may only be accessed at the VP level (Sarah), and the marketers are not allowed to access it directly • Sarah wants each marketer to build a process that estimates the heating oil consumption of the clients they approach, using the model Sarah produced earlier, without accessing the training data (HeatingOil.csv) • Assume that HeatingOil-Marketing.csv contains the prospective customers successfully approached by one of her marketers • Sarah's job is to build a process that: 1. Compares the algorithms (LR, NN, SVM) to find the model with the highest accuracy, using 10-fold cross-validation 2. Stores the best model to a file (Store operator) • Each marketer's job is to build a process that: 1. Reads the model produced by Sarah (Retrieve operator) 2. Applies it to their own HeatingOil-Marketing.csv data • Let us help Sarah and the marketers build these two processes Exercise 305
• 307. Data Testing Process (Marketer) 307
• 308. CRISP-DM Case Study Student Graduation at Universitas Suka Belajar Dataset: datakelulusanmahasiswa.xls 308
• 310. • Problems: • Budi is the Rector of Universitas Suka Belajar • Universitas Suka Belajar has a serious problem because the on-time graduation rate of each cohort is very low • Budi wants to understand and model the profiles of students who graduate on time and those who do not • With such a pattern, Budi can offer counseling, intervention, and early warnings to students at risk of not graduating on time so they can improve and ultimately graduate on time • Objective: • Discover the pattern of students who graduate on time and those who do not 1. Business Understanding 310
• 311. • To address the problem, Budi pulled data from the academic information system of his university • The data were collected from student profiles and per-semester grade records, with the attributes below 1. NAMA (name) 2. JENIS KELAMIN (gender): Laki-Laki or Perempuan 3. STATUS MAHASISWA (student status): Mahasiswa (full-time) or Bekerja (working) 4. UMUR (age) 5. STATUS NIKAH (marital status): Menikah or Belum Menikah 6.-13. IPS 1 to IPS 8: semester GPA for semesters 1 through 8 14. IPK: cumulative GPA 15. STATUS KELULUSAN (graduation status): Terlambat (late) or Tepat (on time) 2. Data Understanding 311
• 313. • There are 379 student records with 15 attributes • There are 10 missing values, and the data contains no noise 3. Data Preparation 313
• 314. • Missing values are handled by filling them in with the mean value • The result is clean data with no missing values 3. Data Preparation 314
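Mean replacement as described above can be sketched like this (using None as the missing-value marker, an assumption of the example):

```python
def replace_missing_with_mean(rows, column):
    """Fill None values in `column` with the mean of the observed values."""
    observed = [r[column] for r in rows if r[column] is not None]
    mean = sum(observed) / len(observed)
    for r in rows:
        if r[column] is None:
            r[column] = mean
    return rows

students = [{"IPS1": 3.0}, {"IPS1": None}, {"IPS1": 4.0}]
replace_missing_with_mean(students, "IPS1")
print([s["IPS1"] for s in students])   # [3.0, 3.5, 4.0]
```

Mean imputation keeps every record usable for modeling without shifting the column's average, which is why it is a common default for numeric attributes like the semester GPAs here.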
• 315. • Model the dataset with a Decision Tree • The resulting pattern can take the form of a tree or of if-then rules 4. Modeling 315
• 316. The pattern extracted from the data, in the form of a decision tree 4. Modeling 316
• 317. The pattern extracted from the data, in the form of if-then rules 5. Evaluation 317
• 318. • The most influential attributes (factors) are Student Status, IPS2, IPS5 and IPS1 • The attributes with no influence are Name, Gender, Age, IPS6, IPS7 and IPS8 5. Evaluation 318
• 319. • Budi sets up a discipline and mentoring program for students in the early semesters (1-2) and in semester 5, because the factors that most determine graduation lie in those semesters • Budi issues a rule forbidding students from working part-time in the early semesters, because it carries a high risk of late graduation • Budi creates an on-campus part-time work program, so that much campus work can be handled intensively while giving students work experience; most importantly, students do not abandon their studies because of a job • Budi embeds the resulting patterns and model into the academic information system, updated periodically every semester. The academic information system is made intelligent, so that it can automatically email each student an analysis of graduation patterns matched to their profile 6. Deployment 319
• 320. • Study and run experiments on all the case studies in the book Data Mining for the Masses (Matthew North) • Understand how the CRISP-DM method helps us apply data mining methods that better fit the needs of the organization Exercise 320
• 321. • Analyze the problems and needs of an organization around you • Collect and review the available datasets, and connect those problems and needs to the available data (analyze them against the 5 data mining roles) • Where possible, choose several roles at once to process the data, for example: perform association (factor analysis) together with estimation or clustering • Follow the CRISP-DM process to solve the organization's problem with the data obtained • During data preparation, perform data cleaning (replace missing values, replace, filter attributes) so the data is ready for modeling • Also compare algorithms to choose the best one • Summarize everything as slides, following the example of Sarah's case studies, which used data mining to: • Analyze related factors (correlation matrix) • Estimate oil stock requirements (linear regression) Assignment: Solving an Organization's Problem 321
• 322. CRISP-DM Case Study Profiling Corruption Suspects Dataset: KPK LHKPN Data 322
• 323. 1. Predicting the profile of corruption suspects (Classification, Decision Tree) 2. Forecasting the number of mandatory asset reporters in an institution or province (Forecasting, Neural Network) 3. Predicting the recommendation of an LHKPN examination (Classification, Decision Tree) Example Use Cases for LHKPN Data 323
• 324. Predicting the Profile of Corruption Suspects 324
• 325. Profile Pattern of Corruption Suspects 325
• 329. Pattern of LHKPN Examination Recommendations 329
• 330. 3. Data Preparation 3.1 Data Cleaning 3.2 Data Reduction 3.3 Data Transformation and Data Discretization 3.4 Data Integration 330
• 332. Measures for data quality: A multidimensional view • Accuracy: correct or wrong, accurate or not • Completeness: not recorded, unavailable, … • Consistency: some modified but some not, … • Timeliness: timely update? • Believability: how trustable are the data? • Interpretability: how easily can the data be understood? Why Preprocess the Data? 332
  • 333. 1. Data cleaning • Fill in missing values • Smooth noisy data • Identify or remove outliers • Resolve inconsistencies 2. Data reduction • Dimensionality reduction • Numerosity reduction • Data compression 3. Data transformation and data discretization • Normalization • Concept hierarchy generation 4. Data integration • Integration of multiple databases or files Major Tasks in Data Preprocessing 333
  • 334. Data preparation is more than half of every data mining process • Maxim of data mining: most of the effort in a data mining project is spent in data acquisition and preparation, and informal estimates vary from 50 to 80 percent • The purpose of data preparation is: 1. To put the data into a form in which the data mining question can be asked 2. To make it easier for the analytical techniques (such as data mining algorithms) to answer it Data Preparation Law (Data Mining Law 3) 334
• 336. Data in the Real World Is Dirty: Lots of potentially incorrect data, e.g., faulty instruments, human or computer error, transmission errors • Incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data • e.g., Occupation=“ ” (missing data) • Noisy: containing noise, errors, or outliers • e.g., Salary=“−10” (an error) • Inconsistent: containing discrepancies in codes or names • e.g., Age=“42”, Birthday=“03/07/2010” • Was rating “1, 2, 3”, now rating “A, B, C” • Discrepancy between duplicate records • Intentional (e.g., disguised missing data) • Jan. 1 as everyone’s birthday? Data Cleaning 336
• 337. • Data is not always available • E.g., many tuples have no recorded value for several attributes, such as customer income in sales data • Missing data may be due to • equipment malfunction • inconsistency with other recorded data, so the value was deleted • data not entered due to misunderstanding • certain data may not have been considered important at the time of entry • history or changes of the data were not registered • Missing data may need to be inferred Incomplete (Missing) Data 337
  • 339. • Jerry is the marketing manager for a small Internet design and advertising firm • Jerry’s boss asks him to develop a data set containing information about Internet users • The company will use this data to determine what kinds of people are using the Internet and how the firm may be able to market their services to this group of users • To accomplish his assignment, Jerry creates an online survey and places links to the survey on several popular Web sites • Within two weeks, Jerry has collected enough data to begin analysis, but he finds that his data needs to be denormalized • He also notes that some observations in the set are missing values or they appear to contain invalid values • Jerry realizes that some additional work on the data needs to take place before analysis begins. MissingDataSet.csv 339
  • 341. View of Data (Denormalized Data) 341
• 343. • Ignore the tuple: • Usually done when the class label is missing (when doing classification)—not effective when the % of missing values per attribute varies considerably • Fill in the missing value manually: • Tedious + infeasible? • Fill it in automatically with • A global constant: e.g., “unknown”, a new class?! • The attribute mean • The attribute mean for all samples belonging to the same class: smarter • The most probable value: inference-based, such as a Bayesian formula or decision tree How to Handle Missing Data? 343
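The automatic fill-in strategies above can be sketched in plain Python (an illustration only; in the exercises this is what RapidMiner's Replace Missing Values operator does internally — the toy `income`/`label` data below are hypothetical):

```python
# Two fill-in strategies: the attribute mean, and the smarter per-class mean.
# Data below are hypothetical, for illustration only.

def fill_with_mean(values):
    """Replace None with the mean of the known values."""
    known = [v for v in values if v is not None]
    mean = sum(known) / len(known)
    return [mean if v is None else v for v in values]

def fill_with_class_mean(values, labels):
    """Replace None with the attribute mean of samples in the same class."""
    means = {}
    for label in set(labels):
        known = [v for v, l in zip(values, labels) if l == label and v is not None]
        means[label] = sum(known) / len(known)
    return [means[l] if v is None else v for v, l in zip(values, labels)]

income = [30, None, 50, None, 70]
label  = ["Ya", "Ya", "Tidak", "Tidak", "Tidak"]

print(fill_with_mean(income))               # [30, 50.0, 50, 50.0, 70]
print(fill_with_class_mean(income, label))  # [30, 30.0, 50, 60.0, 70]
```

Note how the per-class mean fills the second record with 30.0 (mean of class "Ya") rather than the global mean 50.0.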
  • 344. • Lakukan eksperimen mengikuti buku Matthew North, Data Mining for the Masses 2nd Edition, 2016, Chapter 3 Data Preparation 1. Handling Missing Data, pp. 34-48 (replace) 2. Data Reduction, pp. 48-51 (delete/filter) • Dataset: MissingDataSet.csv • Analisis metode preprocessing apa saja yang digunakan dan mengapa perlu dilakukan pada dataset tersebut? Latihan 344
  • 348. • Noise: random error or variance in a measured variable • Incorrect attribute values may be due to • Faulty data collection instruments • Data entry problems • Data transmission problems • Technology limitation • Inconsistency in naming convention • Other data problems which require data cleaning • Duplicate records • Incomplete data • Inconsistent data Noisy Data 348
  • 349. • Binning • First sort data and partition into (equal-frequency) bins • Then one can smooth by bin means, smooth by bin median, smooth by bin boundaries, etc. • Regression • Smooth by fitting the data into regression functions • Clustering • Detect and remove outliers • Combined computer and human inspection • Detect suspicious values and check by human (e.g., deal with possible outliers) How to Handle Noisy Data? 349
• 350. • Data discrepancy detection • Use metadata (e.g., domain, range, dependency, distribution) • Check field overloading • Check uniqueness rule, consecutive rule and null rule • Use commercial tools • Data scrubbing: use simple domain knowledge (e.g., postal code, spell-check) to detect errors and make corrections • Data auditing: by analyzing data to discover rules and relationships to detect violators (e.g., correlation and clustering to find outliers) • Data migration and integration • Data migration tools: allow transformations to be specified • ETL (Extraction/Transformation/Loading) tools: allow users to specify transformations through a graphical user interface • Integration of the two processes • Iterative and interactive (e.g., Potter’s Wheel) Data Cleaning as a Process 350
• 351. • Lakukan eksperimen mengikuti buku Matthew North, Data Mining for the Masses 2nd Edition, 2016, Chapter 3 Data Preparation, pp. 52-54 (Handling Inconsistent Data) • Dataset: MissingDataSet.csv • Analisis metode preprocessing apa saja yang digunakan dan mengapa perlu dilakukan pada dataset tersebut! Latihan 351
  • 354. • Impor data MissingDataValue-Noisy.csv • Gunakan Regular Expression (operator Replace) untuk mengganti semua noisy data pada atribut nominal menjadi “N” Latihan 354
  • 355. 1. Impor data MissingDataValue-Noisy-Multiple.csv 2. Gunakan operator Replace Missing Value untuk mengisi data kosong 3. Gunakan Regular Expression (operator Replace) untuk mengganti semua noisy data pada atribut nominal menjadi “N” 4. Gunakan operator Map untuk mengganti semua isian Face, FB dan Fesbuk menjadi Facebook Latihan 355
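Langkah Replace (regex) dan Map di atas dapat disketsakan di luar RapidMiner dengan Python (hanya ilustrasi; data dan pola regex di bawah hipotetis, bukan isi dataset aslinya):

```python
import re

# Meniru operator Replace (regex) dan Map dari latihan di atas.
# Data contoh di bawah hipotetis.
online_gamer = ["Y", "N", "99", "??", "Y"]
social_media = ["Facebook", "Face", "FB", "Fesbuk", "Twitter"]

# Replace: nilai nominal selain Y/N dianggap noisy dan diganti "N"
cleaned = [v if re.fullmatch(r"[YN]", v) else "N" for v in online_gamer]

# Map: samakan variasi penulisan menjadi satu nilai kanonik
mapping = {"Face": "Facebook", "FB": "Facebook", "Fesbuk": "Facebook"}
mapped = [mapping.get(v, v) for v in social_media]

print(cleaned)  # ['Y', 'N', 'N', 'N', 'Y']
print(mapped)   # ['Facebook', 'Facebook', 'Facebook', 'Facebook', 'Twitter']
```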
• 357. 1. Impor data MissingDataValue-Noisy-Multiple.csv 2. Operator Replace Missing Value untuk mengisi data kosong 3. Operator Replace untuk mengganti semua noisy data pada atribut nominal menjadi “N” 4. Operator Map untuk mengganti semua isian Face, FB dan Fesbuk menjadi Facebook 357
  • 358. Studi Kasus CRISP-DM Sport Skill – Discriminant Analysis (Matthew North, Data Mining for the Masses 2nd Edition, 2016, Chapter 7 Discriminant Analysis, pp. 123-143) Dataset: SportSkill-Training.csv Dataset: SportSkill-Scoring.csv 358
• 359. • Motivation: • Gill runs a sports academy designed to help high school aged athletes achieve their maximum athletic potential. He focuses on four major sports: Football, Basketball, Baseball and Hockey • He has found that while many high school athletes enjoy participating in a number of sports in high school, as they begin to consider playing a sport at the college level, they would prefer to specialize in one sport • As he’s worked with athletes over the years, Gill has developed an extensive data set, and he now is wondering if he can use past performance from some of his previous clients to predict prime sports for up-and-coming high school athletes • By evaluating each athlete’s performance across a battery of tests, Gill hopes we can help him figure out for which sport each athlete has the highest aptitude • Objective: • Ultimately, he hopes he can make a recommendation to each athlete as to the sport in which they should most likely choose to specialize 1. Business Understanding 359
• 360. • Every athlete that has enrolled at Gill’s academy over the past several years has taken a battery of tests, which measured a number of athletic and personal traits • Because the academy has been operating for some time, Gill has the benefit of knowing which of his former pupils have gone on to specialize in a single sport, and which sport it was for each of them 2. Data Understanding 360
  • 361. • Working with Gill, we gather the results of the batteries for all former clients who have gone on to specialize • Gill adds the sport each person specialized in, and we have a data set comprised of 493 observations containing the following attributes: 1. Age: .... 2. Strength: .... 3. Quickness: .... 4. Injury: .... 5. Vision: .... 6. Endurance: .... 7. Agility: .... 8. Decision Making: .... 9. Prime Sport: .... 2. Data Understanding 361
  • 362. • Filter Examples: attribute value filter • Decision_Making>=3 • Decision_Making<=100 • Deleted Records= 493-482=11 3. Data Preparation 362
  • 363. 1. Lakukan training pada data SportSkill- Training.csv dengan menggunakan C4.5, NB, K-NN dan LDA 2. Lakukan pengujian dengan menggunakan 10-fold X Validation 3. Uji beda dengan t-Test untuk mendapatkan model terbaik 4. Simpan model terbaik dari komparasi di atas dengan operator Write Model, dan kemudian Apply Model pada dataset SportSkill-Scoring.csv Latihan 363
• 364. Hasil komparasi model: DT, NB, k-NN, LDA 364
• 368. • Data Reduction • Obtain a reduced representation of the data set that is much smaller in volume yet produces the same analytical results • Why Data Reduction? • A database/data warehouse may store terabytes of data • Complex data analysis takes a very long time to run on the complete dataset • Data Reduction Methods 1. Dimensionality Reduction 1. Feature Extraction 2. Feature Selection 1. Filter Approach 2. Wrapper Approach 3. Embedded Approach 2. Numerosity Reduction (Data Reduction) • Regression and Log-Linear Models • Histograms, clustering, sampling Data Reduction Methods 368
• 369. • Curse of dimensionality • When dimensionality increases, data becomes increasingly sparse • Density and distance between points, which are critical to clustering and outlier analysis, become less meaningful • The possible combinations of subspaces will grow exponentially • Dimensionality reduction • Avoid the curse of dimensionality • Help eliminate irrelevant features and reduce noise • Reduce time and space required in data mining • Allow easier visualization • Dimensionality Reduction Methods: 1. Feature Extraction: Wavelet transforms, Principal Component Analysis (PCA) 2. Feature Selection: Filter, Wrapper, Embedded 1. Dimensionality Reduction 369
  • 370. • Given N data vectors from n-dimensions, find k ≤ n orthogonal vectors (principal components) that can be best used to represent data 1. Normalize input data: Each attribute falls within the same range 2. Compute k orthonormal (unit) vectors, i.e., principal components 3. Each input data (vector) is a linear combination of the k principal component vectors 4. The principal components are sorted in order of decreasing “significance” or strength 5. Since the components are sorted, the size of the data can be reduced by eliminating the weak components, i.e., those with low variance • Works for numeric data only Principal Component Analysis (Steps) 370
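The five steps above can be sketched with NumPy (a generic illustration, not tied to any dataset in the exercises; the toy matrix `X` below is hypothetical):

```python
import numpy as np

# A sketch of the PCA steps above on a tiny hypothetical 2-D dataset.
X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9],
              [1.9, 2.2], [3.1, 3.0], [2.3, 2.7]])

# Step 1: normalize (here: center each attribute on its mean)
Xc = X - X.mean(axis=0)

# Step 2: orthonormal principal components = eigenvectors of the covariance matrix
eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))

# Step 4: sort components by decreasing "significance" (variance)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Steps 3+5: express each record as a linear combination of the k strongest
# components only, discarding the weak (low-variance) ones
k = 1
Z = Xc @ eigvecs[:, :k]
print(Z.shape)  # (6, 1): the 2-D data reduced to one dimension
```

In the RapidMiner exercises that follow, the PCA operator performs these same steps internally.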
  • 371. • Lakukan eksperimen mengikuti buku Markus Hofmann (Rapid Miner - Data Mining Use Case) Chapter 4 (k-Nearest Neighbor Classification II) pp. 45-51 • Dataset: glass.data • Analisis metode preprocessing apa saja yang digunakan dan mengapa perlu dilakukan pada dataset tersebut! • Bandingkan akurasi dari k-NN dan PCA+k-NN Latihan 371
  • 374. Data Awal Sebelum PCA 374
  • 376. • Susun ulang proses yang mengkomparasi model yang dihasilkan oleh k-NN dan PCA + k-NN • Gunakan 10 Fold X Validation Latihan 376
• 378. • Review operator apa saja yang bisa digunakan untuk feature extraction • Ganti PCA dengan metode feature extraction yang lain • Lakukan komparasi dan tentukan mana metode feature extraction terbaik untuk data Glass.data, gunakan 10-fold cross validation Latihan 378
  • 379. • Another way to reduce dimensionality of data • Redundant attributes • Duplicate much or all of the information contained in one or more other attributes • E.g., purchase price of a product and the amount of sales tax paid • Irrelevant attributes • Contain no information that is useful for the data mining task at hand • E.g., students' ID is often irrelevant to the task of predicting students' GPA Feature/Attribute Selection 379
  • 380. A number of proposed approaches for feature selection can broadly be categorized into the following three classifications: wrapper, filter, and embedded (Liu & Tu, 2004) 1. In the filter approach, statistical analysis of the feature set is required, without utilizing any learning model (Dash & Liu, 1997) 2. In the wrapper approach, a predetermined learning model is assumed, wherein features are selected that justify the learning performance of the particular learning model (Guyon & Elisseeff, 2003) 3. The embedded approach attempts to utilize the complementary strengths of the wrapper and filter approaches (Huang, Cai, & Xu, 2007) Feature Selection Approach 380
  • 381. Wrapper Approach vs Filter Approach 381 Wrapper Approach Filter Approach
• 382. 1. Filter Approach: • information gain • chi square • log likelihood ratio • etc 2. Wrapper Approach: • forward selection • backward elimination • randomized hill climbing • etc 3. Embedded Approach: • decision tree • weighted naïve Bayes • etc Feature Selection Approach 382
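A minimal sketch of the filter approach: ranking attributes by information gain uses only statistics of the feature set, with no learning model involved. The toy `lulus`/`gender`/`kerja` data below are hypothetical:

```python
from math import log2
from collections import Counter

# Filter-approach sketch: rank attributes by information gain.
# Toy data below are hypothetical.

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def info_gain(attribute, labels):
    """Entropy of the labels minus the entropy remaining after the split."""
    n = len(labels)
    remainder = 0.0
    for value in set(attribute):
        subset = [l for a, l in zip(attribute, labels) if a == value]
        remainder += len(subset) / n * entropy(subset)
    return entropy(labels) - remainder

lulus  = ["Ya", "Ya", "Tidak", "Tidak"]
gender = ["L", "P", "L", "P"]              # uninformative in this toy data
kerja  = ["Tidak", "Tidak", "Ya", "Ya"]    # perfectly informative here

print(info_gain(gender, lulus))  # 0.0 -> candidate to drop
print(info_gain(kerja, lulus))   # 1.0 -> keep
```

A filter then keeps only the top-ranked attributes; this corresponds to RapidMiner's Weight by Information Gain followed by attribute selection.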
  • 383. • Lakukan eksperimen mengikuti buku Markus Hofmann (Rapid Miner - Data Mining Use Case) Chapter 4 (k-Nearest Neighbor Classification II) • Ganti PCA dengan metode feature selection (filter), misalnya: • Information Gain • Chi Squared • etc • Cek di RapidMiner, operator apa saja yang bisa digunakan untuk mengurangi atau membobot atribute dari dataset! Latihan 383
  • 386. • Lakukan eksperimen mengikuti buku Markus Hofmann (Rapid Miner - Data Mining Use Case) Chapter 4 (k-Nearest Neighbor Classification II) • Ganti PCA dengan metode feature selection (wrapper), misalnya: • Backward Elimination • Forward Selection • etc • Ganti metode validasi dengan 10-Fold X Validation • Bandingkan akurasi dari k-NN dan BE+k-NN or FS+k-NN Latihan 386
  • 391. Hasil Komparasi Akurasi dan Signifikansi t-Test 391 k-NN k-NN+PCA k-NN+ICA k-NN+IG k-NN+IGR kNN + FS k-NN+BE Accuracy AUC
  • 392. 1. Lakukan training pada data mahasiswa (datakelulusanmahasiswa.xls) dengan menggunakan 3 algoritma klasifikasi (DT, NB, k-NN) 2. Analisis dan komparasi, mana algoritma klasifikasi yang menghasilkan model paling akurat (AK) 3. Lakukan feature selection dengan Information Gain (Filter), Forward Selection, Backward Elimination (Wrapper) untuk model yang paling akurat 4. Analisis dan komparasi, mana algoritma feature selection yang menghasilkan model paling akurat 5. Lakukan pengujian dengan menggunakan 10-fold X Validation Latihan: Prediksi Kelulusan Mahasiswa 392 AK AK+IG AK+FS AK+BE Accuracy 91.55 92.10 91.82 AUC 0.909 0.920 0.917
• 393. 1. Lakukan training pada data mahasiswa (datakelulusanmahasiswa.xls) dengan menggunakan algoritma klasifikasi DT 2. Lakukan feature selection dengan Forward Selection untuk algoritma DT (DT+FS) 3. Lakukan feature selection dengan Backward Elimination untuk algoritma DT (DT+BE) 4. Lakukan pengujian dengan menggunakan 10-fold X Validation 5. Uji beda dengan t-Test untuk mendapatkan model terbaik (DT vs DT+FS vs DT+BE) Latihan: Prediksi Kelulusan Mahasiswa 393 DT DT+FS DT+BE Accuracy 91.55 92.10 91.82 AUC 0.909 0.920 0.917
  • 394. 394 DT DT+FS DT+BE Accuracy 91.55 92.10 91.82 AUC 0.909 0.920 0.917 no significant difference
  • 395. 1. Lakukan komparasi algoritma pada data pemilu (datapemilukpu.xls), sehingga didapatkan algoritma terbaik 2. Ambil algoritma terbaik dari langkah 1, kemudian lakukan feature selection dengan Forward Selection dan Backward Elimination 3. Tentukan kombinasi algoritma dan feature selection apa yang memiliki performa terbaik 4. Lakukan pengujian dengan menggunakan 10-fold X Validation 5. Uji beda dengan t-Test untuk mendapatkan model terbaik Latihan: Prediksi Elektabilitas Pemilu 395 A A + FS A + BE Accuracy AUC DT NB K-NN Accuracy AUC
  • 396. 1. Lakukan training pada data mahasiswa (datakelulusanmahasiswa.xls) dengan menggunakan DT, NB, K-NN 2. Lakukan dimension reduction dengan Forward Selection untuk ketiga algoritma di atas 3. Lakukan pengujian dengan menggunakan 10-fold X Validation 4. Uji beda dengan t-Test untuk mendapatkan model terbaik Latihan: Prediksi Kelulusan Mahasiswa 396 DT NB K-NN DT+FS NB+FS K-NN+FS Accuracy AUC
  • 397. There is No Free Lunch for the Data Miner (NFL-DM) The right model for a given application can only be discovered by experiment • Axiom of machine learning: if we knew enough about a problem space, we could choose or design an algorithm to find optimal solutions in that problem space with maximal efficiency • Arguments for the superiority of one algorithm over others in data mining rest on the idea that data mining problem spaces have one particular set of properties, or that these properties can be discovered by analysis and built into the algorithm • However, these views arise from the erroneous idea that, in data mining, the data miner formulates the problem and the algorithm finds the solution • In fact, the data miner both formulates the problem and finds the solution – the algorithm is merely a tool which the data miner uses to assist with certain steps in this process No Free Lunch Theory (Data Mining Law 4) 397
• 398. Reduce data volume by choosing alternative, smaller forms of data representation 1. Parametric methods (e.g., regression) • Assume the data fits some model, estimate model parameters, store only the parameters, and discard the data (except possible outliers) • Ex.: Log-linear models—obtain the value at a point in m-D space as the product of appropriate marginal subspaces 2. Non-parametric methods • Do not assume models • Major families: histograms, clustering, sampling, … 2. Numerosity Reduction 398
  • 400. • Linear regression • Data modeled to fit a straight line • Often uses the least-square method to fit the line • Multiple regression • Allows a response variable Y to be modeled as a linear function of multidimensional feature vector • Log-linear model • Approximates discrete multidimensional probability distributions Parametric Data Reduction: Regression and Log-Linear Models 400
• 401. • Regression analysis: A collective name for techniques for the modeling and analysis of numerical data consisting of values of a dependent variable (also called response variable or measurement) and of one or more independent variables (aka. explanatory variables or predictors) • The parameters are estimated so as to give a "best fit" of the data • Most commonly the best fit is evaluated by using the least squares method, but other criteria have also been used • Used for prediction (including forecasting of time-series data), inference, hypothesis testing, and modeling of causal relationships Regression Analysis 401 (figure: scatter plot of points (xi, yi) fitted by the regression line y = x + 1)
• 402. • Linear regression: Y = w X + b • Two regression coefficients, w and b, specify the line and are to be estimated by using the data at hand • Apply the least squares criterion to the known values of Y1, Y2, …, X1, X2, … • Multiple regression: Y = b0 + b1 X1 + b2 X2 • Many nonlinear functions can be transformed into the above • Log-linear models: • Approximate discrete multidimensional probability distributions • Estimate the probability of each point (tuple) in a multi-dimensional space for a set of discretized attributes, based on a smaller subset of dimensional combinations • Useful for dimensionality reduction and data smoothing Regression Analysis and Log-Linear Models 402
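The least-squares estimates of the two coefficients w and b can be computed in closed form (toy data below, hypothetical):

```python
# Closed-form least-squares fit of Y = w*X + b (toy data).
X = [1, 2, 3, 4, 5]
Y = [2.1, 3.9, 6.0, 8.1, 9.9]

n = len(X)
mx, my = sum(X) / n, sum(Y) / n
w = sum((x - mx) * (y - my) for x, y in zip(X, Y)) / sum((x - mx) ** 2 for x in X)
b = my - w * mx  # the fitted line passes through (mean X, mean Y)
print(round(w, 2), round(b, 2))  # 1.98 0.06
```

Only w and b need to be stored — the original (X, Y) pairs can be discarded, which is exactly the numerosity-reduction idea of the parametric methods above.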
• 403. • Divide data into buckets and store the average (sum) for each bucket • Partitioning rules: • Equal-width: equal bucket range • Equal-frequency (or equal-depth) Histogram Analysis 403 (figure: equal-width histogram over price buckets 10,000–90,000)
  • 404. • Partition data set into clusters based on similarity, and store cluster representation (e.g., centroid and diameter) only • Can be very effective if data is clustered but not if data is “smeared” • Can have hierarchical clustering and be stored in multi-dimensional index tree structures • There are many choices of clustering definitions and clustering algorithms Clustering 404
  • 405. • Sampling: obtaining a small sample s to represent the whole data set N • Allow a mining algorithm to run in complexity that is potentially sub-linear to the size of the data • Key principle: Choose a representative subset of the data • Simple random sampling may have very poor performance in the presence of skew • Develop adaptive sampling methods, e.g., stratified sampling • Note: Sampling may not reduce database I/Os (page at a time) Sampling 405
  • 406. • Simple random sampling • There is an equal probability of selecting any particular item • Sampling without replacement • Once an object is selected, it is removed from the population • Sampling with replacement • A selected object is not removed from the population • Stratified sampling • Partition the data set, and draw samples from each partition (proportionally, i.e., approximately the same percentage of the data) • Used in conjunction with skewed data Types of Sampling 406
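The sampling variants above can be sketched with the standard library (toy population below, hypothetical; the 80/20 strata mimic skewed data):

```python
import random

# Simple random sampling, with/without replacement, and stratified sampling.
random.seed(42)  # for reproducibility
population = list(range(100))

# Without replacement: once selected, an object cannot be drawn again
srswor = random.sample(population, 10)

# With replacement: a selected object stays in the population
srswr = [random.choice(population) for _ in range(10)]

# Stratified: partition the data, then draw proportionally from each stratum
strata = {"A": list(range(80)), "B": list(range(80, 100))}  # skewed 80/20
stratified = []
for name, members in strata.items():
    k = round(len(members) * 10 / 100)  # proportional allocation
    stratified += random.sample(members, k)

print(len(srswor), len(set(srswor)))  # 10 10  (no duplicates possible)
print(len(stratified))                # 10     (8 from A, 2 from B)
```

Note how the stratified sample keeps the skewed 80/20 proportions, which simple random sampling does not guarantee.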
  • 407. Sampling: With or without Replacement 407 Raw Data
  • 408. Sampling: Cluster or Stratified Sampling 408 Raw Data Cluster/Stratified Sample
  • 409. • Stratification is the process of dividing members of the population into homogeneous subgroups before sampling • Suppose that in a company there are the following staff: • Male, full-time: 90 • Male, part-time: 18 • Female, full-time: 9 • Female, part-time: 63 • Total: 180 • We are asked to take a sample of 40 staff, stratified according to the above categories • An easy way to calculate the percentage is to multiply each group size by the sample size and divide by the total population: • Male, full-time = 90 × (40 ÷ 180) = 20 • Male, part-time = 18 × (40 ÷ 180) = 4 • Female, full-time = 9 × (40 ÷ 180) = 2 • Female, part-time = 63 × (40 ÷ 180) = 14 Stratified Sampling 409
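The allocation above (group size × sample size ÷ total population) can be checked directly:

```python
# Proportional stratified allocation from the staff example above.
groups = {"Male, full-time": 90, "Male, part-time": 18,
          "Female, full-time": 9, "Female, part-time": 63}
sample_size = 40
total = sum(groups.values())  # 180

allocation = {g: size * sample_size // total for g, size in groups.items()}
print(allocation)
# {'Male, full-time': 20, 'Male, part-time': 4,
#  'Female, full-time': 2, 'Female, part-time': 14}
```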
  • 410. • Lakukan eksperimen mengikuti buku Matthew North, Data Mining for the Masses 2nd Edition, 2016, Chapter 7 Discriminant Analysis, pp. 125-143 • Datasets: • SportSkill-Training.csv • SportSkill-Scoring.csv • Analisis metode preprocessing apa saja yang digunakan dan mengapa perlu dilakukan pada dataset tersebut! Latihan 410
  • 411. 3.3 Data Transformation and Data Discretization 411
  • 412. • A function that maps the entire set of values of a given attribute to a new set of replacement values • Each old value can be identified with one of the new values • Methods: • Smoothing: Remove noise from data • Attribute/feature construction • New attributes constructed from the given ones • Aggregation: Summarization, data cube construction • Normalization: Scaled to fall within a smaller, specified range • min-max normalization • z-score normalization • normalization by decimal scaling • Discretization: Concept hierarchy climbing Data Transformation 412
• 413. • Min-max normalization: to [new_minA, new_maxA]: v′ = ((v − minA) / (maxA − minA)) × (new_maxA − new_minA) + new_minA • Ex. Let income range $12,000 to $98,000 be normalized to [0.0, 1.0]. Then $73,600 is mapped to ((73,600 − 12,000) / (98,000 − 12,000)) × (1.0 − 0) + 0 = 0.716 • Z-score normalization (μ: mean, σ: standard deviation): v′ = (v − μA) / σA • Ex. Let μ = 54,000, σ = 16,000. Then (73,600 − 54,000) / 16,000 = 1.225 • Normalization by decimal scaling: v′ = v / 10^j, where j is the smallest integer such that Max(|v′|) < 1 Normalization 413
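The three normalization methods above, applied to the income example ($73,600 in the range $12,000–$98,000, μ = 54,000, σ = 16,000):

```python
# Min-max, z-score, and decimal-scaling normalization of one income value.
v = 73600.0

# Min-max to [0.0, 1.0]
min_a, max_a = 12000.0, 98000.0
v_minmax = (v - min_a) / (max_a - min_a) * (1.0 - 0.0) + 0.0
print(round(v_minmax, 3))  # 0.716

# Z-score
mu, sigma = 54000.0, 16000.0
v_z = (v - mu) / sigma
print(round(v_z, 3))  # 1.225

# Decimal scaling: j is the smallest integer with max(|v'|) < 1
j = 0
while abs(v) / 10 ** j >= 1:
    j += 1
print(v / 10 ** j)  # 0.736  (j = 5)
```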
  • 414. • Three types of attributes • Nominal —values from an unordered set, e.g., color, profession • Ordinal —values from an ordered set, e.g., military or academic rank • Numeric —real numbers, e.g., integer or real numbers • Discretization: Divide the range of a continuous attribute into intervals • Interval labels can then be used to replace actual data values • Reduce data size by discretization • Supervised vs. unsupervised • Split (top-down) vs. merge (bottom-up) • Discretization can be performed recursively on an attribute • Prepare for further analysis, e.g., classification Discretization 414
• 415. Typical methods: All the methods can be applied recursively • Binning: Top-down split, unsupervised • Histogram analysis: Top-down split, unsupervised • Clustering analysis: Unsupervised, top-down split or bottom-up merge • Decision-tree analysis: Supervised, top-down split • Correlation (e.g., χ²) analysis: Unsupervised, bottom-up merge Data Discretization Methods 415
  • 416. • Equal-width (distance) partitioning • Divides the range into N intervals of equal size: uniform grid • if A and B are the lowest and highest values of the attribute, the width of intervals will be: W = (B –A)/N. • The most straightforward, but outliers may dominate presentation • Skewed data is not handled well • Equal-depth (frequency) partitioning • Divides the range into N intervals, each containing approximately same number of samples • Good data scaling • Managing categorical attributes can be tricky Simple Discretization: Binning 416
  • 417. Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34 • Partition into equal-frequency (equi-depth) bins: • Bin 1: 4, 8, 9, 15 • Bin 2: 21, 21, 24, 25 • Bin 3: 26, 28, 29, 34 • Smoothing by bin means: • Bin 1: 9, 9, 9, 9 • Bin 2: 23, 23, 23, 23 • Bin 3: 29, 29, 29, 29 • Smoothing by bin boundaries: • Bin 1: 4, 4, 4, 15 • Bin 2: 21, 21, 25, 25 • Bin 3: 26, 26, 26, 34 Binning Methods for Data Smoothing 417
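The worked example above can be reproduced in a few lines (note that the slide's bin means are rounded: 22.75 → 23 and 29.25 → 29; ties in the boundary method are broken toward the lower boundary here, an assumption of this sketch):

```python
# Equal-frequency binning of the price data, then two smoothing methods.
prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]  # already sorted
bins = [prices[i:i + 4] for i in range(0, len(prices), 4)]

# Smoothing by bin means (rounded, as on the slide)
by_means = [[round(sum(b) / len(b))] * len(b) for b in bins]

# Smoothing by bin boundaries: each value snaps to the nearer bin edge
by_bounds = [[b[0] if x - b[0] <= b[-1] - x else b[-1] for x in b] for b in bins]

print(by_means)   # [[9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]]
print(by_bounds)  # [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]
```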
  • 418. Discretization Without Using Class Labels (Binning vs. Clustering) 418 Data Equal interval width (binning) Equal frequency (binning) K-means clustering leads to better results
  • 419. • Classification (e.g., decision tree analysis) • Supervised: Given class labels, e.g., cancerous vs. benign • Using entropy to determine split point (discretization point) • Top-down, recursive split • Correlation analysis (e.g., Chi-merge: χ2-based discretization) • Supervised: use class information • Bottom-up merge: find the best neighboring intervals (those having similar distributions of classes, i.e., low χ2 values) to merge • Merge performed recursively, until a predefined stopping condition Discretization by Classification & Correlation Analysis 419
  • 420. • Lakukan eksperimen mengikuti buku Markus Hofmann (Rapid Miner - Data Mining Use Case) Chapter 5 (Naïve Bayes Classification I) • Dataset: crx.data • Analisis metode preprocessing apa saja yang digunakan dan mengapa perlu dilakukan pada dataset tersebut! • Bandingkan akurasi model apabila tidak menggunakan filter dan diskretisasi • Bandingkan pula apabila digunakan feature selection (wrapper) dengan Backward Elimination Latihan 420
  • 421. 421
• 424. • Data integration: • Combines data from multiple sources into a coherent store • Schema Integration: e.g., A.cust-id ≡ B.cust-# • Integrate metadata from different sources • Entity Identification Problem: • Identify real world entities from multiple data sources, e.g., Bill Clinton = William Clinton • Detecting and Resolving Data Value Conflicts • For the same real world entity, attribute values from different sources are different • Possible reasons: different representations, different scales, e.g., metric vs. British units Data Integration 424
  • 425. • Redundant data occur often when integration of multiple databases • Object identification: The same attribute or object may have different names in different databases • Derivable data: One attribute may be a “derived” attribute in another table, e.g., annual revenue • Redundant attributes may be able to be detected by correlation analysis and covariance analysis • Careful integration of the data from multiple sources may help reduce/avoid redundancies and inconsistencies and improve mining speed and quality Handling Redundancy in Data Integration 425
• 426. • Χ2 (chi-square) test: χ² = Σ (Observed − Expected)² / Expected • The larger the Χ2 value, the more likely the variables are related • The cells that contribute the most to the Χ2 value are those whose actual count is very different from the expected count • Correlation does not imply causality • # of hospitals and # of car-theft in a city are correlated • Both are causally linked to the third variable: population Correlation Analysis (Nominal Data) 426
• 427. • Χ2 (chi-square) calculation (numbers in parentheses are expected counts calculated based on the data distribution in the two categories): χ² = (250 − 90)²/90 + (50 − 210)²/210 + (200 − 360)²/360 + (1000 − 840)²/840 = 507.93 • It shows that like_science_fiction and play_chess are correlated in the group Chi-Square Calculation: An Example 427 Play chess Not play chess Sum (row) Like science fiction 250(90) 200(360) 450 Not like science fiction 50(210) 1000(840) 1050 Sum(col.) 300 1200 1500
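The chi-square statistic from the contingency table above, computed term by term:

```python
# Observed vs expected counts, cells ordered:
# [like/play, like/not-play, not-like/play, not-like/not-play]
observed = [250, 200, 50, 1000]
expected = [90, 360, 210, 840]  # e.g. 450 * 300 / 1500 = 90, etc.

chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
print(round(chi2, 2))  # 507.94 (the slide truncates this to 507.93)
```

The first cell alone contributes (250 − 90)²/90 ≈ 284, illustrating the point above that cells far from their expected count dominate the statistic.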
• 428. • Correlation coefficient (also called Pearson’s product moment coefficient): rA,B = Σ(ai − Ā)(bi − B̄) / ((n − 1) σA σB) = (Σ(ai bi) − n Ā B̄) / ((n − 1) σA σB), where n is the number of tuples, Ā and B̄ are the respective means of A and B, σA and σB are the respective standard deviations of A and B, and Σ(ai bi) is the sum of the AB cross-product • If rA,B > 0, A and B are positively correlated (A’s values increase as B’s). The higher, the stronger the correlation • rA,B = 0: independent; rA,B < 0: negatively correlated Correlation Analysis (Numeric Data) 428
  • 429. Visually Evaluating Correlation 429 Scatter plots showing the similarity from –1 to 1
• 430. • Correlation measures the linear relationship between objects • To compute correlation, we standardize the data objects A and B, and then take their dot product: a′k = (ak − mean(A)) / std(A), b′k = (bk − mean(B)) / std(B), correlation(A, B) = A′ • B′ Correlation 430
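The standardize-then-dot-product recipe above, applied to the stock prices A and B from the covariance example (slide 432). With the sample standard deviation, the dot product is divided by n − 1 to land on Pearson's r:

```python
from math import sqrt

# Correlation as the dot product of standardized objects.
A = [2, 3, 5, 4, 6]
B = [5, 8, 10, 11, 14]
n = len(A)

def standardize(X):
    m = sum(X) / n
    sd = sqrt(sum((x - m) ** 2 for x in X) / (n - 1))  # sample std. dev.
    return [(x - m) / sd for x in X]

A2, B2 = standardize(A), standardize(B)
r = sum(a * b for a, b in zip(A2, B2)) / (n - 1)  # dot product, scaled
print(round(r, 3))  # 0.941 -> strongly positive, consistent with Cov(A,B) > 0
```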
• 431. • Covariance is similar to correlation: Cov(A, B) = E((A − Ā)(B − B̄)) = Σ(ai − Ā)(bi − B̄) / n, where n is the number of tuples, Ā and B̄ are the respective mean or expected values of A and B, and σA and σB are the respective standard deviations of A and B • Correlation coefficient: rA,B = Cov(A, B) / (σA σB) • Positive covariance: If CovA,B > 0, then A and B both tend to be larger than their expected values • Negative covariance: If CovA,B < 0 then if A is larger than its expected value, B is likely to be smaller than its expected value • Independence: CovA,B = 0 but the converse is not true: • Some pairs of random variables may have a covariance of 0 but are not independent. Only under some additional assumptions (e.g., the data follow multivariate normal distributions) does a covariance of 0 imply independence Covariance (Numeric Data) 431
• 432. • It can be simplified in computation as Cov(A, B) = E(A·B) − Ā·B̄ • Suppose two stocks A and B have the following values in one week: (2, 5), (3, 8), (5, 10), (4, 11), (6, 14). • Question: If the stocks are affected by the same industry trends, will their prices rise or fall together? • E(A) = (2 + 3 + 5 + 4 + 6)/ 5 = 20/5 = 4 • E(B) = (5 + 8 + 10 + 11 + 14) /5 = 48/5 = 9.6 • Cov(A,B) = (2×5+3×8+5×10+4×11+6×14)/5 − 4 × 9.6 = 4 • Thus, A and B rise together since Cov(A, B) > 0 Covariance: An Example 432
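The stock example above, computed with the simplified form Cov(A, B) = E(A·B) − E(A)·E(B):

```python
# Covariance of the two stock-price series from the example above.
A = [2, 3, 5, 4, 6]     # stock A, one value per day
B = [5, 8, 10, 11, 14]  # stock B
n = len(A)

mean_a, mean_b = sum(A) / n, sum(B) / n              # 4.0 and 9.6
cov = sum(a * b for a, b in zip(A, B)) / n - mean_a * mean_b
print(cov)  # 4.0 > 0, so the two stocks tend to rise together
```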
• 433. 1. Data quality: accuracy, completeness, consistency, timeliness, believability, interpretability 2. Data cleaning: e.g. missing/noisy values, outliers 3. Data reduction • Dimensionality reduction • Numerosity reduction 4. Data transformation and data discretization • Normalization 5. Data integration from multiple sources: • Entity identification problem • Remove redundancies • Detect inconsistencies Summary 433
• 434. • Write a scientific paper based on the slides (ppt) you have created, using the template at http://journal.ilmukomputer.org • The paper structure follows the format below: 1. Introduction • Background of the problem and objectives 2. Related Work • Other research that has done something similar to ours 3. Research Method • How we analyze the data; explain that we use CRISP-DM 4. Results and Discussion • 4.1 Business Understanding • 4.2 Data Understanding • 4.3 Data Preparation • 4.4 Modeling • 4.5 Evaluation • 4.6 Deployment 5. Conclusion • The conclusion must match the objectives 6. References • List the references used Assignment: Writing a Scientific Paper 434
• 435. • Analyze the problems and needs of an organization in your environment • Collect and review the available datasets, and connect those problems and needs to the available data (analyze them against the 5 data mining roles) • Where possible, choose several roles at once to process the data, e.g. perform association (factor analysis) together with estimation or clustering • Carry out the CRISP-DM process to solve the organization's problem using the data obtained • In the data preparation step, perform data cleaning (replace missing values, replace, filter attributes) so the data is ready for modeling • Also compare algorithms and apply feature selection to choose the best pattern and model • Summarize the evaluation of the resulting pattern/model/knowledge and relate the evaluation results to the deployment performed • Summarize everything in slides, following Sarah's case study for the marketing department Assignment: Solving an Organization's Problem 435
• 436. 4. Data Mining Algorithms 4.1 Classification Algorithms 4.2 Clustering Algorithms 4.3 Association Algorithms 4.4 Estimation and Forecasting Algorithms 436
• 439. • Basic algorithm (a greedy algorithm) 1. The tree is constructed in a top-down recursive divide-and-conquer manner 2. At the start, all the training examples are at the root 3. Attributes are categorical (if continuous-valued, they are discretized in advance) 4. Examples are partitioned recursively based on selected attributes 5. Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain, gain ratio, gini index) • Conditions for stopping partitioning • All samples for a given node belong to the same class • There are no remaining attributes for further partitioning: majority voting is employed for classifying the leaf • There are no samples left Algorithm for Decision Tree Induction 439
• 440. Brief Review of Entropy 440 (figure: entropy curve for two classes, m = 2)
• 441. • Select the attribute with the highest information gain • Let p_i be the probability that an arbitrary tuple in D belongs to class C_i, estimated by |C_i,D|/|D| • Expected information (entropy) needed to classify a tuple in D: Info(D) = −Σ_{i=1..m} p_i · log2(p_i) • Information needed (after using A to split D into v partitions) to classify D: Info_A(D) = Σ_{j=1..v} (|D_j|/|D|) × Info(D_j) • Information gained by branching on attribute A: Gain(A) = Info(D) − Info_A(D) Attribute Selection Measure: Information Gain (ID3) 441
• 442. Attribute Selection: Information Gain 442 • Class P: buys_computer = “yes” • Class N: buys_computer = “no” Info(D) = I(9, 5) = −(9/14) log2(9/14) − (5/14) log2(5/14) = 0.940 • (5/14) I(2,3) means “age <=30” has 5 out of 14 samples, with 2 yes’es and 3 no’s. Hence Info_age(D) = (5/14) I(2,3) + (4/14) I(4,0) + (5/14) I(3,2) = 0.694 and Gain(age) = Info(D) − Info_age(D) = 0.246 Similarly, Gain(income) = 0.029, Gain(student) = 0.151, Gain(credit_rating) = 0.048 age pi ni I(pi, ni) <=30 2 3 0.971 31…40 4 0 0 >40 3 2 0.971 age income student credit_rating buys_computer <=30 high no fair no <=30 high no excellent no 31…40 high no fair yes >40 medium no fair yes >40 low yes fair yes >40 low yes excellent no 31…40 low yes excellent yes <=30 medium no fair no <=30 low yes fair yes >40 medium yes fair yes <=30 medium yes excellent yes 31…40 medium no excellent yes 31…40 high yes fair yes >40 medium no excellent no
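The entropy and gain computation above can be sketched directly from the class counts:

```python
import math

def entropy(counts):
    """Info(D) = -sum p_i * log2(p_i) over the class counts."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

# buys_computer: 9 "yes" and 5 "no" overall
info_d = entropy([9, 5])                              # ~0.940

# partitions of D by age, as (yes, no) counts: <=30, 31..40, >40
age_parts = [(2, 3), (4, 0), (3, 2)]
info_age = sum(sum(p) / 14 * entropy(p) for p in age_parts)   # ~0.694

gain_age = info_d - info_age                          # ~0.246
```

The result matches the slide: Gain(age) ≈ 0.246, the highest gain, so age is chosen as the split attribute.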
  • 443. • Let attribute A be a continuous-valued attribute • Must determine the best split point for A • Sort the value A in increasing order • Typically, the midpoint between each pair of adjacent values is considered as a possible split point • (ai+ai+1)/2 is the midpoint between the values of ai and ai+1 • The point with the minimum expected information requirement for A is selected as the split-point for A • Split: • D1 is the set of tuples in D satisfying A ≤ split-point, and D2 is the set of tuples in D satisfying A > split-point Computing Information-Gain for Continuous- Valued Attributes 443
• 444. 1. Prepare the training data 2. Choose an attribute as the root 3. Create a branch for each attribute value 4. Repeat the process for each branch until all cases in the branch have the same class Steps of the Decision Tree Algorithm (ID3) 444 Entropy(S) = Σ_{i=1..n} −p_i · log2(p_i), Gain(S, A) = Entropy(S) − Σ_{i=1..n} (|S_i|/|S|) · Entropy(S_i)
• 445. 1. Prepare the training data 445
• 446. • To choose the root attribute, pick the attribute with the highest Gain among the available attributes. To obtain the Gain values, the Entropy values must be computed first • Entropy formula: Entropy(S) = Σ_{i=1..n} −p_i · log2(p_i), where • S = the set of cases • n = the number of partitions of S • p_i = the proportion of S_i with respect to S • Gain formula: Gain(S, A) = Entropy(S) − Σ_{i=1..n} (|S_i|/|S|) · Entropy(S_i), where • S = the set of cases • A = an attribute • n = the number of partitions of attribute A • |S_i| = the number of cases in partition i • |S| = the number of cases in S 2. Choose an attribute as the root 446
• 447. Computing the Entropy and Gain of the Root 447
• 448. • Total Entropy • Entropy (Outlook) • Entropy (Temperature) • Entropy (Humidity) • Entropy (Windy) Computing the Root Entropy 448
• 449. Computing the Root Entropy 449 NODE ATTRIBUTE CASES (S) YES (Si) NO (Si) ENTROPY GAIN 1 TOTAL 14 10 4 0.86312 OUTLOOK CLOUDY 4 4 0 0 RAINY 5 4 1 0.72193 SUNNY 5 2 3 0.97095 TEMPERATURE COOL 4 0 4 0 HOT 4 2 2 1 MILD 6 2 4 0.91830 HUMIDITY HIGH 7 4 3 0.98523 NORMAL 7 7 0 0 WINDY FALSE 8 2 6 0.81128 TRUE 6 4 2 0.91830
• 451. Computing the Root Gain 451 NODE ATTRIBUTE CASES (S) YES (Si) NO (Si) ENTROPY GAIN 1 TOTAL 14 10 4 0.86312 OUTLOOK 0.25852 CLOUDY 4 4 0 0 RAINY 5 4 1 0.72193 SUNNY 5 2 3 0.97095 TEMPERATURE 0.18385 COOL 4 0 4 0 HOT 4 2 2 1 MILD 6 2 4 0.91830 HUMIDITY 0.37051 HIGH 7 4 3 0.98523 NORMAL 7 7 0 0 WINDY 0.00598 FALSE 8 2 6 0.81128 TRUE 6 4 2 0.91830
• 452. • From the results at Node 1, the attribute with the highest Gain is HUMIDITY, at 0.37051 • HUMIDITY therefore becomes the root node • HUMIDITY has 2 attribute values, HIGH and NORMAL. Of these, the value NORMAL already classifies all its cases into a single decision, Yes, so no further computation is needed for it • For the value HIGH, however, further computation is still required Highest Gain as Root 452 1. HUMIDITY 1.1 ????? Yes High Normal
• 453. • For convenience, the dataset is filtered by taking the records with HUMIDITY=HIGH to build the Node 1.1 table 2. Create a branch for each attribute value 453 OUTLOOK TEMPERATURE HUMIDITY WINDY PLAY Sunny Hot High FALSE No Sunny Hot High TRUE No Cloudy Hot High FALSE Yes Rainy Mild High FALSE Yes Sunny Mild High FALSE No Cloudy Mild High TRUE Yes Rainy Mild High TRUE No
• 454. Computing the Branch Entropy and Gain 454 NODE ATTRIBUTE CASES (S) YES (Si) NO (Si) ENTROPY GAIN 1.1 HUMIDITY 7 3 4 0.98523 OUTLOOK 0.69951 CLOUDY 2 2 0 0 RAINY 2 1 1 1 SUNNY 3 0 3 0 TEMPERATURE 0.02024 COOL 0 0 0 0 HOT 3 1 2 0.91830 MILD 4 2 2 1 WINDY 0.02024 FALSE 4 2 2 1 TRUE 3 1 2 0.91830
• 455. • From the results in the Node 1.1 table, the attribute with the highest Gain is OUTLOOK, at 0.69951 • OUTLOOK therefore becomes the second node • The values CLOUDY = YES and SUNNY = NO already classify their cases into a single decision, so no further computation is needed for them • For the value RAINY, however, further computation is still required Highest Gain as Node 1.1 455 1. HUMIDITY 1.1 OUTLOOK Yes High Normal Yes No 1.1.2 ????? Cloudy Rainy Sunny
• 456. 3. Repeat the process for each branch until all cases in the branch have the same class 456 OUTLOOK TEMPERATURE HUMIDITY WINDY PLAY Rainy Mild High FALSE Yes Rainy Mild High TRUE No NODE ATTRIBUTE CASES (S) YES (Si) NO (Si) ENTROPY GAIN 1.2 HUMIDITY HIGH & OUTLOOK RAINY 2 1 1 1 TEMPERATURE 0 COOL 0 0 0 0 HOT 0 0 0 0 MILD 2 1 1 1 WINDY 1 FALSE 1 1 0 0 TRUE 1 0 1 0
• 457. • From the table, the highest Gain is WINDY, which becomes the child node of the RAINY branch • Since every case now falls into a single class, the process stops • The decision tree in the figure is therefore the final decision tree Highest Gain as Node 1.1.2 457 1. HUMIDITY 1.1 OUTLOOK Yes High Normal Yes No 1.1.2 WINDY Cloudy Rainy Sunny Yes No False True
  • 458. • Training data set: Buys_computer Decision Tree Induction: An Example 458 age income student credit_rating buys_computer <=30 high no fair no <=30 high no excellent no 31…40 high no fair yes >40 medium no fair yes >40 low yes fair yes >40 low yes excellent no 31…40 low yes excellent yes <=30 medium no fair no <=30 low yes fair yes >40 medium yes fair yes <=30 medium yes excellent yes 31…40 medium no excellent yes 31…40 high yes fair yes >40 medium no excellent no
• 459. • The information gain measure is biased towards attributes with a large number of values • C4.5 (a successor of ID3) uses gain ratio to overcome the problem (a normalization of information gain): SplitInfo_A(D) = −Σ_{j=1..v} (|D_j|/|D|) × log2(|D_j|/|D|) • GainRatio(A) = Gain(A)/SplitInfo_A(D) • Ex. SplitInfo_income(D) = −(4/14)·log2(4/14) − (6/14)·log2(6/14) − (4/14)·log2(4/14) = 1.557, so gain_ratio(income) = 0.029/1.557 = 0.019 • The attribute with the maximum gain ratio is selected as the splitting attribute Gain Ratio for Attribute Selection (C4.5) 459
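The gain-ratio normalization above can be sketched as follows, using the income attribute's partition sizes (low = 4, medium = 6, high = 4 of the 14 tuples):

```python
import math

def split_info(part_sizes):
    """SplitInfo_A(D) = -sum (|Dj|/|D|) * log2(|Dj|/|D|)."""
    total = sum(part_sizes)
    return -sum(s / total * math.log2(s / total) for s in part_sizes)

si_income = split_info([4, 6, 4])              # ~1.557
gain_ratio_income = 0.029 / si_income          # ~0.019, as on the slide
```

Attributes with many partitions get a large SplitInfo, which shrinks their gain ratio and counters information gain's bias toward multivalued attributes.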
• 460. • If a data set D contains examples from n classes, the gini index gini(D) is defined as gini(D) = 1 − Σ_{j=1..n} p_j², where p_j is the relative frequency of class j in D • If a data set D is split on A into two subsets D1 and D2, the gini index of the split, gini_A(D), is defined as gini_A(D) = (|D1|/|D|)·gini(D1) + (|D2|/|D|)·gini(D2) • Reduction in impurity: Δgini(A) = gini(D) − gini_A(D) • The attribute that provides the smallest gini_split(D) (or the largest reduction in impurity) is chosen to split the node (this requires enumerating all the possible splitting points for each attribute) Gini Index (CART) 460
• 461. • Ex. D has 9 tuples in buys_computer = “yes” and 5 in “no”: gini(D) = 1 − (9/14)² − (5/14)² = 0.459 • Suppose the attribute income partitions D into 10 tuples in D1: {low, medium} and 4 in D2: {high}: gini_{income ∈ {low,medium}}(D) = (10/14)·Gini(D1) + (4/14)·Gini(D2) = 0.443 Gini_{low,high} is 0.458 and Gini_{medium,high} is 0.450. Thus, split on {low, medium} (and {high}) since it has the lowest Gini index • All attributes are assumed continuous-valued • May need other tools, e.g., clustering, to get the possible split values • Can be modified for categorical attributes Computation of Gini Index 461
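The Gini computation above can be sketched from the class counts; D1 = {low, medium} holds 7 "yes" and 3 "no" tuples, and D2 = {high} holds 2 of each:

```python
def gini(counts):
    """gini(D) = 1 - sum p_j^2 over the class counts."""
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

g_d = gini([9, 5])                                          # ~0.459

# split on income: D1 = {low, medium} (7 yes, 3 no), D2 = {high} (2 yes, 2 no)
g_split = 10 / 14 * gini([7, 3]) + 4 / 14 * gini([2, 2])    # ~0.443
```

Since 0.443 is lower than the 0.458 and 0.450 of the other candidate splits, {low, medium} vs. {high} is the chosen split.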
• 462. The three measures, in general, return good results, but • Information gain: • biased towards multivalued attributes • Gain ratio: • tends to prefer unbalanced splits in which one partition is much smaller than the others • Gini index: • biased towards multivalued attributes • has difficulty when the number of classes is large • tends to favor tests that result in equal-sized partitions and purity in both partitions Comparing Attribute Selection Measures 462
• 463. • CHAID: a popular decision tree algorithm; its measure is based on the χ2 test for independence • C-SEP: performs better than information gain and gini index in certain cases • G-statistic: has a close approximation to the χ2 distribution • MDL (Minimal Description Length) principle (i.e., the simplest solution is preferred): • The best tree is the one that requires the fewest number of bits to both (1) encode the tree, and (2) encode the exceptions to the tree • Multivariate splits (partition based on multiple variable combinations) • CART: finds multivariate splits based on a linear combination of attributes • Which attribute selection measure is the best? • Most give good results; none is significantly superior to the others Other Attribute Selection Measures 463
• 464. • Overfitting: an induced tree may overfit the training data • Too many branches, some of which may reflect anomalies due to noise or outliers • Poor accuracy for unseen samples • Two approaches to avoid overfitting 1. Prepruning: halt tree construction early: do not split a node if this would result in the goodness measure falling below a threshold • Difficult to choose an appropriate threshold 2. Postpruning: remove branches from a “fully grown” tree to get a sequence of progressively pruned trees • Use a set of data different from the training data to decide which is the “best pruned tree” Overfitting and Tree Pruning 464
  • 466. • Relatively faster learning speed (than other classification methods) • Convertible to simple and easy to understand classification rules • Can use SQL queries for accessing databases • Comparable classification accuracy with other methods Why is decision tree induction popular? 466
• 467. • Run the experiments following Matthew North, Data Mining for the Masses 2nd Edition, 2016, Chapter 10 (Decision Tree), pp. 195-217 • Datasets: • eReaderAdoption-Training.csv • eReaderAdoption-Scoring.csv • Analyze the role of pruning in decision trees and its relationship to the confidence value • Analyze which kinds of decision trees are used and why they are needed for this dataset Exercise 467
• 468. Motivation: • Richard works for a large online retailer • His company is launching a tablet soon, and he wants to maximize the effectiveness of his marketing • They have a large number of customers, many of whom have purchased digital devices and other services previously • Richard has noticed that certain types of people were anxious to get new devices as soon as they became available, while other folks seemed content to wait to buy their electronic gadgets later • He’s wondering what makes some people motivated to buy something as soon as it comes out, while others are less driven to have the product right away Objectives: • To mine the customers’ consumer behaviors on the web site, in order to figure out which customers will buy the new tablet early, which ones will buy next, and which ones will buy later on 1. Business Understanding 468
• 469. • Train on the eReader Adoption data (eReader-Training.csv) using decision trees with 3 alternative criteria (Gain Ratio, Information Gain and Gini Index) • Test each split criterion both with and without pruning • Evaluate using 10-fold cross validation • From the best model, determine which factors (attributes) influence the eReader adoption level Exercise 469 DTGR DTIG DTGI DTGR+Pr DTIG+Pr DTGI+Pr Accuracy 58.39 51.01 31.01
• 470. • Apply feature selection with Forward Selection for the three algorithms above • Evaluate using 10-fold cross validation • From the best model, determine which factors (attributes) influence the eReader adoption level Exercise 470 DTGR DTIG DTGI DTGR+FS DTIG+FS DTGI+FS Accuracy 58.39 51.01 31.01 61.41 56.73 31.01
  • 473. • A statistical classifier: performs probabilistic prediction, i.e., predicts class membership probabilities • Foundation: Based on Bayes’ Theorem. • Performance: A simple Bayesian classifier, naïve Bayesian classifier, has comparable performance with decision tree and selected neural network classifiers • Incremental: Each training example can incrementally increase/decrease the probability that a hypothesis is correct — prior knowledge can be combined with observed data • Standard: Even when Bayesian methods are computationally intractable, they can provide a standard of optimal decision making against which other methods can be measured Bayesian Classification: Why? 473
• 474. • Total probability theorem: P(B) = Σ_{i=1..M} P(B|A_i)·P(A_i) • Bayes’ theorem: P(H|X) = P(X|H)·P(H)/P(X) • Let X be a data sample (“evidence”): class label is unknown • Let H be the hypothesis that X belongs to class C • Classification is to determine P(H|X) (i.e., the posterior probability): the probability that the hypothesis holds given the observed data sample X • P(H) (prior probability): the initial probability • E.g., X will buy a computer, regardless of age, income, … • P(X): the probability that the sample data is observed • P(X|H) (likelihood): the probability of observing the sample X, given that the hypothesis holds • E.g., given that X will buy a computer, the probability that X is 31..40 with medium income Bayes’ Theorem: Basics 474
• 475. • Given training data X, the posterior probability of a hypothesis H, P(H|X), follows Bayes’ theorem: P(H|X) = P(X|H)·P(H)/P(X) • Informally, this can be viewed as posterior = likelihood × prior / evidence • Predict that X belongs to C_i iff the probability P(C_i|X) is the highest among all the P(C_k|X) for all k classes • Practical difficulty: it requires initial knowledge of many probabilities, involving significant computational cost Prediction Based on Bayes’ Theorem 475
• 476. • Let D be a training set of tuples and their associated class labels, where each tuple is represented by an n-D attribute vector X = (x1, x2, …, xn) • Suppose there are m classes C1, C2, …, Cm • Classification is to derive the maximum posterior, i.e., the maximal P(C_i|X) • This can be derived from Bayes’ theorem: P(C_i|X) = P(X|C_i)·P(C_i)/P(X) • Since P(X) is constant for all classes, only P(X|C_i)·P(C_i) needs to be maximized Classification is to Derive the Maximum Posteriori 476
• 477. • A simplified assumption: attributes are conditionally independent (i.e., no dependence relation between attributes): P(X|C_i) = Π_{k=1..n} P(x_k|C_i) = P(x1|C_i) × P(x2|C_i) × … × P(xn|C_i) • This greatly reduces the computation cost: only count the class distribution • If A_k is categorical, P(x_k|C_i) is the number of tuples in C_i having value x_k for A_k divided by |C_i,D| (the number of tuples of C_i in D) • If A_k is continuous-valued, P(x_k|C_i) is usually computed based on a Gaussian distribution with mean μ and standard deviation σ: g(x, μ, σ) = (1/(√(2π)·σ)) · e^(−(x−μ)²/(2σ²)) and P(x_k|C_i) = g(x_k, μ_{C_i}, σ_{C_i}) Naïve Bayes Classifier 477
• 478. Naïve Bayes Classifier: Training Dataset 478 Class: C1:buys_computer = ‘yes’ C2:buys_computer = ‘no’ Data to be classified: X = (age <=30, income = medium, student = yes, credit_rating = fair) X  buy computer? age income student credit_rating buys_computer <=30 high no fair no <=30 high no excellent no 31…40 high no fair yes >40 medium no fair yes >40 low yes fair yes >40 low yes excellent no 31…40 low yes excellent yes <=30 medium no fair no <=30 low yes fair yes >40 medium yes fair yes <=30 medium yes excellent yes 31…40 medium no excellent yes 31…40 high yes fair yes >40 medium no excellent no
  • 479. • P(Ci): P(buys_computer = “yes”) = 9/14 = 0.643 P(buys_computer = “no”) = 5/14= 0.357 • Compute P(X|Ci) for each class P(age = “<=30” | buys_computer = “yes”) = 2/9 = 0.222 P(age = “<= 30” | buys_computer = “no”) = 3/5 = 0.6 P(income = “medium” | buys_computer = “yes”) = 4/9 = 0.444 P(income = “medium” | buys_computer = “no”) = 2/5 = 0.4 P(student = “yes” | buys_computer = “yes) = 6/9 = 0.667 P(student = “yes” | buys_computer = “no”) = 1/5 = 0.2 P(credit_rating = “fair” | buys_computer = “yes”) = 6/9 = 0.667 P(credit_rating = “fair” | buys_computer = “no”) = 2/5 = 0.4 • X = (age <= 30 , income = medium, student = yes, credit_rating = fair) P(X|Ci) : P(X|buys_computer = “yes”) = 0.222 x 0.444 x 0.667 x 0.667 = 0.044 P(X|buys_computer = “no”) = 0.6 x 0.4 x 0.2 x 0.4 = 0.019 P(X|Ci)*P(Ci) : P(X|buys_computer = “yes”) * P(buys_computer = “yes”) = 0.028 P(X|buys_computer = “no”) * P(buys_computer = “no”) = 0.007 Therefore, X belongs to class (“buys_computer = yes”) Naïve Bayes Classifier: An Example 479 age income student credit_rating buys_computer <=30 high no fair no <=30 high no excellent no 31…40 high no fair yes >40 medium no fair yes >40 low yes fair yes >40 low yes excellent no 31…40 low yes excellent yes <=30 medium no fair no <=30 low yes fair yes >40 medium yes fair yes <=30 medium yes excellent yes 31…40 medium no excellent yes 31…40 high yes fair yes >40 medium no excellent no
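The worked computation above reduces to a few multiplications once the conditional probabilities are read off the training table:

```python
# Naive Bayes scores for X = (age<=30, income=medium, student=yes, credit=fair)
p_yes = 9 / 14                               # P(buys_computer = yes)
p_no = 5 / 14                                # P(buys_computer = no)

like_yes = (2/9) * (4/9) * (6/9) * (6/9)     # P(X | yes) ~ 0.044
like_no = (3/5) * (2/5) * (1/5) * (2/5)      # P(X | no)  ~ 0.019

score_yes = like_yes * p_yes                 # ~0.028
score_no = like_no * p_no                    # ~0.007
prediction = "yes" if score_yes > score_no else "no"
```

The denominator P(X) is the same for both classes, so comparing the unnormalized scores is enough; X is classified as "buys_computer = yes".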
• 480. 1. Read the training data 2. Count the number of classes 3. Count the cases with the same attribute values for each class 4. Multiply all the resulting values for the data X whose class is sought Steps of the Naïve Bayes Algorithm 480
• 481. 1. Read the training data 481
  • 482. • X  Data dengan class yang belum diketahui • H  Hipotesis data X yang merupakan suatu class yang lebih spesifik • P (H|X)  Probabilitas hipotesis H berdasarkan kondisi X (posteriori probability) • P (H)  Probabilitas hipotesis H (prior probability) • P (X|H)  Probabilitas X berdasarkan kondisi pada hipotesis H • P (X)  Probabilitas X Teorema Bayes 482 ) ( / ) ( ) | ( ) ( ) ( ) | ( ) | ( X X X X X P H P H P P H P H P H P   
• 483. • There are 2 classes in the training data: • C1 (Class 1) Play = yes: 9 records • C2 (Class 2) Play = no: 5 records • Total = 14 records • Therefore: • P(C1) = 9/14 = 0.642857143 • P(C2) = 5/14 = 0.357142857 • Question: • Data X = (outlook=rainy, temperature=cool, humidity=high, windy=true) • Play golf or not? 2. Count the number of classes/labels 483
• 484. • P(Ci), i.e. P(C1) and P(C2), was already computed in the previous step • Next, compute P(X|Ci) for i = 1 and 2: • P(outlook=“sunny”|play=“yes”) = 2/9 = 0.222222222 • P(outlook=“sunny”|play=“no”) = 3/5 = 0.6 • P(outlook=“cloudy”|play=“yes”) = 4/9 = 0.444444444 • P(outlook=“cloudy”|play=“no”) = 0/5 = 0 • P(outlook=“rainy”|play=“yes”) = 3/9 = 0.333333333 • P(outlook=“rainy”|play=“no”) = 2/5 = 0.4 3. Count the cases with the same attribute values for each class 484
• 485. • If all attributes are computed, the final results are as follows: 3. Count the cases with the same attribute values for each class 485 Attribute Parameter No Yes Outlook value=sunny 0.6 0.2222222222222222 Outlook value=cloudy 0.0 0.4444444444444444 Outlook value=rainy 0.4 0.3333333333333333 Temperature value=hot 0.4 0.2222222222222222 Temperature value=mild 0.4 0.4444444444444444 Temperature value=cool 0.2 0.3333333333333333 Humidity value=high 0.8 0.3333333333333333 Humidity value=normal 0.2 0.6666666666666666 Windy value=false 0.4 0.6666666666666666 Windy value=true 0.6 0.3333333333333333
• 486. • Question: • Data X = (outlook=rainy, temperature=cool, humidity=high, windy=true) • Play golf or not? • Multiply all the values for data X: • P(X|play=“yes”) = 0.333333333 × 0.333333333 × 0.333333333 × 0.333333333 = 0.012345679 • P(X|play=“no”) = 0.4 × 0.2 × 0.8 × 0.6 = 0.0384 • P(X|play=“yes”)·P(C1) = 0.012345679 × 0.642857143 = 0.007936508 • P(X|play=“no”)·P(C2) = 0.0384 × 0.357142857 = 0.013714286 • The “no” score is larger than the “yes” score, so the class of data X is “No” 4. Multiply all the resulting values for the data X whose class is sought 486
• 487. • Naïve Bayesian prediction requires each conditional probability to be non-zero; otherwise, the predicted probability P(X|C_i) = Π_{k=1..n} P(x_k|C_i) will be zero • Ex. Suppose a dataset with 1000 tuples: income=low (0), income=medium (990), and income=high (10) • Use the Laplacian correction (or Laplacian estimator) • Adding 1 to each case: Prob(income = low) = 1/1003, Prob(income = medium) = 991/1003, Prob(income = high) = 11/1003 • The “corrected” probability estimates are close to their “uncorrected” counterparts Avoiding the Zero-Probability Problem 487
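The Laplacian correction above is a one-liner: add 1 to every category count and add the number of categories to the denominator:

```python
# Laplacian correction for the income example: 0 / 990 / 10 out of 1000 tuples
counts = {"low": 0, "medium": 990, "high": 10}
k = len(counts)            # 3 categories
n = sum(counts.values())   # 1000 tuples

corrected = {v: (c + 1) / (n + k) for v, c in counts.items()}
# low: 1/1003, medium: 991/1003, high: 11/1003 -- no zero probability remains
```

The corrected estimates still sum to 1, and the nonzero counts are barely changed, so the product in the Naïve Bayes classifier can no longer collapse to zero.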
• 488. • Advantages • Easy to implement • Good results obtained in most of the cases • Disadvantages • Assumption: class conditional independence, therefore loss of accuracy • Practically, dependencies exist among variables, e.g.: • Hospital patients’ profile: age, family history, etc. • Symptoms: fever, cough, etc. • Disease: lung cancer, diabetes, etc. • Dependencies among these cannot be modeled by a Naïve Bayes classifier • How to deal with these dependencies? Bayesian Belief Networks Naïve Bayes Classifier: Comments 488
• 490. • A neural network is a model built to mimic the learning function of the human brain: a network of small processing units modeled on the human nervous system Neural Network 490
• 491. • The perceptron model is a network model consisting of several input units (plus a bias) and a single output unit • The activation function is not only a binary function (0, 1) but a bipolar one (1, 0, -1) • For a given threshold value θ: f(net) = 1 if net > θ; 0 if −θ ≤ net ≤ θ; −1 if net < −θ Neural Network 491
• 492. Activation functions used to activate the net in various kinds of neural networks: 1. Linear activation: y = sign(v) = v 2. Step activation: y = 1 if v ≥ 0, else 0 3. Binary sigmoid activation: y = 1 / (1 + e^(−v)) 4. Bipolar sigmoid activation: y = (1 − e^(−v)) / (1 + e^(−v)) Activation Functions 492
• 493. 1. Initialize all weights and the bias (usually w_i = b = 0) 2. While there is an input vector whose output unit response does not equal the target, do: 2.1 Set the input unit activations x_i = s_i (i = 1,...,n) 2.2 Compute the output unit response: net = Σ_i x_i·w_i + b, with f(net) = 1 if net > θ; 0 if −θ ≤ net ≤ θ; −1 if net < −θ 2.3 Correct the weights of patterns containing errors according to: w_i(new) = w_i(old) + ∆w (i = 1,...,n) with ∆w = α·t·x_i, and b(new) = b(old) + ∆b with ∆b = α·t, where: α = the chosen learning rate, θ = the chosen threshold, t = the target 2.4 Repeat the iterations until there are no more weight changes (∆w_n = 0) Steps of the Perceptron Algorithm 493
• 494. • Given a graduation dataset based on GPA (IPK) for an undergraduate (S1) program: • If a student has a GPA of 2.85 and is still in semester 1, into which status does that student fall? Case Study 494 Status IPK Semester Lulus 2.9 1 Tidak Lulus 2.8 3 Tidak Lulus 2.3 5 Tidak Lulus 2.7 6
• 495. • Initialize the weights and bias: b = 0 and bias = 1 1: Initialize the Weights 495 t X1 X2 1 2.9 1 -1 2.8 3 -1 2.3 5 -1 2.7 6
• 496. • Threshold θ = 0, which means: f(net) = 1 if net > 0; 0 if net = 0; −1 if net < 0 2.1: Set the Input Unit Activations 496
• 497. • Compute the output response, iteration 1 • Correct the weights of patterns containing errors 2.2 - 2.3 Compute the Response and Correct the Weights 497 INPUT TARGET y = f(NET) WEIGHT CHANGE NEW WEIGHTS X1 X2 1 t NET f(NET) ∆W1 ∆W2 ∆b W1 W2 b INITIALIZATION 0 0 0 2.9 1 1 1 0 0 2.9 1 1 2.9 7 1 2.8 3 1 -1 8.12 1 -2.8 -3 -1 0.1 4 0 2.3 5 1 -1 0.23 1 -2.3 -5 -1 -2.2 -1 -1 2.7 6 1 -1 -5.94 -1 0 0 0 -2.2 -1 -1
• 498. • Compute the output response, iteration 2 • Correct the weights of patterns containing errors 2.4 Repeat the iterations until there are no more weight changes (∆wn = 0) (Iteration 2) 498 INPUT TARGET y = f(NET) WEIGHT CHANGE NEW WEIGHTS X1 X2 1 t NET f(NET) ∆W1 ∆W2 ∆b W1 W2 b INITIALIZATION -2.2 -1 -1 2.9 1 1 1 -8.38 -1 2.9 1 1 0.7 0 0 2.8 3 1 -1 1.96 1 -2.8 -3 -1 -2.1 -3 -1 2.3 5 1 -1 -20.83 -1 0 0 0 -2.1 -3 -1 2.7 6 1 -1 -24.67 -1 0 0 0 -2.1 -3 -1
• 499. • Compute the output response, iteration 3 • Correct the weights of patterns containing errors • Since the GPA data follows the pattern 0.8x − 2y = 0, the prediction can be computed using the last weights obtained: V = X1·W1 + X2·W2 = 0.8 × 2.85 − 2 × 1 = 2.28 − 2 = 0.28, Y = sign(V) = sign(0.28) = 1 (Lulus) 2.4 Repeat the iterations until there are no more weight changes (∆wn = 0) (Iteration 3) 499 INPUT TARGET y = f(NET) WEIGHT CHANGE NEW WEIGHTS X1 X2 1 t NET f(NET) ∆W1 ∆W2 ∆b W1 W2 b INITIALIZATION -2.1 -3 -1 2.9 1 1 1 -10.09 -1 2.9 1 1 0.8 -2 0 2.8 3 1 -1 -3.76 -1 0 0 0 0.8 -2 0 2.3 5 1 -1 -8.16 -1 0 0 0 0.8 -2 0 2.7 6 1 -1 -9.84 -1 0 0 0 0.8 -2 0
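The training loop walked through above can be sketched as follows; this is a minimal sketch with α = 1 and θ = 0, on the graduation dataset with columns (GPA, semester) and target 1 = Lulus:

```python
# Perceptron training on the graduation dataset (alpha = 1, theta = 0)
data = [((2.9, 1), 1), ((2.8, 3), -1), ((2.3, 5), -1), ((2.7, 6), -1)]

def f(net, theta=0.0):
    """Bipolar threshold activation: 1 / 0 / -1."""
    if net > theta:
        return 1
    if net < -theta:
        return -1
    return 0

w1 = w2 = b = 0.0
alpha = 1.0
for _ in range(5000):                  # repeat until weights stop changing
    changed = False
    for (x1, x2), t in data:
        net = w1 * x1 + w2 * x2 + b
        if f(net) != t:                # wrong response -> update the weights
            w1 += alpha * t * x1
            w2 += alpha * t * x2
            b += alpha * t
            changed = True
    if not changed:
        break

# Classifying the new student (GPA 2.85, semester 1) with the slide's final
# weights w = (0.8, -2), b = 0: v = 0.28 > 0, so class 1 ("Lulus")
v = 0.8 * 2.85 - 2 * 1
```

Because the dataset is linearly separable (the slide's final weights separate it), the loop is guaranteed to converge to some separating hyperplane, though the exact final weights can differ from those in the slide's trace.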
• 500. • Run the experiments following Matthew North, Data Mining for the Masses 2nd Edition, 2016, Chapter 11 (Neural Network), pp. 219-228 • Dataset: • TeamValue-Training.csv • TeamValue-Scoring.csv • Study the resulting neural network model; note the thickness of the neuron connections between nodes Exercise 500
  • 501. Motivation: • Juan is a performance analyst for a major professional athletic team • His team has been steadily improving over recent seasons, and heading into the coming season management believes that by adding between two and four excellent players, the team will have an outstanding shot at achieving the league championship • They have tasked Juan with identifying their best options from among a list of 59 players that may be available to them • All of these players have experience; some have played professionally before and some have years of experience as amateurs • None are to be ruled out without being assessed for their potential ability to add star power and productivity to the existing team • The executives Juan works for are anxious to get going on contacting the most promising prospects, so Juan needs to quickly evaluate these athletes’ past performance and make recommendations based on his analysis Objectives: • To evaluate each of the 59 prospects’ past statistical performance in order to help him formulate recommendations based on his analysis 1. Business Understanding 501
• 502. • Train a neural network on the dataset TeamValue-Training.csv • Use 10-fold cross validation • Adjust the number of hidden layers and the neuron size, e.g.: 3 hidden layers with a neuron size of 5 each • What happens? Is there an increase in accuracy? Exercise 502 NN NN (HL 2, NS 3) NN (HL 2, NS 5) NN (HL 3, NS 3) NN (HL 3, NS 5) NN (HL 4, NS 3) NN (HL 4, NS 5) Accuracy
• 503. Hidden Layers Capabilities 0 Only capable of representing linearly separable functions or decisions 1 Can approximate any function that contains a continuous mapping from one finite space to another 2 Can represent an arbitrary decision boundary to arbitrary accuracy with rational activation functions and can approximate any smooth mapping to any accuracy Determining the Number of Hidden Layers 503
• 504. 1. Trial and error 2. Rules of thumb: • Between the size of the input layer and the size of the output layer • 2/3 the size of the input layer, plus the size of the output layer • Less than twice the size of the input layer 3. Search algorithms: • Greedy • Genetic Algorithm • Particle Swarm Optimization • etc. Determining the Neuron Size 504
  • 505. Techniques to Improve Classification Accuracy: Ensemble Methods 505
  • 506. • Ensemble methods • Use a combination of models to increase accuracy • Combine a series of k learned models, M1, M2, …, Mk, with the aim of creating an improved model M* • Popular ensemble methods • Bagging: averaging the prediction over a collection of classifiers • Boosting: weighted vote with a collection of classifiers • Ensemble: combining a set of heterogeneous classifiers Ensemble Methods: Increasing the Accuracy 506
• 507. • Analogy: diagnosis based on multiple doctors’ majority vote • Training • Given a set D of d tuples, at each iteration i, a training set Di of d tuples is sampled with replacement from D (i.e., bootstrap) • A classifier model Mi is learned for each training set Di • Classification: classify an unknown sample X • Each classifier Mi returns its class prediction • The bagged classifier M* counts the votes and assigns the class with the most votes to X • Prediction: can be applied to the prediction of continuous values by taking the average value of each prediction for a given test tuple • Accuracy • Often significantly better than a single classifier derived from D • For noisy data: not considerably worse, more robust • Proved improved accuracy in prediction Bagging: Bootstrap Aggregation 507
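The bootstrap-and-vote procedure above can be sketched as a toy example; the "classifier" here is a trivial majority-class model standing in for a real learner such as a decision tree:

```python
import random
from collections import Counter

random.seed(1)
D = [("x1", "yes"), ("x2", "yes"), ("x3", "no"), ("x4", "yes"), ("x5", "no")]

def train(sample):
    # stand-in learner: always predicts the sample's majority class
    majority = Counter(label for _, label in sample).most_common(1)[0][0]
    return lambda x: majority

models = []
for _ in range(9):                                    # k = 9 bootstrap rounds
    boot = [random.choice(D) for _ in range(len(D))]  # sample with replacement
    models.append(train(boot))

# the bagged classifier M*: majority vote over the k models
votes = Counter(m("x_new") for m in models)
prediction = votes.most_common(1)[0][0]
```

In practice each `train` call would fit a full model on its bootstrap sample; the averaging over resampled training sets is what smooths out the variance of the individual learners.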
• 508. • Analogy: Consult several doctors, based on a combination of weighted diagnoses—weight assigned based on the previous diagnosis accuracy • How does boosting work? 1. Weights are assigned to each training tuple 2. A series of k classifiers is iteratively learned 3. After a classifier Mi is learned, the weights are updated to allow the subsequent classifier, Mi+1, to pay more attention to the training tuples that were misclassified by Mi 4. The final M* combines the votes of each individual classifier, where the weight of each classifier's vote is a function of its accuracy • Boosting algorithm can be extended for numeric prediction • Comparing with bagging: Boosting tends to have greater accuracy, but it also risks overfitting the model to misclassified data Boosting 508
• 509. 1. Given a set of d class-labeled tuples, (X1, y1), …, (Xd, yd) 2. Initially, all the weights of tuples are set the same (1/d) 3. Generate k classifiers in k rounds. At round i, 1. Tuples from D are sampled (with replacement) to form a training set Di of the same size 2. Each tuple’s chance of being selected is based on its weight 3. A classification model Mi is derived from Di 4. Its error rate is calculated using Di as a test set 5. If a tuple is misclassified, its weight is increased, o.w. it is decreased 4. Error rate: err(Xj) is the misclassification error of tuple Xj. Classifier Mi's error rate is the sum of the weights of the misclassified tuples: error(Mi) = Σj wj × err(Xj) 5. The weight of classifier Mi’s vote is log((1 − error(Mi)) / error(Mi)) Adaboost (Freund and Schapire, 1997) 509
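The error-rate and vote-weight formulas in steps 4–5, plus the textbook weight update (multiply correctly classified tuples by error/(1 − error), then renormalize), can be checked numerically. A minimal arithmetic sketch, not a full AdaBoost trainer; the tiny 4-tuple example is our own:

```python
import math

def classifier_error(weights, miss):
    """error(Mi) = sum of weights of misclassified tuples (err(Xj) is 0 or 1)."""
    return sum(w * e for w, e in zip(weights, miss))

def vote_weight(error):
    """Weight of classifier Mi's vote: log((1 - error(Mi)) / error(Mi))."""
    return math.log((1 - error) / error)

def reweight(weights, miss, error):
    """Multiply correctly classified tuples by error/(1-error), then normalize
    so the updated weights again sum to 1."""
    updated = [w * (error / (1 - error)) if e == 0 else w
               for w, e in zip(weights, miss)]
    total = sum(updated)
    return [w / total for w in updated]

# d = 4 tuples, initial weights 1/d; suppose classifier M1 misclassifies tuple 2
w = [0.25, 0.25, 0.25, 0.25]
miss = [0, 0, 1, 0]
err = classifier_error(w, miss)   # 0.25
alpha = vote_weight(err)          # log(3)
w2 = reweight(w, miss, err)       # misclassified tuple's weight grows to 0.5
```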
  • 510. • Random Forest: • Each classifier in the ensemble is a decision tree classifier and is generated using a random selection of attributes at each node to determine the split • During classification, each tree votes and the most popular class is returned • Two Methods to construct Random Forest: 1. Forest-RI (random input selection): Randomly select, at each node, F attributes as candidates for the split at the node. The CART methodology is used to grow the trees to maximum size 2. Forest-RC (random linear combinations): Creates new attributes (or features) that are a linear combination of the existing attributes (reduces the correlation between individual classifiers) • Comparable in accuracy to Adaboost, but more robust to errors and outliers • Insensitive to the number of attributes selected for consideration at each split, and faster than bagging or boosting Random Forest (Breiman 2001) 510
  • 511. • Class-imbalance problem: Rare positive example but numerous negative ones, e.g., medical diagnosis, fraud, oil- spill, fault, etc. • Traditional methods assume a balanced distribution of classes and equal error costs: not suitable for class- imbalanced data • Typical methods for imbalance data in 2-class classification: 1. Oversampling: re-sampling of data from positive class 2. Under-sampling: randomly eliminate tuples from negative class 3. Threshold-moving: moves the decision threshold, t, so that the rare class tuples are easier to classify, and hence, less chance of costly false negative errors 4. Ensemble techniques: Ensemble multiple classifiers introduced above • Still difficult for class imbalance problem on multiclass tasks Classification of Class-Imbalanced Data Sets 511
• 512. 4.2 Clustering Algorithms 4.2.1 Partitioning Methods 4.2.2 Hierarchical Methods 4.2.3 Density-Based Methods 4.2.4 Grid-Based Methods 512
  • 513. • Cluster: A collection of data objects • similar (or related) to one another within the same group • dissimilar (or unrelated) to the objects in other groups • Cluster analysis (or clustering, data segmentation, …) • Finding similarities between data according to the characteristics found in the data and grouping similar data objects into clusters • Unsupervised learning: no predefined classes (i.e., learning by observations vs. learning by examples: supervised) • Typical applications • As a stand-alone tool to get insight into data distribution • As a preprocessing step for other algorithms What is Cluster Analysis? 513
  • 514. • Data reduction • Summarization: Preprocessing for regression, PCA, classification, and association analysis • Compression: Image processing: vector quantization • Hypothesis generation and testing • Prediction based on groups • Cluster & find characteristics/patterns for each group • Finding K-nearest Neighbors • Localizing search to one or a small number of clusters • Outlier detection: Outliers are often viewed as those “far away” from any cluster Applications of Cluster Analysis 514
  • 515. • Biology: taxonomy of living things: kingdom, phylum, class, order, family, genus and species • Information retrieval: document clustering • Land use: Identification of areas of similar land use in an earth observation database • Marketing: Help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs • City-planning: Identifying groups of houses according to their house type, value, and geographical location • Earth-quake studies: Observed earth quake epicenters should be clustered along continent faults • Climate: understanding earth climate, find patterns of atmospheric and ocean • Economic Science: market research Clustering: Application Examples 515
  • 516. • Feature selection • Select info concerning the task of interest • Minimal information redundancy • Proximity measure • Similarity of two feature vectors • Clustering criterion • Expressed via a cost function or some rules • Clustering algorithms • Choice of algorithms • Validation of the results • Validation test (also, clustering tendency test) • Interpretation of the results • Integration with applications Basic Steps to Develop a Clustering Task 516
  • 517. • A good clustering method will produce high quality clusters • high intra-class similarity: cohesive within clusters • low inter-class similarity: distinctive between clusters • The quality of a clustering method depends on • the similarity measure used by the method • its implementation, and • Its ability to discover some or all of the hidden patterns Quality: What Is Good Clustering? 517
  • 518. • Dissimilarity/Similarity metric • Similarity is expressed in terms of a distance function, typically metric: d(i, j) • The definitions of distance functions are usually rather different for interval-scaled, boolean, categorical, ordinal ratio, and vector variables • Weights should be associated with different variables based on applications and data semantics • Quality of clustering: • There is usually a separate “quality” function that measures the “goodness” of a cluster. • It is hard to define “similar enough” or “good enough” • The answer is typically highly subjective Measure the Quality of Clustering 518
  • 519. • Partitioning criteria • Single level vs. hierarchical partitioning (often, multi-level hierarchical partitioning is desirable) • Separation of clusters • Exclusive (e.g., one customer belongs to only one region) vs. non- exclusive (e.g., one document may belong to more than one class) • Similarity measure • Distance-based (e.g., Euclidian, road network, vector) vs. connectivity-based (e.g., density or contiguity) • Clustering space • Full space (often when low dimensional) vs. subspaces (often in high-dimensional clustering) Considerations for Cluster Analysis 519
  • 520. • Scalability • Clustering all the data instead of only on samples • Ability to deal with different types of attributes • Numerical, binary, categorical, ordinal, linked, and mixture of these • Constraint-based clustering • User may give inputs on constraints • Use domain knowledge to determine input parameters • Interpretability and usability • Others • Discovery of clusters with arbitrary shape • Ability to deal with noisy data • Incremental clustering and insensitivity to input order • High dimensionality Requirements and Challenges 520
• 521. • Partitioning approach: • Construct various partitions and then evaluate them by some criterion, e.g., minimizing the sum of square errors • Typical methods: k-means, k-medoids, CLARANS • Hierarchical approach: • Create a hierarchical decomposition of the set of data (or objects) using some criterion • Typical methods: Diana, Agnes, BIRCH, CAMELEON • Density-based approach: • Based on connectivity and density functions • Typical methods: DBSCAN, OPTICS, DenClue • Grid-based approach: • based on a multiple-level granularity structure • Typical methods: STING, WaveCluster, CLIQUE Major Clustering Approaches 1 521
• 522. • Model-based: • A model is hypothesized for each of the clusters, and the method tries to find the best fit of the data to the given model • Typical methods: EM, SOM, COBWEB • Frequent pattern-based: • Based on the analysis of frequent patterns • Typical methods: p-Cluster • User-guided or constraint-based: • Clustering by considering user-specified or application-specific constraints • Typical methods: COD (obstacles), constrained clustering • Link-based clustering: • Objects are often linked together in various ways • Massive links can be used to cluster objects: SimRank, LinkClus Major Clustering Approaches 2 522
• 524. • Partitioning method: Partitioning a database D of n objects into a set of k clusters, such that the sum of squared distances is minimized (where ci is the centroid or medoid of cluster Ci): E = Σi=1..k Σp∈Ci d(p, ci)² • Given k, find a partition of k clusters that optimizes the chosen partitioning criterion • Global optimal: exhaustively enumerate all partitions • Heuristic methods: k-means and k-medoids algorithms • k-means (MacQueen’67, Lloyd’57/’82): Each cluster is represented by the center of the cluster • k-medoids or PAM (Partition around medoids) (Kaufman & Rousseeuw’87): Each cluster is represented by one of the objects in the cluster Partitioning Algorithms: Basic Concept 524
  • 525. • Given k, the k-means algorithm is implemented in four steps: 1. Partition objects into k nonempty subsets 2. Compute seed points as the centroids of the clusters of the current partitioning (the centroid is the center, i.e., mean point, of the cluster) 3. Assign each object to the cluster with the nearest seed point 4. Go back to Step 2, stop when the assignment does not change The K-Means Clustering Method 525
• 526. An Example of K-Means Clustering 526 (figure, K=2: arbitrarily partition objects into k groups → update the cluster centroids → reassign objects → update the cluster centroids → loop if needed) The initial data set: • Partition objects into k nonempty subsets • Repeat • Compute centroid (i.e., mean point) for each partition • Assign each object to the cluster of its nearest centroid • Until no change
• 527. 1. Choose the desired number of clusters k 2. Initialize the k cluster centers (centroids) randomly 3. Assign each data object to the nearest cluster. Closeness between two objects is determined by their distance. The distance used in the k-Means algorithm is the Euclidean distance: dEuclidean(x, y) = √(Σi=1..n (xi − yi)²) where x = x1, x2, …, xn and y = y1, y2, …, yn range over the n attributes (columns) of the two records 4. Recompute the cluster centers using the current cluster memberships. A cluster center is the mean of all data objects in that cluster 5. Reassign each object using the new cluster centers. If the cluster centers no longer change, the clustering process is finished. Otherwise, go back to step 3 until the cluster centers no longer change (are stable) or there is no significant decrease in the SSE (Sum of Squared Errors) Steps of the k-Means Algorithm 527
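The k-means steps above translate directly into a short pure-Python sketch (illustrative only; function names and the two-blob toy data are our own assumptions):

```python
import math

def euclidean(x, y):
    """Euclidean distance over the n attributes of two records."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def kmeans(points, centroids, max_iter=100):
    """Steps 3-5: assign each point to the nearest centroid, recompute the
    means, and repeat until the centroids stop changing."""
    for _ in range(max_iter):
        clusters = [[] for _ in centroids]
        for p in points:
            i = min(range(len(centroids)),
                    key=lambda i: euclidean(p, centroids[i]))
            clusters[i].append(p)
        new_centroids = [
            tuple(sum(c) / len(cl) for c in zip(*cl)) if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
        if new_centroids == centroids:  # centroids stable: clustering finished
            break
        centroids = new_centroids
    sse = sum(euclidean(p, centroids[i]) ** 2
              for i, cl in enumerate(clusters) for p in cl)
    return centroids, clusters, sse

points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centroids, clusters, sse = kmeans(points, [(0.0, 0.0), (10.0, 10.0)])
```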
• 528. 1. Set the number of clusters k = 2 2. Choose initial centroids at random from the data, e.g., m1 = (1,1), m2 = (2,1) 3. Assign each object to the cluster whose centroid is nearest (smallest distance). The result: cluster1 = {A,E,G}, cluster2 = {B,C,D,F,H} The SSE is: SSE = Σi=1..k Σp∈Ci d(p, mi)² Worked Example – Iteration 1 528
• 529. 4. Compute the new centroids: m1 = ((1+1+1)/3, (3+2+1)/3) = (1, 2) m2 = ((3+4+5+4+2)/5, (3+3+3+2+1)/5) = (3.6, 2.4) 5. Reassign each object using the new cluster centers. The new SSE: Iteration 2 529
• 530. 4. The cluster memberships change: cluster1 = {A,E,G,H}, cluster2 = {B,C,D,F}, so compute the new centroids again: m1 = (1.25, 1.75) and m2 = (4, 2.75) 5. Reassign each object using the new cluster centers The new SSE: Iteration 3 530
• 531. • As the table shows, there are no further membership changes in either cluster • Final result: cluster1 = {A,E,G,H} and cluster2 = {B,C,D,F}, with SSE = 6.25 after 3 iterations Final Result 531
• 532. • Follow the experiment in Matthew North, Data Mining for the Masses, 2012, Chapter 6 k-Means Clustering, pp. 91-103 (CoronaryHeartDisease.csv) • Draw a chart and choose Scatter 3D Color to visualize the clustering results • Analyze what Sonia has done, and what benefit k-Means clustering brings to her work Exercise 532
• 533. • Measure performance using Cluster Distance Performance to obtain the Davies-Bouldin Index (DBI) • A lower DBI value means the clusters we have formed are better Exercise 533
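Assuming the textbook definition of the Davies-Bouldin Index (the average, over clusters, of the worst ratio of within-cluster scatter to between-centroid separation; RapidMiner's Cluster Distance Performance may differ in details), a small pure-Python sketch:

```python
import math

def _dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def davies_bouldin(clusters):
    """clusters: list of lists of points. Lower DBI = better-separated clusters."""
    centroids = [tuple(sum(c) / len(cl) for c in zip(*cl)) for cl in clusters]
    # S_i: average distance of cluster i's points to its centroid
    scatter = [sum(_dist(p, centroids[i]) for p in cl) / len(cl)
               for i, cl in enumerate(clusters)]
    k = len(clusters)
    total = 0.0
    for i in range(k):
        # worst-case similarity of cluster i to any other cluster j
        total += max((scatter[i] + scatter[j]) / _dist(centroids[i], centroids[j])
                     for j in range(k) if j != i)
    return total / k

loose = davies_bouldin([[(0.0, 0.0), (0.0, 2.0)], [(10.0, 0.0), (10.0, 2.0)]])
tight = davies_bouldin([[(0.0, 0.0), (0.0, 0.2)], [(10.0, 0.0), (10.0, 0.2)]])
# tighter clusters at the same separation give a lower (better) DBI
```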
• 534. • Cluster the IMFdata.csv data (http://romisatriawahono.net/lecture/dm/dataset) Exercise 534
• 535. • Strength: • Efficient: O(tkn), where n is # objects, k is # clusters, and t is # iterations. Normally, k, t << n. • Comparing: PAM: O(k(n−k)²), CLARA: O(ks² + k(n−k)) • Comment: Often terminates at a local optimum • Weakness • Applicable only to objects in a continuous n-dimensional space • Using the k-modes method for categorical data • In comparison, k-medoids can be applied to a wide range of data • Need to specify k, the number of clusters, in advance (there are ways to automatically determine the best k; see Hastie et al., 2009) • Sensitive to noisy data and outliers • Not suitable to discover clusters with non-convex shapes Comments on the K-Means Method 535
  • 536. • Most of the variants of the k-means which differ in • Selection of the initial k means • Dissimilarity calculations • Strategies to calculate cluster means • Handling categorical data: k-modes • Replacing means of clusters with modes • Using new dissimilarity measures to deal with categorical objects • Using a frequency-based method to update modes of clusters • A mixture of categorical and numerical data: k-prototype method Variations of the K-Means Method 536
• 537. • The k-means algorithm is sensitive to outliers! • Since an object with an extremely large value may substantially distort the distribution of the data • K-Medoids: • Instead of taking the mean value of the objects in a cluster as a reference point, a medoid can be used, which is the most centrally located object in a cluster What Is the Problem of the K-Means Method? 537 (figure: two scatter plots contrasting a mean-based center with a medoid on the same data)
• 538. 538 PAM: A Typical K-Medoids Algorithm (figure, K=2: arbitrarily choose k objects as initial medoids → assign each remaining object to the nearest medoid (Total Cost = 20) → randomly select a non-medoid object Orandom → compute the total cost of swapping (Total Cost = 26) → swap O and Orandom if quality is improved → loop until no change)
  • 539. • K-Medoids Clustering: Find representative objects (medoids) in clusters • PAM (Partitioning Around Medoids, Kaufmann & Rousseeuw 1987) • Starts from an initial set of medoids and iteratively replaces one of the medoids by one of the non-medoids if it improves the total distance of the resulting clustering • PAM works effectively for small data sets, but does not scale well for large data sets (due to the computational complexity) • Efficiency improvement on PAM • CLARA (Kaufmann & Rousseeuw, 1990): PAM on samples • CLARANS (Ng & Han, 1994): Randomized re-sampling The K-Medoid Clustering Method 539
• 541. • Use distance matrix as clustering criteria • This method does not require the number of clusters k as an input, but needs a termination condition Hierarchical Clustering 541 (figure: objects a, b, c, d, e; agglomerative (AGNES) merges them from step 0 to step 4, divisive (DIANA) splits them from step 4 back to step 0)
• 542. • Introduced in Kaufmann and Rousseeuw (1990) • Implemented in statistical packages, e.g., Splus • Use the single-link method and the dissimilarity matrix • Merge nodes that have the least dissimilarity • Go on in a non-descending fashion • Eventually all nodes belong to the same cluster AGNES (Agglomerative Nesting) 542 (figure: three scatter plots showing clusters being merged step by step into one)
• 543. Dendrogram: Shows How Clusters are Merged 543 Decompose data objects into several levels of nested partitioning (a tree of clusters), called a dendrogram A clustering of the data objects is obtained by cutting the dendrogram at the desired level; then each connected component forms a cluster
• 544. • Introduced in Kaufmann and Rousseeuw (1990) • Implemented in statistical analysis packages, e.g., Splus • Inverse order of AGNES • Eventually each node forms a cluster on its own DIANA (Divisive Analysis) 544 (figure: three scatter plots showing one cluster being split step by step into singletons)
• 545. • Single link: smallest distance between an element in one cluster and an element in the other, i.e., dist(Ki, Kj) = min(tip, tjq) • Complete link: largest distance between an element in one cluster and an element in the other, i.e., dist(Ki, Kj) = max(tip, tjq) • Average: avg distance between an element in one cluster and an element in the other, i.e., dist(Ki, Kj) = avg(tip, tjq) • Centroid: distance between the centroids of two clusters, i.e., dist(Ki, Kj) = dist(Ci, Cj) • Medoid: distance between the medoids of two clusters, i.e., dist(Ki, Kj) = dist(Mi, Mj) • A medoid is a chosen, centrally located object in the cluster Distance between Clusters 545
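The single/complete/average linkage definitions translate directly into code. A minimal sketch with a caller-supplied distance function; the 1-D example clusters are our own:

```python
def single_link(K1, K2, dist):
    """Smallest distance between any element of K1 and any element of K2."""
    return min(dist(p, q) for p in K1 for q in K2)

def complete_link(K1, K2, dist):
    """Largest distance between any element of K1 and any element of K2."""
    return max(dist(p, q) for p in K1 for q in K2)

def average_link(K1, K2, dist):
    """Average distance over all cross-cluster pairs."""
    return sum(dist(p, q) for p in K1 for q in K2) / (len(K1) * len(K2))

# Toy 1-D clusters with absolute-difference distance
K1, K2 = [1.0, 2.0], [5.0, 9.0]
d = lambda p, q: abs(p - q)
```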
• 546. • Centroid: the “middle” of a cluster: Cm = Σi=1..N tip / N • Radius: square root of average distance from any point of the cluster to its centroid: Rm = √(Σi=1..N (tip − cm)² / N) • Diameter: square root of average mean squared distance between all pairs of points in the cluster: Dm = √(Σi=1..N Σq=1..N (tip − tiq)² / (N(N−1))) Centroid, Radius and Diameter of a Cluster (for numerical data sets) 546
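A quick numeric check of the three formulas (helper names are our own; the two-point cluster is chosen so the results are exact):

```python
import math

def centroid(points):
    """Cm: coordinate-wise mean of the cluster's points."""
    N = len(points)
    return tuple(sum(c) / N for c in zip(*points))

def radius(points):
    """Rm: sqrt of the average squared distance from each point to the centroid."""
    cm, N = centroid(points), len(points)
    return math.sqrt(sum(sum((x - c) ** 2 for x, c in zip(p, cm))
                         for p in points) / N)

def diameter(points):
    """Dm: sqrt of the average squared distance over all N(N-1) ordered pairs
    (the p == q terms contribute zero, so summing over all pairs is safe)."""
    N = len(points)
    total = sum(sum((x - y) ** 2 for x, y in zip(p, q))
                for p in points for q in points)
    return math.sqrt(total / (N * (N - 1)))

pts = [(0.0, 0.0), (2.0, 0.0)]
```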
  • 548. • Clustering based on density (local cluster criterion), such as density-connected points • Major features: • Discover clusters of arbitrary shape • Handle noise • One scan • Need density parameters as termination condition • Several interesting studies: • DBSCAN: Ester, et al. (KDD’96) • OPTICS: Ankerst, et al (SIGMOD’99). • DENCLUE: Hinneburg & D. Keim (KDD’98) • CLIQUE: Agrawal, et al. (SIGMOD’98) (more grid-based) Density-Based Clustering Methods 548
• 549. • Two parameters: • Eps: Maximum radius of the neighbourhood • MinPts: Minimum number of points in an Eps-neighbourhood of that point • NEps(q): {p belongs to D | dist(p,q) ≤ Eps} • Directly density-reachable: A point p is directly density-reachable from a point q w.r.t. Eps, MinPts if • p belongs to NEps(q) • core point condition: |NEps(q)| ≥ MinPts Density-Based Clustering: Basic Concepts 549 (figure: p directly density-reachable from core point q; MinPts = 5, Eps = 1 cm)
• 550. • Density-reachable: • A point p is density-reachable from a point q w.r.t. Eps, MinPts if there is a chain of points p1, …, pn, p1 = q, pn = p such that pi+1 is directly density-reachable from pi • Density-connected • A point p is density-connected to a point q w.r.t. Eps, MinPts if there is a point o such that both p and q are density-reachable from o w.r.t. Eps and MinPts Density-Reachable and Density-Connected 550 (figures: a chain of points linking q to p; a point o from which both p and q are density-reachable)
• 551. • Relies on a density-based notion of cluster: A cluster is defined as a maximal set of density-connected points • Discovers clusters of arbitrary shape in spatial databases with noise DBSCAN: Density-Based Spatial Clustering of Applications with Noise 551 (figure: core, border, and outlier points; Eps = 1 cm, MinPts = 5)
• 552. 1. Arbitrarily select a point p 2. Retrieve all points density-reachable from p w.r.t. Eps and MinPts 3. If p is a core point, a cluster is formed 4. If p is a border point, no points are density-reachable from p and DBSCAN visits the next point of the database 5. Continue the process until all of the points have been processed If a spatial index is used, the computational complexity of DBSCAN is O(n log n), where n is the number of database objects. Otherwise, the complexity is O(n²) DBSCAN: The Algorithm 552
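The steps above can be sketched as a minimal pure-Python DBSCAN (illustrative only; the brute-force `region_query` is O(n) per call, so this is the O(n²) variant — a spatial index would bring it toward O(n log n)). The toy two-blob dataset is our own:

```python
def region_query(points, q, eps):
    """N_Eps(q): indices of all points within distance eps of point q
    (including q itself, as in the core point condition)."""
    return [i for i, p in enumerate(points)
            if sum((a - b) ** 2 for a, b in zip(p, points[q])) <= eps ** 2]

def dbscan(points, eps, min_pts):
    """Return one label per point: a cluster id (0, 1, ...) or -1 for noise."""
    labels = [None] * len(points)
    cluster = -1
    for p in range(len(points)):
        if labels[p] is not None:
            continue
        neighbors = region_query(points, p, eps)
        if len(neighbors) < min_pts:
            labels[p] = -1              # noise (may later become a border point)
            continue
        cluster += 1                    # p is a core point: a cluster is formed
        labels[p] = cluster
        seeds = list(neighbors)
        while seeds:                    # expand the cluster
            q = seeds.pop()
            if labels[q] == -1:
                labels[q] = cluster     # noise -> border point, not expanded
            if labels[q] is not None:
                continue
            labels[q] = cluster
            q_neighbors = region_query(points, q, eps)
            if len(q_neighbors) >= min_pts:   # q is also a core point
                seeds.extend(q_neighbors)
    return labels

points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (50, 50)]                     # two blobs plus one outlier
labels = dbscan(points, eps=1.5, min_pts=3)
```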
  • 553. DBSCAN: Sensitive to Parameters 553 http://webdocs.cs.ualberta.ca/~yaling/Cluster/Applet/Code/Cluster.html
• 554. • OPTICS: Ordering Points To Identify the Clustering Structure • Ankerst, Breunig, Kriegel, and Sander (SIGMOD’99) • Produces a special order of the database w.r.t. its density-based clustering structure • This cluster-ordering contains information equivalent to the density-based clusterings corresponding to a broad range of parameter settings • Good for both automatic and interactive cluster analysis, including finding intrinsic clustering structure • Can be represented graphically or using visualization techniques OPTICS: A Cluster-Ordering Method (1999) 554
• 555. • Index-based: k = # of dimensions, N: # of points • Complexity: O(N log N) • Core Distance of an object p: the smallest value ε such that the ε-neighborhood of p has at least MinPts objects. Let Nε(p) be the ε-neighborhood of p, where ε is a distance value: Core-distanceε,MinPts(p) = undefined if card(Nε(p)) < MinPts; MinPts-distance(p) otherwise • Reachability Distance of object p from core object q is the minimum radius value that makes p density-reachable from q: Reachability-distanceε,MinPts(p, q) = undefined if q is not a core object; max(core-distance(q), distance(q, p)) otherwise OPTICS: Some Extension from DBSCAN 555
  • 556. Core Distance & Reachability Distance 556
  • 558. Density-Based Clustering: OPTICS & Applications: http://www.dbs.informatik.uni-muenchen.de/Forschung/KDD/Clustering/OPTICS/Demo 558
  • 560. • Using multi-resolution grid data structure • Several interesting methods • STING (a STatistical INformation Grid approach) by Wang, Yang and Muntz (1997) • WaveCluster by Sheikholeslami, Chatterjee, and Zhang (VLDB’98) • A multi-resolution clustering approach using wavelet method • CLIQUE: Agrawal, et al. (SIGMOD’98) • Both grid-based and subspace clustering Grid-Based Clustering Method 560
  • 561. • Wang, Yang and Muntz (VLDB’97) • The spatial area is divided into rectangular cells • There are several levels of cells corresponding to different levels of resolution STING: A Statistical Information Grid Approach 561
• 562. • Each cell at a high level is partitioned into a number of smaller cells in the next lower level • Statistical info of each cell is calculated and stored beforehand and is used to answer queries • Parameters of higher level cells can be easily calculated from parameters of lower level cells • count, mean, s (standard deviation), min, max • type of distribution—normal, uniform, etc. • Use a top-down approach to answer spatial data queries • Start from a pre-selected layer—typically with a small number of cells • For each cell in the current level compute the confidence interval The STING Clustering Method 562
  • 563. • Remove the irrelevant cells from further consideration • When finish examining the current layer, proceed to the next lower level • Repeat this process until the bottom layer is reached • Advantages: • Query-independent, easy to parallelize, incremental update • O(K), where K is the number of grid cells at the lowest level • Disadvantages: • All the cluster boundaries are either horizontal or vertical, and no diagonal boundary is detected STING Algorithm and Its Analysis 563
  • 564. • Agrawal, Gehrke, Gunopulos, Raghavan (SIGMOD’98) • Automatically identifying subspaces of a high dimensional data space that allow better clustering than original space • CLIQUE can be considered as both density-based and grid-based • It partitions each dimension into the same number of equal length interval • It partitions an m-dimensional data space into non- overlapping rectangular units • A unit is dense if the fraction of total data points contained in the unit exceeds the input model parameter • A cluster is a maximal set of connected dense units within a subspace CLIQUE (Clustering In QUEst) 564
  • 565. 1. Partition the data space and find the number of points that lie inside each cell of the partition. 2. Identify the subspaces that contain clusters using the Apriori principle 3. Identify clusters 1. Determine dense units in all subspaces of interests 2. Determine connected dense units in all subspaces of interests. 4. Generate minimal description for the clusters 1. Determine maximal regions that cover a cluster of connected dense units for each cluster 2. Determination of minimal cover for each cluster CLIQUE: The Major Steps 565
• 566. 566 (figure: CLIQUE example: dense units found in the (age, salary) and (age, vacation) subspaces with density threshold τ = 3; their intersection yields a candidate cluster in the 3-D (age, salary, vacation) space)
  • 567. • Strength • automatically finds subspaces of the highest dimensionality such that high density clusters exist in those subspaces • insensitive to the order of records in input and does not presume some canonical data distribution • scales linearly with the size of input and has good scalability as the number of dimensions in the data increases • Weakness • The accuracy of the clustering result may be degraded at the expense of simplicity of the method Strength and Weakness of CLIQUE 567
• 568. 4.3 Association Algorithms 4.3.1 Frequent Itemset Mining Methods 4.3.2 Pattern Evaluation Methods 568
  • 569. • Frequent pattern: a pattern (a set of items, subsequences, substructures, etc.) that occurs frequently in a data set • First proposed by Agrawal, Imielinski, and Swami [AIS93] in the context of frequent itemsets and association rule mining • Motivation: Finding inherent regularities in data • What products were often purchased together?— Beer and diapers?! • What are the subsequent purchases after buying a PC? • What kinds of DNA are sensitive to this new drug? • Can we automatically classify web documents? • Applications • Basket data analysis, cross-marketing, catalog design, sale campaign analysis, Web log (click stream) analysis, and DNA sequence analysis. What Is Frequent Pattern Analysis? 569
  • 570. • Freq. pattern: An intrinsic and important property of datasets • Foundation for many essential data mining tasks • Association, correlation, and causality analysis • Sequential, structural (e.g., sub-graph) patterns • Pattern analysis in spatiotemporal, multimedia, time- series, and stream data • Classification: discriminative, frequent pattern analysis • Cluster analysis: frequent pattern-based clustering • Data warehousing: iceberg cube and cube-gradient • Semantic data compression: fascicles • Broad applications Why Is Freq. Pattern Mining Important? 570
• 571. • itemset: A set of one or more items • k-itemset X = {x1, …, xk} • (absolute) support, or, support count of X: Frequency or occurrence of an itemset X • (relative) support, s, is the fraction of transactions that contains X (i.e., the probability that a transaction contains X) • An itemset X is frequent if X’s support is no less than a minsup threshold Basic Concepts: Frequent Patterns 571 (Venn diagram: customers buying beer, diaper, or both) Tid Items bought 10 Beer, Nuts, Diaper 20 Beer, Coffee, Diaper 30 Beer, Diaper, Eggs 40 Nuts, Eggs, Milk 50 Nuts, Coffee, Diaper, Eggs, Milk
• 572. • Find all the rules X → Y with minimum support and confidence • support, s, probability that a transaction contains X ∪ Y • confidence, c, conditional probability that a transaction having X also contains Y Let minsup = 50%, minconf = 50% Freq. Pat.: Beer:3, Nuts:3, Diaper:4, Eggs:3, {Beer, Diaper}:3 Basic Concepts: Association Rules 572 (Venn diagram: customers buying beer, diaper, or both) Tid Items bought 10 Beer, Nuts, Diaper 20 Beer, Coffee, Diaper 30 Beer, Diaper, Eggs 40 Nuts, Eggs, Milk 50 Nuts, Coffee, Diaper, Eggs, Milk • Association rules: (many more!) • Beer → Diaper (60%, 100%) • Diaper → Beer (60%, 75%)
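The support and confidence figures on this slide can be reproduced directly from the five transactions. A small illustrative sketch:

```python
transactions = [
    {'Beer', 'Nuts', 'Diaper'},
    {'Beer', 'Coffee', 'Diaper'},
    {'Beer', 'Diaper', 'Eggs'},
    {'Nuts', 'Eggs', 'Milk'},
    {'Nuts', 'Coffee', 'Diaper', 'Eggs', 'Milk'},
]

def support(itemset):
    """Relative support: fraction of transactions containing the whole itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(X, Y):
    """c(X -> Y) = support(X U Y) / support(X), i.e. P(Y | X)."""
    return support(X | Y) / support(X)
```

With minsup = minconf = 50%, both Beer → Diaper and Diaper → Beer qualify, matching the percentages on the slide.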
• 573. • A long pattern contains a combinatorial number of sub-patterns, e.g., {a1, …, a100} contains C(100,1) + C(100,2) + … + C(100,100) = 2^100 − 1 ≈ 1.27×10^30 sub-patterns! • Solution: Mine closed patterns and max-patterns instead • An itemset X is closed if X is frequent and there exists no super-pattern Y ⊃ X with the same support as X (proposed by Pasquier, et al. @ ICDT’99) • An itemset X is a max-pattern if X is frequent and there exists no frequent super-pattern Y ⊃ X (proposed by Bayardo @ SIGMOD’98) • Closed pattern is a lossless compression of freq. patterns • Reducing the # of patterns and rules Closed Patterns and Max-Patterns 573
  • 574. • Exercise. DB = {<a1, …, a100>, < a1, …, a50>} • Min_sup = 1. • What is the set of closed itemset? • <a1, …, a100>: 1 • < a1, …, a50>: 2 • What is the set of max-pattern? • <a1, …, a100>: 1 • What is the set of all patterns? • !! Closed Patterns and Max-Patterns 574
• 575. • How many itemsets can potentially be generated in the worst case? • The number of frequent itemsets to be generated is sensitive to the minsup threshold • When minsup is low, there exist potentially an exponential number of frequent itemsets • The worst case: M^N, where M: # distinct items, and N: max length of transactions • The worst-case complexity vs. the expected probability Ex. Suppose Walmart has 10^4 kinds of products • The chance to pick up one product: 10^−4 • The chance to pick up a particular set of 10 products: ~10^−40 • What is the chance this particular set of 10 products is frequent 10^3 times in 10^9 transactions? Computational Complexity of Frequent Itemset Mining 575
  • 576. 4.3.1 Frequent Itemset Mining Methods 576
  • 577. • Apriori: A Candidate Generation-and-Test Approach • Improving the Efficiency of Apriori • FPGrowth: A Frequent Pattern-Growth Approach • ECLAT: Frequent Pattern Mining with Vertical Data Format Scalable Frequent Itemset Mining Methods 577
  • 578. • The downward closure property of frequent patterns • Any subset of a frequent itemset must be frequent • If {beer, diaper, nuts} is frequent, so is {beer, diaper} • i.e., every transaction having {beer, diaper, nuts} also contains {beer, diaper} • Scalable mining methods: Three major approaches • Apriori (Agrawal & Srikant@VLDB’94) • Freq. pattern growth (FPgrowth—Han, Pei & Yin @SIGMOD’00) • Vertical data format approach (Charm—Zaki & Hsiao @SDM’02) The Downward Closure Property and Scalable Mining Methods 578
  • 579. • Apriori pruning principle: If there is any itemset which is infrequent, its superset should not be generated/tested! (Agrawal & Srikant @VLDB’94, Mannila, et al. @ KDD’ 94) • Method: 1. Initially, scan DB once to get frequent 1-itemset 2. Generate length (k+1) candidate itemsets from length k frequent itemsets 3. Test the candidates against DB 4. Terminate when no frequent or candidate set can be generated Apriori: A Candidate Generation & Test Approach 579
• 580. The Apriori Algorithm—An Example (Supmin = 2) 580 Database TDB: Tid 10: {A, C, D}; Tid 20: {B, C, E}; Tid 30: {A, B, C, E}; Tid 40: {B, E} 1st scan → C1: {A}:2, {B}:3, {C}:3, {D}:1, {E}:3 → L1: {A}:2, {B}:3, {C}:3, {E}:3 Self-join → C2: {A,B}, {A,C}, {A,E}, {B,C}, {B,E}, {C,E} 2nd scan → C2 counts: {A,B}:1, {A,C}:2, {A,E}:1, {B,C}:2, {B,E}:3, {C,E}:2 → L2: {A,C}:2, {B,C}:2, {B,E}:3, {C,E}:2 Self-join → C3: {B,C,E} 3rd scan → L3: {B,C,E}:2
• 581. The Apriori Algorithm (Pseudo-Code) 581
Ck: candidate itemset of size k; Lk: frequent itemset of size k
L1 = {frequent items};
for (k = 1; Lk != ∅; k++) do begin
  Ck+1 = candidates generated from Lk;
  for each transaction t in database do
    increment the count of all candidates in Ck+1 that are contained in t;
  Lk+1 = candidates in Ck+1 with min_support;
end
return ∪k Lk;
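The pseudo-code above can be turned into a minimal runnable sketch (for brevity this version re-tests all joined candidates against the database and omits the subset-pruning optimization; transactions are the TDB example from the previous slide):

```python
def apriori(transactions, min_sup):
    """Level-wise Apriori: generate candidates of size k+1 from L_k,
    then test them against the database."""
    items = sorted({i for t in transactions for i in t})
    freq = {}          # frequent itemset (sorted tuple) -> support count
    Lk = []
    # L1: frequent 1-itemsets
    for i in items:
        sup = sum(1 for t in transactions if i in t)
        if sup >= min_sup:
            Lk.append((i,)); freq[(i,)] = sup
    k = 1
    while Lk:
        # join pairs of frequent k-itemsets into (k+1)-candidates
        Ck1 = {tuple(sorted(set(a) | set(b)))
               for a in Lk for b in Lk if len(set(a) | set(b)) == k + 1}
        Lk = []
        for c in Ck1:   # scan DB to count each candidate's support
            sup = sum(1 for t in transactions if set(c) <= t)
            if sup >= min_sup:
                Lk.append(c); freq[c] = sup
        k += 1
    return freq

db = [{'A', 'C', 'D'}, {'B', 'C', 'E'}, {'A', 'B', 'C', 'E'}, {'B', 'E'}]
result = apriori(db, min_sup=2)   # reproduces L1, L2, L3 of the example
```

Running it reproduces the example: {B,C,E} is frequent with support 2, while {A,B} (support 1) never survives the test.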
  • 582. • How to generate candidates? • Step 1: self-joining Lk • Step 2: pruning • Example of Candidate-generation • L3={abc, abd, acd, ace, bcd} • Self-joining: L3*L3 • abcd from abc and abd • acde from acd and ace • Pruning: • acde is removed because ade is not in L3 • C4 = {abcd} Implementation of Apriori 582
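The two-step candidate generation above (self-join, then prune) can be sketched directly; the example uses the slide's L3 = {abc, abd, acd, ace, bcd}:

```python
from itertools import combinations

def gen_candidates(Lk):
    """Self-join L_k on the first k-1 items, then prune any candidate
    with an infrequent k-subset (the Apriori property)."""
    Lk = sorted(Lk)
    k = len(Lk[0])
    joined = set()
    for a in Lk:
        for b in Lk:
            # join itemsets sharing their first k-1 items
            if a[:k-1] == b[:k-1] and a[k-1] < b[k-1]:
                joined.add(a + (b[k-1],))
    Lk_set = set(Lk)
    # prune: every k-subset of a candidate must itself be frequent
    return {c for c in joined
            if all(sub in Lk_set for sub in combinations(c, k))}

L3 = [tuple('abc'), tuple('abd'), tuple('acd'), tuple('ace'), tuple('bcd')]
C4 = gen_candidates(L3)   # acde is pruned because ade is not in L3
```

As on the slide, the join produces abcd and acde, and pruning removes acde, leaving C4 = {abcd}.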
• 583. • Why is counting supports of candidates a problem? • The total number of candidates can be huge • One transaction may contain many candidates • Method: • Candidate itemsets are stored in a hash-tree • Leaf node of hash-tree contains a list of itemsets and counts • Interior node contains a hash table • Subset function: finds all the candidates contained in a transaction How to Count Supports of Candidates? 583
• 584. Counting Supports of Candidates Using Hash Tree 584 [Figure: a hash tree whose interior nodes hash items into three branches (1,4,7 / 2,5,8 / 3,6,9) and whose leaves hold candidate 3-itemsets with counts; the subset function recursively decomposes transaction 1 2 3 5 6 into 1 + 2 3 5 6, 1 2 + 3 5 6, 1 3 + 5 6, … to locate every candidate leaf the transaction contains]
  • 585. • SQL Implementation of candidate generation • Suppose the items in Lk-1 are listed in an order • Step 1: self-joining Lk-1 insert into Ck select p.item1, p.item2, …, p.itemk-1, q.itemk-1 from Lk-1 p, Lk-1 q where p.item1=q.item1, …, p.itemk-2=q.itemk-2, p.itemk-1 < q.itemk-1 • Step 2: pruning forall itemsets c in Ck do forall (k-1)-subsets s of c do if (s is not in Lk-1) then delete c from Ck • Use object-relational extensions like UDFs, BLOBs, and Table functions for efficient implementation (S. Sarawagi, S. Thomas, and R. Agrawal. Integrating association rule mining with relational database systems: Alternatives and implications. SIGMOD’98) Candidate Generation: An SQL Implementation 585
• 586. • Bottlenecks of the Apriori approach • Breadth-first (i.e., level-wise) search • Candidate generation and test • Often generates a huge number of candidates • The FPGrowth Approach (J. Han, J. Pei, and Y. Yin, SIGMOD’00) • Depth-first search • Avoid explicit candidate generation • Major philosophy: Grow long patterns from short ones using local frequent items only • “abc” is a frequent pattern • Get all transactions having “abc”, i.e., project DB on abc: DB|abc • “d” is a local frequent item in DB|abc → abcd is a frequent pattern Pattern-Growth Approach: Mining Frequent Patterns Without Candidate Generation 586
• 587. Construct FP-tree from a Transaction Database 587 (min_support = 3)
TID: Items bought → (ordered) frequent items:
100: {f, a, c, d, g, i, m, p} → {f, c, a, m, p}
200: {a, b, c, f, l, m, o} → {f, c, a, b, m}
300: {b, f, h, j, o, w} → {f, b}
400: {b, c, k, s, p} → {c, b, p}
500: {a, f, c, e, l, p, m, n} → {f, c, a, m, p}
1. Scan DB once, find frequent 1-itemsets (single-item patterns)
2. Sort frequent items in frequency descending order: the f-list (F-list = f-c-a-b-m-p)
3. Scan DB again, construct the FP-tree
Header table (item: frequency): f:4, c:4, a:3, b:3, m:3, p:3
[Figure: the resulting FP-tree rooted at {}, with paths f:4-c:3-a:3-m:2-p:2, a branch b:1-m:1 under a:3, a branch b:1 under f:4, and a separate path c:1-b:1-p:1]
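The first two steps (count items, keep those meeting min_support, and rewrite each transaction in f-list order) can be sketched as follows; note that ties in frequency (e.g. f vs. c, both 4) are broken arbitrarily here, while the slide fixes the order f-c-a-b-m-p:

```python
from collections import Counter

transactions = [
    ['f', 'a', 'c', 'd', 'g', 'i', 'm', 'p'],
    ['a', 'b', 'c', 'f', 'l', 'm', 'o'],
    ['b', 'f', 'h', 'j', 'o', 'w'],
    ['b', 'c', 'k', 's', 'p'],
    ['a', 'f', 'c', 'e', 'l', 'p', 'm', 'n'],
]
min_support = 3

# Scan 1: count items and keep only those meeting min_support
counts = Counter(i for t in transactions for i in t)
f_list = [i for i, c in counts.most_common() if c >= min_support]

# Scan 2: drop infrequent items and order the rest by the f-list
ordered = [[i for i in f_list if i in t] for t in transactions]
```

The `ordered` lists are exactly the "(ordered) frequent items" column above; inserting them one by one into a prefix tree yields the FP-tree.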
• 588. • Frequent patterns can be partitioned into subsets according to the f-list • F-list = f-c-a-b-m-p • Patterns containing p • Patterns having m but no p • … • Patterns having c but none of a, b, m, p • Pattern f • Completeness and non-redundancy Partition Patterns and Databases 588
• 589. • Starting at the frequent item header table in the FP-tree • Traverse the FP-tree by following the link of each frequent item p • Accumulate all of the transformed prefix paths of item p to form p’s conditional pattern base Find Patterns Having P From P-conditional Database 589
Conditional pattern bases (item: cond. pattern base): c: f:3 | a: fc:3 | b: fca:1, f:1, c:1 | m: fca:2, fcab:1 | p: fcam:2, cb:1
[Figure: the FP-tree and header table (f:4, c:4, a:3, b:3, m:3, p:3) from which the bases are read]
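Reading the same conditional pattern bases from the ordered transactions (rather than by following node-links in the tree) is a convenient way to check the table; the prefix of each occurrence of an item is one prefix path:

```python
from collections import Counter

# Ordered frequent-item lists from the construction slide (F-list f-c-a-b-m-p)
ordered = [
    ['f', 'c', 'a', 'm', 'p'],
    ['f', 'c', 'a', 'b', 'm'],
    ['f', 'b'],
    ['c', 'b', 'p'],
    ['f', 'c', 'a', 'm', 'p'],
]

def conditional_pattern_base(item):
    """Multiset of prefix paths preceding `item`, equivalent to what the
    FP-tree's node-links for `item` would accumulate."""
    base = Counter()
    for t in ordered:
        if item in t:
            prefix = tuple(t[:t.index(item)])
            if prefix:
                base[prefix] += 1
    return base

cpb_m = conditional_pattern_base('m')   # {fca: 2, fcab: 1}
cpb_p = conditional_pattern_base('p')   # {fcam: 2, cb: 1}
```

Both results match the table above, e.g. m's base is fca:2, fcab:1.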
• 590. • For each pattern-base • Accumulate the count for each item in the base • Construct the FP-tree for the frequent items of the pattern base From Conditional Pattern-bases to Conditional FP-trees 590
m-conditional pattern base: fca:2, fcab:1 → m-conditional FP-tree: {} - f:3 - c:3 - a:3 (b is dropped, below min_support)
All frequent patterns relating to m: m, fm, cm, am, fcm, fam, cam, fcam
[Figure: the global FP-tree with header table f:4, c:4, a:3, b:3, m:3, p:3 alongside the derived m-conditional FP-tree]
• 591. Recursion: Mining Each Conditional FP-tree 591
m-conditional FP-tree: {} - f:3 - c:3 - a:3
Cond. pattern base of “am”: (fc:3) → am-conditional FP-tree: {} - f:3 - c:3
Cond. pattern base of “cm”: (f:3) → cm-conditional FP-tree: {} - f:3
Cond. pattern base of “cam”: (f:3) → cam-conditional FP-tree: {} - f:3
• 592. • Suppose a (conditional) FP-tree T has a shared single prefix-path P • Mining can be decomposed into two parts • Reduction of the single prefix path into one node • Concatenation of the mining results of the two parts A Special Case: Single Prefix Path in FP-tree 592
[Figure: a tree whose single prefix path a1:n1 - a2:n2 - a3:n3 sits above branches b1:m1, C1:k1, C2:k2, C3:k3; it is decomposed into the prefix path plus a branching subtree r1, and the results are concatenated]
• 593. • Completeness • Preserve complete information for frequent pattern mining • Never break a long pattern of any transaction • Compactness • Reduce irrelevant info—infrequent items are gone • Items in frequency descending order: the more frequently occurring, the more likely to be shared • Never larger than the original database (not counting node-links and the count field) Benefits of the FP-tree Structure 593
  • 594. • Idea: Frequent pattern growth • Recursively grow frequent patterns by pattern and database partition • Method 1. For each frequent item, construct its conditional pattern-base, and then its conditional FP-tree 2. Repeat the process on each newly created conditional FP-tree 3. Until the resulting FP-tree is empty, or it contains only one path—single path will generate all the combinations of its sub-paths, each of which is a frequent pattern The Frequent Pattern Growth Mining Method 594
• 595. • What if the FP-tree cannot fit in memory? → DB projection • First partition a database into a set of projected DBs • Then construct and mine an FP-tree for each projected DB • Parallel projection vs. partition projection techniques • Parallel projection • Project the DB in parallel for each frequent item • Parallel projection is space costly • All the partitions can be processed in parallel • Partition projection • Partition the DB based on the ordered frequent items • Passing the unprocessed parts to the subsequent partitions Scaling FP-growth by Database Projection 595
• 596. • Parallel projection needs a lot of disk space • Partition projection saves it Partition-Based Projection 596
Tran. DB: fcamp, fcabm, fb, cbp, fcamp
p-proj DB: fcam, cb, fcam | m-proj DB: fcab, fca, fca | b-proj DB: f, cb, … | a-proj DB: fc, … | c-proj DB: f, … | f-proj DB: …
am-proj DB: fc, fc, fc | cm-proj DB: f, f, f | …
• 597. FP-Growth vs. Apriori: Scalability With the Support Threshold 597 [Chart: run time (sec., 0-100) vs. support threshold (%, 0-3) on data set T25I20D10K, comparing D1 FP-growth runtime with D1 Apriori runtime; FP-growth scales far better as the threshold decreases]
• 598. FP-Growth vs. Tree-Projection: Scalability with the Support Threshold 598 [Chart: runtime (sec., 0-140) vs. support threshold (%, 0-2) on data set T25I20D100K, comparing D2 FP-growth with D2 TreeProjection; FP-growth remains faster at low thresholds]
• 599. • Divide-and-conquer: • Decompose both the mining task and DB according to the frequent patterns obtained so far • Lead to focused search of smaller databases • Other factors • No candidate generation, no candidate test • Compressed database: FP-tree structure • No repeated scan of entire database • Basic ops: counting local frequent items and building sub-FP-trees, no pattern search and matching • A good open-source implementation and refinement of FPGrowth • FPGrowth+ (Grahne and J. Zhu, FIMI'03) Advantages of the Pattern Growth Approach 599
  • 600. • AFOPT (Liu, et al. @ KDD’03) • A “push-right” method for mining condensed frequent pattern (CFP) tree • Carpenter (Pan, et al. @ KDD’03) • Mine data sets with small rows but numerous columns • Construct a row-enumeration tree for efficient mining • FPgrowth+ (Grahne and Zhu, FIMI’03) • Efficiently Using Prefix-Trees in Mining Frequent Itemsets, Proc. ICDM'03 Int. Workshop on Frequent Itemset Mining Implementations (FIMI'03), Melbourne, FL, Nov. 2003 • TD-Close (Liu, et al, SDM’06) Further Improvements of Mining Methods 600
• 601. • Mining closed frequent itemsets and max-patterns • CLOSET (DMKD’00), FPclose, and FPMax (Grahne & Zhu, FIMI’03) • Mining sequential patterns • PrefixSpan (ICDE’01), CloSpan (SDM’03), BIDE (ICDE’04) • Mining graph patterns • gSpan (ICDM’02), CloseGraph (KDD’03) • Constraint-based mining of frequent patterns • Convertible constraints (ICDE’01), gPrune (PAKDD’03) • Computing iceberg data cubes with complex measures • H-tree, H-cubing, and Star-cubing (SIGMOD’01, VLDB’03) • Pattern-growth-based Clustering • MaPle (Pei, et al., ICDM’03) • Pattern-Growth-Based Classification • Mining frequent and discriminative patterns (Cheng, et al, ICDE’07) Extension of Pattern Growth Mining Methodology 601
• 602. 1. Prepare the dataset 2. Find the frequent itemsets (items that occur often) 3. Sort the dataset by priority 4. Build the FP-tree from the sorted items 5. Generate the conditional pattern base 6. Generate the conditional FP-tree 7. Generate the frequent patterns 8. Compute support 9. Compute confidence Steps of the FP-Growth Algorithm 602
• 604. 2. Finding the Frequent Itemsets 604
• 605. 3. Sorting the Dataset by Priority 605
• 607. 5. Generating the Conditional Pattern Base 607
• 609. 7. Generating the Frequent Patterns 609
• 611. 8. Computing the Support of 2-Itemsets 611
• 612. 9. Computing the Confidence of 2-Itemsets 612
  • 613. 4.3.2 Pattern Evaluation Methods 613
• 614. • play basketball ⇒ eat cereal [40%, 66.7%] is misleading • The overall % of students eating cereal is 75% > 66.7% • play basketball ⇒ not eat cereal [20%, 33.3%] is more accurate, although with lower support and confidence • Measure of dependent/correlated events: lift Interestingness Measure: Correlations (Lift) 614
lift = P(A∪B) / (P(A) P(B))
Contingency table (Basketball / Not basketball / Sum(row)): Cereal: 2000 / 1750 / 3750; Not cereal: 1000 / 250 / 1250; Sum(col.): 3000 / 2000 / 5000
lift(B, C) = (2000/5000) / ((3000/5000) × (3750/5000)) = 0.89
lift(B, ¬C) = (1000/5000) / ((3000/5000) × (1250/5000)) = 1.33
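The two lift values above can be checked directly from the 2×2 table (a quick sketch; the counts are the slide's):

```python
# Contingency counts from the slide (5000 students total)
n = 5000
basketball = 3000                    # play basketball
cereal = 3750                        # eat cereal
both = 2000                          # play basketball AND eat cereal
not_cereal = n - cereal              # 1250
bb_not_cereal = basketball - both    # 1000: basketball AND not cereal

def lift(n_ab, n_a, n_b, n):
    """lift(A,B) = P(A and B) / (P(A) * P(B)); 1 means independent,
    < 1 negatively correlated, > 1 positively correlated."""
    return (n_ab / n) / ((n_a / n) * (n_b / n))

lift_bc = lift(both, basketball, cereal, n)                 # ~0.89
lift_bnc = lift(bb_not_cereal, basketball, not_cereal, n)   # ~1.33
```

Since lift(B, C) < 1, playing basketball and eating cereal are negatively correlated, which is why the 66.7%-confidence rule is misleading.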
• 615. • “Buy walnuts ⇒ buy milk [1%, 80%]” is misleading if 85% of customers buy milk • Support and confidence are not good indicators of correlation • Over 20 interestingness measures have been proposed (see Tan, Kumar, Srivastava @KDD’02) • Which are good ones? Are lift and χ² Good Measures of Correlation? 615
• 617. • Null-(transaction) invariance is crucial for correlation analysis • Lift and χ² are not null-invariant • 5 null-invariant measures Comparison of Interestingness Measures 617
Contingency table (Milk / No Milk / Sum(row)): Coffee: m,c / ~m,c / c; No Coffee: m,~c / ~m,~c / ~c; Sum(col.): m / ~m / Σ
• Null-transactions w.r.t. m and c • Null-invariant measures can be subtle: they disagree • Kulczynski measure (1927)
  • 618. Analysis of DBLP Coauthor Relationships 618 Advisor-advisee relation: Kulc: high, coherence: low, cosine: middle Recent DB conferences, removing balanced associations, low sup, etc. Tianyi Wu, Yuguo Chen and Jiawei Han, “Association Mining in Large Databases: A Re-Examination of Its Measures”, Proc. 2007 Int. Conf. Principles and Practice of Knowledge Discovery in Databases (PKDD'07), Sept. 2007
• 619. • IR (Imbalance Ratio): measures the imbalance of two itemsets A and B in rule implications • Kulczynski and Imbalance Ratio (IR) together present a clear picture for all three datasets D4 through D6 • D4 is balanced & neutral • D5 is imbalanced & neutral • D6 is very imbalanced & neutral Which Null-Invariant Measure Is Better? 619
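Both measures are simple ratios of support counts; the milk/coffee counts below are hypothetical, chosen only to illustrate an imbalanced rule (a sketch, not the slide's D4-D6 data):

```python
def kulczynski(n_ab, n_a, n_b):
    """Kulc(A,B) = 0.5 * (P(A|B) + P(B|A)); null-invariant because
    transactions containing neither A nor B never enter the formula."""
    return 0.5 * (n_ab / n_a + n_ab / n_b)

def imbalance_ratio(n_ab, n_a, n_b):
    """IR(A,B) = |sup(A) - sup(B)| / (sup(A) + sup(B) - sup(A and B));
    0 for perfectly balanced itemsets, approaching 1 as imbalance grows."""
    return abs(n_a - n_b) / (n_a + n_b - n_ab)

# Hypothetical counts: 1000 transactions with milk, 100 with coffee,
# 100 with both (every coffee buyer also buys milk)
kulc = kulczynski(100, 1000, 100)      # 0.5 * (0.1 + 1.0) = 0.55
ir = imbalance_ratio(100, 1000, 100)   # 900 / 1000 = 0.9
```

A near-neutral Kulc (0.55) together with a high IR (0.9) signals exactly the kind of imbalanced rule the slide warns about.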
• 620. • Do the experiment following Matthew North, Data Mining for the Masses 2nd Edition, 2016, Chapter 5 (Association Rules), pp. 85-97 • Analyze how data mining can help Roger, a city manager Exercise 620
  • 621. • Motivation: • Roger is a city manager for a medium-sized, but steadily growing city • The city has limited resources, and like most municipalities, there are more needs than there are resources • He feels like the citizens in the community are fairly active in various community organizations, and believes that he may be able to get a number of groups to work together to meet some of the needs in the community • He knows there are churches, social clubs, hobby enthusiasts and other types of groups in the community • What he doesn’t know is if there are connections between the groups that might enable natural collaborations between two or more groups that could work together on projects around town • Objectives: • To find out if there are any existing associations between the different types of groups in the area 1. Business Understanding 621
• 622. 4.4 Estimation and Forecasting Algorithms 4.4.1 Linear Regression 4.4.2 Time Series Forecasting 622
• 624. 1. Prepare the data 2. Identify the attributes and the label 3. Compute X², Y², XY and their totals 4. Compute a and b with the given equations 5. Build the simple linear regression model Steps of the Linear Regression Algorithm 624
• 625. 1. Data Preparation 625 Date / Average Room Temperature (X) / Defect Count (Y): 1: 24, 10 | 2: 22, 5 | 3: 21, 6 | 4: 20, 3 | 5: 22, 6 | 6: 19, 4 | 7: 20, 5 | 8: 23, 9 | 9: 24, 11 | 10: 25, 13
• 626. 2. Identify the Attributes and the Label 626 Y = a + bX where: Y = dependent variable, X = independent variable, a = constant (intercept), b = regression coefficient (slope): the change in the response produced by the variable X
a = ((Σy)(Σx²) - (Σx)(Σxy)) / (n(Σx²) - (Σx)²)
b = (n(Σxy) - (Σx)(Σy)) / (n(Σx²) - (Σx)²)
• 627. 3. Compute X², Y², XY and Their Totals 627 Date / X / Y / X² / Y² / XY: 1: 24, 10, 576, 100, 240 | 2: 22, 5, 484, 25, 110 | 3: 21, 6, 441, 36, 126 | 4: 20, 3, 400, 9, 60 | 5: 22, 6, 484, 36, 132 | 6: 19, 4, 361, 16, 76 | 7: 20, 5, 400, 25, 100 | 8: 23, 9, 529, 81, 207 | 9: 24, 11, 576, 121, 264 | 10: 25, 13, 625, 169, 325 | Totals: ΣX = 220, ΣY = 72, ΣX² = 4876, ΣY² = 618, ΣXY = 1640
• 628. • Compute the constant (a): a = ((Σy)(Σx²) - (Σx)(Σxy)) / (n(Σx²) - (Σx)²) = ((72)(4876) - (220)(1640)) / (10(4876) - (220)²) = -27.02 • Compute the regression coefficient (b): b = (n(Σxy) - (Σx)(Σy)) / (n(Σx²) - (Σx)²) = (10(1640) - (220)(72)) / (10(4876) - (220)²) = 1.56 4. Compute a and b with the Given Equations 628
• 629. Y = a + bX → Y = -27.02 + 1.56X 5. Build the Simple Linear Regression Model 629
• 630. 1. Predict the defect count when the temperature is high (variable X), e.g. 30°C: Y = -27.02 + 1.56X = -27.02 + 1.56(30) = 19.78 2. If the target defect count (variable Y) may only be 5 units, what room temperature is needed to reach that target? 5 = -27.02 + 1.56X → 1.56X = 5 + 27.02 → X = 32.02/1.56 = 20.52. So the predicted room temperature that best meets the defect target is about 20.52°C Testing 630
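The whole worked example can be verified in a few lines. Note one subtlety: with unrounded coefficients the answers are 19.64 and 20.59; the slide's 19.78 and 20.52 come from rounding a and b to -27.02 and 1.56 first:

```python
# Data from the worked example: room temperature (X) vs. defect count (Y)
X = [24, 22, 21, 20, 22, 19, 20, 23, 24, 25]
Y = [10, 5, 6, 3, 6, 4, 5, 9, 11, 13]

n = len(X)
sx, sy = sum(X), sum(Y)               # 220, 72
sxx = sum(x * x for x in X)           # 4876
sxy = sum(x * y for x, y in zip(X, Y))  # 1640

# Least-squares coefficients, matching the slide's formulas
b = (n * sxy - sx * sy) / (n * sxx - sx ** 2)       # ~1.56
a = (sy * sxx - sx * sxy) / (n * sxx - sx ** 2)     # ~-27.02

predict = lambda x: a + b * x
y_at_30 = predict(30)      # defect count predicted at 30 degrees C
x_for_5 = (5 - a) / b      # temperature needed for a target of 5 defects
```

Rounded to two decimals, a = -27.02 and b = 1.56 as on the slide.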
• 631. 7.1.2 Case Study: CRISP-DM Heating Oil Consumption Estimation (Matthew North, Data Mining for the Masses, 2012, Chapter 8 Estimation, pp. 127-140) Dataset: HeatingOil-Training.csv and HeatingOil-Scoring.csv 631
• 632. • Do the experiment following Matthew North, Data Mining for the Masses, 2012, Chapter 8 Estimation, pp. 127-140, on Heating Oil Consumption • Dataset: HeatingOil-Training.csv and HeatingOil-Scoring.csv Exercise 632
• 634. • Sarah, the regional sales manager, is back for more help • Business is booming; her sales team is signing up thousands of new clients, and she wants to be sure the company will be able to meet this new level of demand, so she is now hoping we can help her do some prediction as well • She knows that there is some correlation between the attributes in her data set (things like temperature, insulation, and occupant ages), and she’s now wondering if she can use the previous data set to predict heating oil usage for new customers • You see, these new customers haven’t begun consuming heating oil yet; there are a lot of them (42,650 to be exact), and she wants to know how much oil she needs to expect to keep in stock in order to meet these new customers’ demand • Can she use data mining to examine household attributes and known past consumption quantities to anticipate and meet her new customers’ needs? Context and Perspective 634
  • 635. • Sarah’s new data mining objective is pretty clear: she wants to anticipate demand for a consumable product • We will use a linear regression model to help her with her desired predictions • She has data, 1,218 observations that give an attribute profile for each home, along with those homes’ annual heating oil consumption • She wants to use this data set as training data to predict the usage that 42,650 new clients will bring to her company • She knows that these new clients’ homes are similar in nature to her existing client base, so the existing customers’ usage behavior should serve as a solid gauge for predicting future usage by new customers. 1. Business Understanding 635
• 636. We create a data set comprised of the following attributes: • Insulation: This is a density rating, ranging from one to ten, indicating the thickness of each home’s insulation. A home with a density rating of one is poorly insulated, while a home with a density of ten has excellent insulation • Temperature: This is the average outdoor ambient temperature at each home for the most recent year, measured in degrees Fahrenheit • Heating_Oil: This is the total number of units of heating oil purchased by the owner of each home in the most recent year • Num_Occupants: This is the total number of occupants living in each home • Avg_Age: This is the average age of those occupants • Home_Size: This is a rating, on a scale of one to eight, of the home’s overall size. The higher the number, the larger the home 2. Data Understanding 636
  • 637. • A CSV data set for this chapter’s example is available for download at the book’s companion web site (https://sites.google.com/site/dataminingforthemasses/) 3. Data Preparation 637
  • 647. 4.4.2 Time Series Forecasting 647
  • 648. • Time series forecasting is one of the oldest known predictive analytics techniques • It has existed and been in widespread use even before the term “predictive analytics” was ever coined • Independent or predictor variables are not strictly necessary for univariate time series forecasting, but are strongly recommended for multivariate time series • Time series forecasting methods: 1. Data Driven Method: There is no difference between a predictor and a target. Techniques such as time series averaging or smoothing are considered data-driven approaches to time series forecasting 2. Model Driven Method: Similar to “conventional” predictive models, which have independent and dependent variables, but with a twist: the independent variable is now time Time Series Forecasting 648
  • 649. • There is no difference between a predictor and a target • The predictor is also the target variable • Data Driven Methods: • Naïve Forecast • Simple Average • Moving Average • Weighted Moving Average • Exponential Smoothing • Holt’s Two-Parameter Exponential Smoothing Data Driven Methods 649
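Two of the data-driven methods listed above can be sketched in a few lines (a minimal illustration on made-up demand figures):

```python
def moving_average_forecast(series, w):
    """Data-driven forecast: the next value is the mean of the last w
    observations; the predictor and the target are the same series."""
    window = series[-w:]
    return sum(window) / len(window)

def exponential_smoothing(series, alpha):
    """Single exponential smoothing: s_t = alpha*y_t + (1-alpha)*s_{t-1};
    the final smoothed value serves as the one-step-ahead forecast."""
    s = series[0]
    for y in series[1:]:
        s = alpha * y + (1 - alpha) * s
    return s

demand = [10, 12, 11, 13, 12, 14]          # hypothetical series
ma = moving_average_forecast(demand, 3)    # (13 + 12 + 14) / 3 = 13.0
es = exponential_smoothing(demand, 0.5)
```

Both forecasts use nothing but the series itself, which is exactly what distinguishes the data-driven family from the model-driven one described next.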
  • 650. • In model-driven methods, time is the predictor or independent variable and the time series value is the dependent variable • Model-based methods are generally preferable when the time series appears to have a “global” pattern • The idea is that the model parameters will be able to capture these patterns • Thus enable us to make predictions for any step ahead in the future under the assumption that this pattern is going to repeat • For a time series with local patterns instead of a global pattern, using the model-driven approach requires specifying how and when the patterns change, which is difficult Model Driven Methods 650
  • 651. • Linear Regression • Polynomial Regression • Linear Regression with Seasonality • Autoregression Models and ARIMA Model Driven Methods 651
  • 652. • RapidMiner’s approach to time series is based on two main data transformation processes • The first is windowing to transform the time series data into a generic data set: • This step will convert the last row of a window within the time series into a label or target variable • We apply any of the “learners” or algorithms to predict the target variable and thus predict the next time step in the series How to Implement 652
  • 653. • The parameters of the Windowing operator allow changing the size of the windows, the overlap between consecutive windows (step size), and the prediction horizon, which is used for forecasting • The prediction horizon controls which row in the raw data series ends up as the label variable in the transformed series Windowing Concept 653
  • 655. • Window size: Determines how many “attributes” are created for the cross-sectional data • Each row of the original time series within the window width will become a new attribute • We choose w = 6 • Step size: Determines how to advance the window • Let us use s = 1 • Horizon: Determines how far out to make the forecast • If the window size is 6 and the horizon is 1, then the seventh row of the original time series becomes the first sample for the “label” variable • Let us use h = 1 Windowing Operator Parameters 655
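The windowing transform with these parameters can be sketched as follows (a minimal re-implementation for illustration, not RapidMiner's operator):

```python
def window(series, w, s, h):
    """Turn a time series into cross-sectional rows: w past values become
    attributes, the value h steps after the window becomes the label,
    and the window advances by step size s."""
    rows = []
    i = 0
    while i + w + h <= len(series):
        attrs = series[i:i + w]
        label = series[i + w + h - 1]   # h steps past the window's end
        rows.append((attrs, label))
        i += s
    return rows

series = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
rows = window(series, w=6, s=1, h=1)
# first row: attributes [1, 2, 3, 4, 5, 6], label 7 (the seventh value)
```

As stated above, with w = 6 and h = 1 the seventh value of the series becomes the first label; any regression learner can then be trained on these rows.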
• 656. • Train a linear regression model on the hargasaham-training-uni.xls dataset • Use Split Data to split the dataset: 90% for training and 10% for testing • The Windowing process must be applied to the dataset • Plot the label against the prediction using a chart Exercise 656
• 658. • Train a linear regression model on the hargasaham-training.xls dataset • Apply the resulting model to the hargasaham-testing-kosong.xls data • The Windowing process must be applied to the dataset • Plot the label against the prediction using a chart Exercise 658
  • 660. 5. Text Mining 5.1 Text Mining Concepts 5.2 Text Clustering 5.3 Text Classification 5.4 Data Mining Law 660
  • 661. 5.1 Text Mining Concepts 661
• 662. 1. Text Mining: • Processes unstructured data in the form of text, web pages, social media, etc. • Uses text processing methods to convert unstructured data into structured data • The result is then processed with data mining 2. Data Mining: • Processes structured data in the form of tables with attributes and a class • Uses data mining methods, divided into estimation, forecasting, classification, clustering and association • Whose reasoning rests on statistics or machine-learning-style heuristics Data Mining vs Text Mining 662
  • 663. • The fundamental step is to convert text into semi-structured data • Then apply the data mining methods to classify, cluster, and predict How Text Mining Works 663 Text Processing
• 664. Text Mining: Traces of Pornography in Indonesia 664
• 666. 1. Dataset (understand and prepare the data) 2. Data mining method (choose a method that fits the data) 3. Knowledge (understand the resulting model and knowledge) 4. Evaluation (analyze the model and the method’s performance) Data Mining Process 666 DATA PREPROCESSING: Data Cleaning, Data Integration, Data Reduction, Data Transformation, Text Processing MODELING: Estimation, Prediction, Classification, Clustering, Association MODEL: Formula, Tree, Cluster, Rule, Correlation PERFORMANCE: Accuracy, Error Rate, Number of Clusters MODEL: Attributes/Factors, Correlation, Weights
  • 667. • Words are separated by a special character: a blank space • Each word is called a token • The process of discretizing words within a document is called tokenization • For our purpose here, each sentence can be considered a separate document, although what is considered an individual document may depend upon the context • For now, a document here is simply a sequential collection of tokens Word, Token and Tokenization 667
  • 668. • We can impose some form of structure on this raw data by creating a matrix, where: • the columns consist of all the tokens found in the two documents • the cells of the matrix are the counts of the number of times a token appears • Each token is now an attribute in standard data mining parlance and each document is an example Matrix of Terms 668
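The tokenize-and-count step described above can be sketched directly (two tiny made-up documents for illustration):

```python
from collections import Counter

docs = ["this is a simple document",
        "this document is another simple document"]

# Tokenize on whitespace; each distinct token becomes a column (attribute)
tokenized = [d.split() for d in docs]
vocab = sorted(set(tok for doc in tokenized for tok in doc))

# Term-document matrix: one row per document,
# each cell counts how often the token occurs in that document
tdm = [[Counter(doc)[term] for term in vocab] for doc in tokenized]
```

Each row of `tdm` is one document (an example), each column one token (an attribute), exactly the structure the next slide names a term document matrix.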
  • 669. • Basically, unstructured raw data is now transformed into a format that is recognized, not only by the human users as a data table, but more importantly by all the machine learning algorithms which require such tables for training • This table is called a document vector or term document matrix (TDM) and is the cornerstone of the preprocessing required for text mining Term Document Matrix (TDM) 669
  • 670. • We could have also chosen to use the TF–IDF scores for each term to create the document vector • N is the number of documents that we are trying to mine • Nk is the number of documents that contain the keyword, k TF–IDF 670
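One common TF-IDF variant (there are several weighting schemes; this sketch uses normalized term frequency and a base-2 log of N/Nk, on a tiny made-up corpus):

```python
import math

docs = [["data", "mining", "rocks"],
        ["text", "mining", "works"],
        ["data", "rules"]]
N = len(docs)   # number of documents being mined

def tf_idf(term, doc):
    tf = doc.count(term) / len(doc)          # normalized term frequency
    Nk = sum(1 for d in docs if term in d)   # documents containing the term
    return tf * math.log2(N / Nk)            # IDF: rarer terms weigh more

w_data = tf_idf("data", docs[0])     # "data" appears in 2 of 3 documents
w_rocks = tf_idf("rocks", docs[0])   # "rocks" in only 1 -> higher weight
```

As intended, a term confined to one document outweighs one spread across the corpus, and a term present in every document would score 0.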
• 671. • The two sample text documents contained common words such as “a,” “this,” “and,” and other similar terms • Clearly in larger documents we would expect a larger number of such terms that do not really convey specific meaning • Most grammatical necessities such as articles, conjunctions, prepositions, and pronouns may need to be filtered before we perform additional analysis • Such terms are called stopwords and usually include most articles, conjunctions, pronouns, and prepositions • Stopword filtering is usually the second step that follows immediately after tokenization • Notice that our document vector has a significantly reduced size after applying standard English stopword filtering Stopwords 671
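Stopword filtering itself is a one-liner once a list is available (the list below is a tiny illustrative subset; real stopword lists contain hundreds of entries):

```python
# A tiny illustrative stopword list (real lists are much longer)
stopwords = {"a", "this", "and", "is", "the", "of"}

tokens = ["this", "is", "a", "structured", "view", "of", "the", "data"]

# Keep only tokens that are not stopwords
filtered = [t for t in tokens if t not in stopwords]
```

The surviving tokens ("structured", "view", "data") are the ones that carry meaning, which is why the document vector shrinks so much after this step.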
• 672. • Google the keyword: stopwords bahasa Indonesia • Download an Indonesian stopword list and use it in RapidMiner Indonesian Stopwords 672
  • 673. • Words such as “recognized,” “recognizable,” or “recognition” in different usages, but contextually they may all imply the same meaning, for example: • “Einstein is a well-recognized name in physics” • “The physicist went by the easily recognizable name of Einstein” • “Few other physicists have the kind of name recognition that Einstein has” • The so-called root of all these highlighted words is “recognize” • By reducing terms in a document to their basic stems, we can simplify the conversion of unstructured text to structured data because we now only take into account the occurrence of the root terms • This process is called stemming. The most common stemming technique for text mining in English is the Porter method (Porter, 1980) Stemming 673
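A toy suffix-stripping sketch makes the idea concrete; this is NOT the full Porter algorithm (which applies multiple rule phases with measure conditions), just a naive illustration of collapsing inflected forms toward a shared stem:

```python
def naive_stem(word):
    """Strip one of a few common English suffixes, leaving at least
    a 3-letter stem. A toy sketch, not the Porter stemmer."""
    for suffix in ("ization", "izable", "izing", "ized", "ies", "ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

words = ["recognized", "recognizable", "recognizing"]
stems = [naive_stem(w) for w in words]   # all reduce to "recogn"
```

All three forms from the example sentences reduce to the same stem, so after stemming they count as occurrences of one attribute rather than three.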
  • 674. A Typical Sequence of Preprocessing Steps to Use in Text Mining 674
  • 675. • There are families of words in the spoken and written language that typically go together • The word “Good” is usually followed by either “Morning,” “Afternoon,” “Evening,” “Night,” or in Australia, “Day” • Grouping such terms, called n-grams, and analyzing them statistically can present new insights • Search engines use word n-gram models for a variety of applications, such as: • Automatic translation, identifying speech patterns, checking misspelling, entity detection, information extraction, among many different use cases N-Grams 675
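Generating n-grams is just sliding a window of n tokens over the document (a minimal sketch):

```python
def ngrams(tokens, n):
    """Slide a window of size n over the token list,
    yielding every run of n consecutive tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

bigrams = ngrams(["good", "morning", "to", "you"], 2)
# [('good', 'morning'), ('morning', 'to'), ('to', 'you')]
```

Counting such bigrams instead of single tokens lets "good morning" be treated as one unit, which is how the groupings described above become attributes in the document vector.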
• 676. RapidMiner Process of Text Mining 676
• 678. • Do the experiment following Matthew North (Data Mining for the Masses) Chapter 12 (Text Mining), 2012, pp. 189-215 • Datasets: Federalist Papers • Understand the text mining workflow used and relate it to the concepts already covered Exercise 678
• 679. • Motivation: • Gillian is a historian, and she has recently curated an exhibit on the Federalist Papers, the essays that were written and published in the late 1700s • The essays were published anonymously under the author name ‘Publius’, and no one really knew at the time if ‘Publius’ was one individual or many • After Alexander Hamilton died in 1804, some notes were discovered that revealed that he (Hamilton), James Madison and John Jay had been the authors of the papers • The notes indicated specific authors for some papers, but not for others: • John Jay was revealed to be the author for papers 3, 4 and 5 • James Madison for paper 14 • Hamilton for paper 17 • Paper 18 had no author named, but there was evidence that Hamilton and Madison worked on that one together • Objective: • Gillian would like to analyze paper 18’s content in the context of the other papers with known authors, to see if she can generate some evidence that the suspected collaboration between Hamilton and Madison did in fact occur 1. Business Understanding 679
• 680. • The Federalist Papers are available through a number of sources: • They have been re-published in book form, they are available on a number of different web sites, and their text is archived in many libraries throughout the world • Gillian’s data set is simple (6 datasets): • Federalist03_Jay.txt • Federalist04_Jay.txt • Federalist05_Jay.txt • Federalist14_Madison.txt • Federalist17_Hamilton.txt • Federalist18_Collaboration.txt (suspected) 2. Data Understanding 680
  • 681. Text Processing Extension Installation Modeling 681
• 684. • Gillian feels confident that paper 18 is a collaboration that John Jay did not contribute to • His vocabulary and grammatical structure were quite different from those of Hamilton and Madison Evaluation 684
• 685. • Do the experiment following Vijay Kotu (Predictive Analytics and Data Mining) Chapter 9 (Text Mining), Case Study 1: Keyword Clustering, pp. 284-287 • Datasets (file pages.txt): 1. https://www.cnnindonesia.com/olahraga 2. https://www.cnnindonesia.com/ekonomi • Use the Indonesian stopword list (in the dataset folder) with the Stopword (Dictionary) operator and choose the file stopword-indonesia.txt • To make things easier, copy the file 09_Text_9.3.1_keyword_clustering_webmining.rmp into the Repository and then open it in RapidMiner • Select the file pages.txt containing the URLs in Read URL Exercise 685
  • 687. Testing Model (Read Document) 687
  • 688. Testing Model (Get Page) 688
• 690. • Using the concepts and techniques you have mastered, perform text classification on the polarity data - small dataset • Use the Decision Tree algorithm to build the model • Take one article from the polaritydata - small - testing folder, e.g. from the pos folder, and test whether the article is predicted as negative or positive sentiment Exercise 690
• 693. Measure the Accuracy on polaritydata-small-testing 693
• 694. • Using the concepts and techniques you have mastered, perform text classification on the polarity data dataset • Apply several feature selection methods, both filter and wrapper • Compare various classification algorithms and choose the best one Exercise 694
• 695. • Do the experiment following Vijay Kotu (Predictive Analytics and Data Mining) Chapter 9 (Text Mining), Case Study 2: Predicting the Gender of Blog Authors, pp. 287-301 • Dataset: blog-gender-dataset.xlsx • Split Data: 50% training data and 50% testing data • Use the Naïve Bayes algorithm • Apply the resulting model to the testing data • Measure its performance Exercise 695
• 697. • Do the experiment following Vijay Kotu (Predictive Analytics and Data Mining) Chapter 9 (Text Mining), Case Study 2: Predicting the Gender of Blog Authors, pp. 287-301 • Datasets: • blog-gender-dataset.xlsx • blog-gender-dataset-testing.xlsx • Use 10-fold cross validation and the Write Model (Read Model), Store (Retrieve) operators Exercise 697
• 702. 1. Explain the difference between data, information, and knowledge! 2. Explain what you know about data mining! 3. Name the main roles of data mining! 4. Name applications of data mining in various fields! 5. What knowledge or patterns can we obtain from the data below? Post-Test 702 NIM / Gender / National Exam Score / School of Origin / IPS1 / IPS2 / IPS3 / IPS4 / ... / Graduated On Time: 10001 / M / 28 / SMAN 2 / 3.3 / 3.6 / 2.89 / 2.9 / Yes | 10002 / F / 27 / SMAN 7 / 4.0 / 3.2 / 3.8 / 3.7 / No | 10003 / F / 24 / SMAN 1 / 2.7 / 3.4 / 4.0 / 3.5 / No | 10004 / M / 26.4 / SMAN 3 / 3.2 / 2.7 / 3.6 / 3.4 / Yes | ... | 11000 / M / 23.4 / SMAN 5 / 3.3 / 2.8 / 3.1 / 3.2 / Yes
  • 703. 5.4 Data Mining Laws Tom Khabaza, Nine Laws of Data Mining, 2010 (http://khabaza.codimension.net/index_files/9laws.htm) 703
  • 704. 1. Business objectives are the origin of every data mining solution 2. Business knowledge is central to every step of the data mining process 3. Data preparation is more than half of every data mining process 4. There is no free lunch for the data miner 5. There are always patterns 6. Data mining amplifies perception in the business domain 7. Prediction increases information locally by generalisation 8. The value of data mining results is not determined by the accuracy or stability of predictive models 9. All patterns are subject to change Data Mining Laws 704 Tom Khabaza, Nine Laws of Data Mining, 2010 (http://khabaza.codimension.net/index_files/9laws.htm)
  • 705. Business objectives are the origin of every data mining solution • This defines the field of data mining: data mining is concerned with solving business problems and achieving business goals • Data mining is not primarily a technology; it is a process, which has one or more business objectives at its heart • Without a business objective, there is no data mining • The maxim: “Data Mining is a Business Process” 1 Business Goals Law 705
  • 706. Business knowledge is central to every step of the data mining process • A naive reading of CRISP-DM would see business knowledge used at the start of the process in defining goals, and at the end of the process in guiding deployment of results • This would be to miss a key property of the data mining process, that business knowledge has a central role in every step 2 Business Knowledge Law 706
  • 707. 1. Business understanding must be based on business knowledge, and so must the mapping of business objectives to data mining goals 2. Data understanding uses business knowledge to understand which data is related to the business problem, and how it is related 3. Data preparation means using business knowledge to shape the data so that the required business questions can be asked and answered 4. Modelling means using data mining algorithms to create predictive models and interpreting both the models and their behaviour in business terms – that is, understanding their business relevance 5. Evaluation means understanding the business impact of using the models 6. Deployment means putting the data mining results to work in a business process 2 Business Knowledge Law 707
  • 708. Data preparation is more than half of every data mining process • Maxim of data mining: most of the effort in a data mining project is spent in data acquisition and preparation, and informal estimates vary from 50 to 80 percent • The purpose of data preparation is: 1. To put the data into a form in which the data mining question can be asked 2. To make it easier for the analytical techniques (such as data mining algorithms) to answer it 3 Data Preparation Law 708
  • 709. There is No Free Lunch for the Data Miner (NFL-DM) The right model for a given application can only be discovered by experiment • Axiom of machine learning: if we knew enough about a problem space, we could choose or design an algorithm to find optimal solutions in that problem space with maximal efficiency • Arguments for the superiority of one algorithm over others in data mining rest on the idea that data mining problem spaces have one particular set of properties, or that these properties can be discovered by analysis and built into the algorithm • However, these views arise from the erroneous idea that, in data mining, the data miner formulates the problem and the algorithm finds the solution • In fact, the data miner both formulates the problem and finds the solution – the algorithm is merely a tool which the data miner uses to assist with certain steps in this process 4 No Free Lunch Theory 709
  • 710. • If the problem space were well-understood, the data mining process would not be needed • Data mining is the process of searching for as yet unknown connections • For a given application, there is not only one problem space • Different models may be used to solve different parts of the problem • The way in which the problem is decomposed is itself often the result of data mining and not known before the process begins • The data miner manipulates, or “shapes”, the problem space by data preparation, so that the grounds for evaluating a model are constantly shifting • There is no technical measure of value for a predictive model • The business objective itself undergoes revision and development during the data mining process • so that the appropriate data mining goals may change completely 4 No Free Lunch Theory 710
  • 711. There are always patterns • This law was first stated by David Watkins • There is always something interesting to be found in a business-relevant dataset, so that even if the expected patterns were not found, something else useful would be found • A data mining project would not be undertaken unless business experts expected that patterns would be present, and it should not be surprising that the experts are usually right 5 Watkins’ Law 711
  • 712. Data mining amplifies perception in the business domain • How does data mining produce insight? This law approaches the heart of data mining – why it must be a business process and not a technical one • Business problems are solved by people, not by algorithms • The data miner and the business expert “see” the solution to a problem, that is the patterns in the domain that allow the business objective to be achieved • Thus data mining is, or assists as part of, a perceptual process • Data mining algorithms reveal patterns that are not normally visible to human perception • The data mining process integrates these algorithms with the normal human perceptual process, which is active in nature • Within the data mining process, the human problem solver interprets the results of data mining algorithms and integrates them into their business understanding 6 Insight Law 712
  • 713. Prediction increases information locally by generalisation • “Predictive models” and “predictive analytics” mean “predict the most likely outcome” • Other kinds of data mining models, such as clustering and association, are also characterised as “predictive”; this is a much looser sense of the term: • A clustering model might be described as “predicting” the group into which an individual falls • An association model might be described as “predicting” one or more attributes on the basis of those that are known • What is “prediction” in this sense? What do classification, regression, clustering and association algorithms and their resultant models have in common? • The answer lies in “scoring”, that is the application of a predictive model to a new example • The available information about the example in question has been increased, locally, on the basis of the patterns found by the algorithm and embodied in the model, that is on the basis of generalisation or induction 7 Prediction Law 713
  • 714. The value of data mining results is not determined by the accuracy or stability of predictive models • Accuracy and stability are useful measures of how well a predictive model makes its predictions • Accuracy means how often the predictions are correct • Stability means how much the predictions would change if the data used to create the model were a different sample from the same population • The value of a predictive model arises in two ways: • The model’s predictions drive improved (more effective) action • The model delivers insight (new knowledge) which leads to improved strategy 8 Value Law 714
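The two measures this law names can be made concrete. As a hedged sketch (built-in data, a simple agreement-based notion of stability, which is one common way to estimate it): accuracy is how often predictions match the truth, and stability is how much the predictions change when the model is trained on a different sample from the same population, here approximated with two bootstrap resamples.

```python
# Sketch: estimating accuracy and stability. Stability is measured
# as the agreement between two models trained on different bootstrap
# samples of the same training population (one possible definition).
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

preds = []
for seed in (1, 2):
    Xb, yb = resample(X_tr, y_tr, random_state=seed)  # a different sample
    m = DecisionTreeClassifier(random_state=0).fit(Xb, yb)
    preds.append(m.predict(X_te))

accuracy = (preds[0] == y_te).mean()      # how often predictions are correct
stability = (preds[0] == preds[1]).mean() # how much they agree across samples
print(round(accuracy, 3), round(stability, 3))
```

The law's point is that neither number is the model's business value; they only describe the model, while value comes from the actions and insight it enables.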
  • 715. All patterns are subject to change • The patterns discovered by data mining do not last forever • In marketing and CRM applications of data mining, it is well-understood that patterns of customer behaviour are subject to change over time • Fashions change, markets and competition change, and the economy changes as a whole; for all these reasons, predictive models become out-of-date and should be refreshed regularly or when they cease to predict accurately • The same is true in risk and fraud-related applications of data mining. Patterns of fraud change with a changing environment and because criminals change their behaviour in order to stay ahead of crime prevention efforts 9 Law of Change 715
  • 716. • Analyse the problems and needs that exist in an organization in your environment • Collect and review the available datasets, and connect those problems and needs to the available data (analyse against the 5 roles of data mining) • Where possible, choose several roles at once to process the data, for example: perform association (factor analysis) together with estimation or clustering • Carry out the CRISP-DM process to solve the organization's problem with the data obtained • In the data preparation step, perform data cleaning (replace missing values, replace, filter attributes) so that the data is ready for modelling • Also compare algorithms and apply feature selection to choose the best pattern and model • Summarize the evaluation of the resulting patterns/models/knowledge and relate the evaluation results to the deployment performed • Summarize everything in slides, following the example of Sarah's case study for the marketing department Assignment: Solving an Organizational Problem 716
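The data-preparation and model-comparison steps of this assignment can be sketched in Python, assuming scikit-learn in place of RapidMiner and a built-in dataset (with missing values injected artificially) in place of your organization's data:

```python
# Sketch of the assignment's pipeline: replace missing values, then
# compare two algorithms with 10-fold cross-validation and keep the best.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
rng = np.random.default_rng(0)
X[rng.random(X.shape) < 0.1] = np.nan   # simulate missing values

results = {}
for name, clf in [("decision tree", DecisionTreeClassifier(random_state=0)),
                  ("naive bayes", GaussianNB())]:
    # Data cleaning (replace missing values with the column mean),
    # then modelling, evaluated inside each cross-validation fold
    pipe = make_pipeline(SimpleImputer(strategy="mean"), clf)
    results[name] = cross_val_score(pipe, X, y, cv=10).mean()

best = max(results, key=results.get)
print(best, round(results[best], 3))
```

Keeping the imputer inside the pipeline matters: the replacement means are then computed from the training folds only, so the comparison between algorithms stays honest.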
  • 717. Organizational Case Studies 717
  Organization | Problem | Goal | Dataset
  KPK | Difficulty identifying corruptor profiles; non-compliance of mandatory reporters (WL) with LHKPN filings | Classification of corruptor profiles; association of corruptor attributes; classification of LHKPN compliance; estimation of sentencing demands | LHKPN; prosecution data
  BSM | Difficulty identifying which factors influence financing quality | Classification of customer profile quality | Customer financing data
  LKPP | Large volume of consultations and questions from various agencies that must be answered | Association of agency question patterns; classification of question types | Consultation data
  BPPK | Difficulty handling tweets from the public: are they questions, complaints or suggestions | Classification and clustering (text mining) of complaints, questions or suggestions on social media | Public Twitter data
  Universitas Siliwangi | On-time graduation rate is not yet optimal (is it due to the study-program factor | Classification of student graduation data | Student data
  • 718. Organizational Case Studies 718
  Organization | Problem | Goal | Dataset
  Kemenkeu (DJPB) | Difficulty determining refinement factors for performance indicators | 1. How closely the components relate to the refinement potential 2. Clustering of organizational performance data | Organizational performance data
  Kemenkeu (DJPB) | Difficulty determining the direction of ministry audit opinions | 1. Examine the relationship of several data attributes to the opinion 2. Classification of ministry profiles | Ministry profile data
  Kemenkeu (DJPB) | Large volume of regional-office (kanwil) reports with many attributes to analyse | 1. Examine the relationship of several report indicators to accuracy 2. Clustering of regional-office report data 3. Classification of regional-office report accuracy | Regional-office report data
  Kemenkeu (DJPB) | Difficulty setting priorities for regional-office monitoring | 1. Clustering of regional-office profile data 2. Examine the relationship of several attributes to the profile clusters | Regional-office transaction and profile data
  • 719. Organizational Case Studies 719
  Organization | Problem | Goal | Dataset
  Kemenkeu (SDM) | Reward and punishment policies for employees are often ineffective | Classification of profiles of frequently late vs. disciplined employees, for earlier detection | Employee data
  Kemenkeu (SDM) | Only 15% of echelon 4/3/2/1 positions are held by women, even though the gender ratio at civil-service entry is nearly balanced | Classification and clustering of echelon 4/3/2/1 official profiles; association of positions with employee profile attributes | Employee data
  Bank Indonesia | Growing circulation of counterfeit money in Indonesia | Association of counterfeit-money volume with regional profiles of Indonesia; clustering of counterfeit-money circulation regions | Counterfeit-money circulation data
  Adira Finance | Rising non-performing loan ratio | Classification of performing vs. non-performing borrower quality; forecasting the volume of non-performing loans; relationship strength of loan default with various attributes | Borrower data
  • 720. Organizational Case Studies 720
  Organization | Problem | Goal | Dataset
  Kemsos | Complex parameters for determining household poverty levels in Indonesia | Classification of poor-household profiles in Brebes regency | Poor households in Brebes regency
  Kemsos | Difficulty determining which households should be prioritized for social assistance | Clustering of profiles of poor households that have not yet received assistance | Poor households in Belitung regency
  Kemsos | Difficulty determining which chronic diseases should be prioritized for the health-insurance contribution assistance program (PBIJK) | Classification of chronic diseases suffered by members of poor households | Household members in Belitung regency
  Kemsos | Difficulty identifying poor households in Indonesia | |
  • 721. Organizational Case Studies 721
  Organization | Problem | Goal | Dataset
  Kemsos | Complex parameters for determining household poverty levels in Indonesia | Clustering of poor-household profiles in Belitung regency | Integrated Social Welfare Data (DTKS), Belitung regency
  Kemsos | Family Hope Program (PKH) assistance reaching the wrong recipients | Classification of the main attributes influencing PKH recipients | Integrated Social Welfare Data (DTKS), Belitung regency
  Kemsos | Policy-making for recipients of the uninhabitable-housing rehabilitation program | Clustering of house specifications in the household records of the Integrated Social Welfare Data (DTKS) | Integrated Social Welfare Data (DTKS), West Seram regency
  Kemsos | Many social-assistance recipients are incorrectly targeted | Clustering of poor-household profiles from the Integrated Social Welfare Data (DTKS) | Integrated Social Welfare Data (DTKS)