Datamining-lect2 - What is Data_ the Data Mining Pipeline. Preprocessing and Postprocessing. Sampling and Normalization
LECTURE 2
What is data?
The data mining pipeline
What is Data Mining?
• Data mining is the use of efficient techniques for the analysis of
very large collections of data and the extraction of useful and
possibly unexpected patterns in data.
some values. 10
10 No Single 90K Yes
2342345 0 1 0 25 0 1 0 30000 1
1234542 1 0 0 45 0 0 1 200000 0
1243535 0 0 1 43 0 0 0 150000 0
TID Items
1 Bread, Coke, Milk
2 Beer, Bread
3 Beer, Coke, Diaper, Milk
4 Beer, Bread, Diaper, Milk
5 Coke, Diaper, Milk
Set data
• Each record is a set of items from a space of
possible items
• Example: Document data
• Also called bag-of-words representation
Doc Id Words
1 the, dog, followed, the, cat
2 the, cat, chased, the, cat
3 the, man, walked, the, dog
Vector representation of market-basket data
• Market-basket data can be represented, or thought
of, as numeric vector data
• The vector is defined over the set of all possible items
• The values are binary (the item appears or not in the set)
TID  Items                        Bread  Coke  Milk  Beer  Diaper
1    Bread, Coke, Milk            1      1     1     0     0
2    Beer, Bread                  1      0     0     1     0
3    Beer, Coke, Diaper, Milk     0      1     1     1     1
4    Beer, Bread, Diaper, Milk    1      0     1     1     1
5    Coke, Diaper, Milk           0      1     1     0     1
Sparsity: Most entries are zero. Most baskets contain few items
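The conversion from sets to binary vectors can be sketched in a few lines of Python. This is a minimal sketch, not part of the original slides: the function name `to_binary_vector` is illustrative, and the column order (Bread, Coke, Milk, Beer, Diaper) is simply one fixed ordering of the item space.

```python
# Transactions from the market-basket example.
baskets = {
    1: {"Bread", "Coke", "Milk"},
    2: {"Beer", "Bread"},
    3: {"Beer", "Coke", "Diaper", "Milk"},
    4: {"Beer", "Bread", "Diaper", "Milk"},
    5: {"Coke", "Diaper", "Milk"},
}

# Fix a column order over the space of all possible items.
items = ["Bread", "Coke", "Milk", "Beer", "Diaper"]


def to_binary_vector(basket, items):
    """Return a 0/1 list: 1 if the item appears in the basket, else 0."""
    return [1 if item in basket else 0 for item in items]


vectors = {tid: to_binary_vector(b, items) for tid, b in baskets.items()}

for tid, vec in vectors.items():
    print(tid, vec)
```

With a realistic item space (thousands of products) almost every entry would be 0, which is why the set form is usually what gets stored.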
Vector representation of document data
• Document data can be represented, or thought of,
as numeric vector data
• The vector is defined over the set of all possible words
• The values are the counts (number of times a word
appears in the document)
Doc Id  Words                         the  dog  follows  cat  chases  man  walks
1       the, dog, follows, the, cat   2    1    1        1    0       0    0
2       the, cat, chases, the, cat    2    0    0        2    1       0    0
3       the, man, walks, the, dog     2    1    0        0    0       1    1
Sparsity: Most entries are zero. Most documents contain few of the words
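A count vector per document can be built with `collections.Counter` from the Python standard library. This is a minimal sketch: the helper name `to_count_vector` and the vocabulary order are illustrative choices, not something prescribed by the lecture.

```python
from collections import Counter

# The three example documents, already split into words.
docs = {
    1: ["the", "dog", "follows", "the", "cat"],
    2: ["the", "cat", "chases", "the", "cat"],
    3: ["the", "man", "walks", "the", "dog"],
}

# One vector coordinate per word in the vocabulary.
vocabulary = ["the", "dog", "follows", "cat", "chases", "man", "walks"]


def to_count_vector(words, vocabulary):
    """Return the number of times each vocabulary word occurs in the document."""
    counts = Counter(words)
    return [counts[w] for w in vocabulary]


count_vectors = {doc_id: to_count_vector(words, vocabulary)
                 for doc_id, words in docs.items()}

for doc_id, vec in count_vectors.items():
    print(doc_id, vec)
```

Note that "the" occurs twice in every one of the three documents, so its count is 2 in each vector.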
Physical data storage
• Usually set data is stored in flat files
• One line per set
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29
30 31 32
33 34 35
36 37 38 39 40 41 42 43 44 45 46
38 39 47 48
38 39 48 49 50 51 52 53 54 55 56 57 58
32 41 59 60 61 62
3 39 48
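Reading such a flat file back into sets is straightforward. The sketch below assumes item ids are integers separated by whitespace, one set per line; `io.StringIO` stands in for a real file, and `read_sets` is a hypothetical helper name.

```python
import io

# A few lines in the one-set-per-line format shown above.
# io.StringIO stands in for an open file.
sample = io.StringIO("30 31 32\n33 34 35\n38 39 47 48\n3 39 48\n")


def read_sets(lines):
    """Parse each non-empty line into a set of integer item ids."""
    return [set(map(int, line.split())) for line in lines if line.strip()]


transactions = read_sets(sample)
print(transactions)
```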
• I heard so many good things about this place so I was pretty juiced to try it. I'm
from Cali and I heard Shake Shack is comparable to IN-N-OUT and I gotta say, Shake
Shake wins hands down. Surprisingly, the line was short and we waited about 10 MIN.
to order. I ordered a regular cheeseburger, fries and a black/white shake. So
yummerz. I love the location too! It's in the middle of the city and the view is
breathtaking. Definitely one of my favorite places to eat in NYC.
• I'm from California and I must say, Shake Shack is better than IN-N-OUT, all day,
err'day.
Ordered Data
• Genomic sequence data
GGTTCCGCCTTCAGCCCCGCGCC
CGCAGGGCCCGCCCCGCGCCGTC
GAGAAGGGCCCGCCTGGCGGGCG
GGGGGAGGCGGGGCCGCCCGAGC
CCAACCGAGTCCGACCAGGTGCC
CCCTCTGCTCGGCCTAGACCTGA
GCTCATTAGGCGGCAGCGGACAG
GCCAAGTAGAACACGCGAAGCGC
TGGGCTGCCTGCTGCGACCAGGG
Representation
• Adjacency matrix
• Very sparse, very wasteful, but useful conceptually
A =
0 1 1 0 0
1 0 0 0 0
0 1 0 1 0
0 0 0 0 1
0 0 0 0 0
Representation
• Adjacency list
• Not so easy to maintain
1: [2, 3]
2: [1, 3]
3: [1, 2, 4]
4: [3, 5]
5: [4]
Representation
• List of pairs
• The simplest and most efficient representation
(1,2)
(2,3)
(1,3)
(3,4)
(4,5)
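The representations can be converted into one another. Below is a minimal sketch, assuming the list of pairs describes an undirected graph on nodes 1 to 5; the function names are illustrative and not from the lecture.

```python
# The graph as a list of pairs (edges), nodes numbered 1..n.
edges = [(1, 2), (2, 3), (1, 3), (3, 4), (4, 5)]
n = 5


def to_adjacency_list(edges, n):
    """Map each node to the sorted list of its neighbours (undirected)."""
    adj = {v: [] for v in range(1, n + 1)}
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    return {v: sorted(nbrs) for v, nbrs in adj.items()}


def to_adjacency_matrix(edges, n):
    """Build an n x n 0/1 matrix; entry [u-1][v-1] is 1 if u and v are linked."""
    matrix = [[0] * n for _ in range(n)]
    for u, v in edges:
        matrix[u - 1][v - 1] = 1
        matrix[v - 1][u - 1] = 1
    return matrix


adjacency_list = to_adjacency_list(edges, n)
adjacency_matrix = to_adjacency_matrix(edges, n)

print(adjacency_list)
for row in adjacency_matrix:
    print(row)
```

Under the undirected assumption the adjacency list agrees with the one shown on the adjacency-list slide. The matrix produced here is symmetric, unlike the matrix A on the adjacency-matrix slide, which appears to describe a directed graph.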
Types of data: summary
• Numeric data: Each object is a point in a
multidimensional space
• Categorical data: Each object is a vector of
categorical values
• Set data: Each object is a set of values (with or
without counts)
• Sets can also be represented as binary vectors, or
vectors of counts
• Ordered sequences: Each object is an ordered
sequence of values.
• Graph data: A collection of pairwise relationships
The data analysis pipeline
Mining is not the only step in the analysis process
Data Collection → Data Preprocessing → Data Mining → Result Post-processing
• Reservoir Sampling:
• Standard interview question for many companies
Reservoir sampling
• Algorithm: With probability 1/k select the k-th item
of the stream and replace the previous choice.
• Proof
• What is the probability of the k-th item to be selected?
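For a stream of n items, the k-th item is selected when it arrives with probability 1/k, and it survives only if none of the items k+1, ..., n replaces it, which happens with probability (k/(k+1)) · ((k+1)/(k+2)) · ... · ((n-1)/n) = k/n. The product is 1/n, so every item is equally likely to be the final choice. The sketch below implements the algorithm as stated and checks this empirically; the function name and the use of a seeded `random.Random` are illustrative assumptions.

```python
import random


def reservoir_sample(stream, rng):
    """Return one item chosen uniformly at random from an iterable of unknown length."""
    choice = None
    for k, item in enumerate(stream, start=1):
        # Select the k-th item with probability 1/k, replacing the previous choice.
        if rng.random() < 1.0 / k:
            choice = item
    return choice


# Empirical check on a short stream: each of the 5 items should be
# selected about 20% of the time.
rng = random.Random(0)  # fixed seed, only to make the run repeatable
draws = [reservoir_sample(range(5), rng) for _ in range(20000)]
freq = {v: draws.count(v) / len(draws) for v in range(5)}
print(freq)
```

The algorithm makes a single pass and stores only the current choice, so it works even when the length of the stream is not known in advance.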
• I'm from California and I must say, Shake Shack is better than IN-N-OUT, all day,
err'day.
• Would I pay $15+ for a burger here? No. But for the price point they are asking for,
this is a definite bang for your buck (though for some, the opportunity cost of
waiting in line might outweigh the cost savings) Thankfully, I came in before the
lunch swarm descended and I ordered a shake shack (the special burger with the patty
+ fried cheese & portabella topping) and a coffee milk shake. The beef patty was
very juicy and snugly packed within a soft potato roll. On the downside, I could do
without the fried portabella-thingy, as the crispy taste conflicted with the juicy,
tender burger. How does shake shack compare with in-and-out or 5-guys? I say a very
close tie, and I think it comes down to personal affliations. On the shake side, true
to its name, the shake was well churned and very thick and luscious. The coffee
flavor added a tangy taste and complemented the vanilla shake well. Situated in an
open space in NYC, the open air sitting allows you to munch on your burger while
watching people zoom by around the city. It's an oddly calming experience, or perhaps
it was the food coma I was slowly falling into. Great place with food at a great
price.
First cut
• Do simple processing to “normalize” the data (remove punctuation, make
into lower case, clear white spaces, other?)
• Break into words, keep the most popular words
the 27514 the 16710 the 16010 the 14241
and 14508 and 9139 and 9504 and 8237
i 13088 a 8583 i 7966 a 8182
a 12152 i 8415 to 6524 i 7001
to 10672 to 7003 a 6370 to 6727
of 8702 in 5363 it 5169 of 4874
ramen 8518 it 4606 of 5159 you 4515
was 8274 of 4365 is 4519 it 4308
is 6835 is 4340 sauce 4020 is 4016
it 6802 burger 432 in 3951 was 3791
in 6402 was 4070 this 3519 pastrami 3748
for 6145 for 3441 was 3453 in 3508
but 5254 but 3284 for 3327 for 3424
that 4540 shack 3278 you 3220 sandwich 2928
you 4366 shake 3172 that 2769 that 2728
with 4181 that 3005 but 2590 but 2715
pork 4115 you 2985 food 2497 on 2247
my 3841 my 2514 on 2350 this 2099
this 3487 line 2389 my 2311 my 2064
wait 3184 this 2242 cart 2236 with 2040
not 3016 fries 2240 chicken 2220 not 1655
we 2984 on 2204 with 2195 your 1622
at 2980 are 2142 rice 2049 so 1610
on 2922 with 2095 so 1825 have 1585
• Most frequent words are stop words
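The first cut can be sketched as follows. This is one possible normalization, not necessarily the one used to produce the counts in the lecture: it lowercases the text, replaces punctuation with spaces, and collapses whitespace. Keeping apostrophes inside words (so that "i'm" stays one token) is an assumption, as are the function names.

```python
import re
from collections import Counter


def normalize(text):
    """Lowercase, replace punctuation (except apostrophes) by spaces, collapse whitespace."""
    text = text.lower()
    text = re.sub(r"[^\w\s']", " ", text)
    return re.sub(r"\s+", " ", text).strip()


def top_words(reviews, k):
    """Return the k most frequent words over all reviews as (word, count) pairs."""
    counts = Counter()
    for review in reviews:
        counts.update(normalize(review).split())
    return counts.most_common(k)


# One of the example reviews.
review = ("I'm from California and I must say, Shake Shack is better than "
          "IN-N-OUT, all day, err'day.")
print(normalize(review))
print(top_words([review], 5))
```

Each choice here (what counts as punctuation, how hyphens and apostrophes are handled) changes the resulting word counts, so it is worth stating the choices explicitly.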
Second cut
• Remove stop words
• Stop-word lists can be found online.
a,about,above,after,again,against,all,am,an,and,any,are,aren't,as,at,be,
because,been,before,being,below,between,both,but,by,can't,cannot,could,
couldn't,did,didn't,do,does,doesn't,doing,don't,down,during,each,few,for,
from,further,had,hadn't,has,hasn't,have,haven't,having,he,he'd,he'll,he's,
her,here,here's,hers,herself,him,himself,his,how,how's,i,i'd,i'll,i'm,i've,
if,in,into,is,isn't,it,it's,its,itself,let's,me,more,most,mustn't,my,myself,
no,nor,not,of,off,on,once,only,or,other,ought,our,ours,ourselves,out,over,
own,same,shan't,she,she'd,she'll,she's,should,shouldn't,so,some,such,than,
that,that's,the,their,theirs,them,themselves,then,there,there's,these,they,
they'd,they'll,they're,they've,this,those,through,to,too,under,until,up,
very,was,wasn't,we,we'd,we'll,we're,we've,were,weren't,what,what's,when,
when's,where,where's,which,while,who,who's,whom,why,why's,with,won't,would,
wouldn't,you,you'd,you'll,you're,you've,your,yours,yourself,yourselves
Most popular words after removing stop words:
ramen 8572 burger 4340 sauce 4023 pastrami 3782
pork 4152 shack 3291 food 2507 sandwich 2934
wait 3195 shake 3221 cart 2239 place 1480
good 2867 line 2397 chicken 2238 good 1341
place 2361 fries 2260 rice 2052 get 1251
noodles 2279 good 1920 hot 1835 katz's 1223
ippudo 2261 burgers 1643 white 1782 just 1214
buns 2251 wait 1508 line 1755 like 1207
broth 2041 just 1412 good 1629 meat 1168
like 1902 cheese 1307 lamb 1422 one 1071
just 1896 like 1204 halal 1343 deli 984
get 1641 food 1175 just 1338 best 965
time 1613 get 1162 get 1332 go 961
one 1460 place 1159 one 1222 ticket 955
really 1437 one 1118 like 1096 food 896
go 1366 long 1013 place 1052 sandwiches 813
food 1296 go 995 go 965 can 812
bowl 1272 time 951 can 878 beef 768
can 1256 park 887 night 832 order 720
great 1172 can 860 time 794 pickles 699
best 1167 best 849 long 792 time 662
people 790
• Commonly used words in reviews, not so interesting
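Stop-word removal itself is a simple filter. The sketch below uses a small excerpt of the stop-word list (the full list is much longer) and assumes the text has already been lowercased and split into words; `remove_stop_words` is an illustrative name.

```python
# Small excerpt of the stop-word list; every entry appears in the full list.
STOP_WORDS = {
    "a", "all", "and", "from", "i", "i'm", "in", "is", "it",
    "my", "of", "out", "than", "the", "to", "was",
}


def remove_stop_words(words, stop_words=STOP_WORDS):
    """Keep only the words that are not in the stop-word set."""
    return [w for w in words if w not in stop_words]


# One of the example reviews, already normalized and split into words.
words = ("i'm from california and i must say shake shack is better than "
         "in n out all day err'day").split()
content_words = remove_stop_words(words)
print(content_words)
```

Storing the stop words in a set makes each membership test constant time, which matters when filtering a large collection of reviews.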
IDF
• Important words are the ones that are unique to the document
(differentiating) compared to the rest of the collection
• All reviews use the word “like”. This is not interesting
• We want the words that characterize the specific restaurant
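The slide names IDF without giving a formula. A common definition, assumed here, is idf(w) = log(N / df(w)), where N is the number of documents and df(w) is the number of documents containing w; a word's weight in a document is then its count multiplied by its IDF. The sketch below applies this to the three small example documents; the function names are illustrative.

```python
import math
from collections import Counter

# The three example documents, split into words.
docs = {
    1: ["the", "dog", "follows", "the", "cat"],
    2: ["the", "cat", "chases", "the", "cat"],
    3: ["the", "man", "walks", "the", "dog"],
}


def inverse_document_frequency(docs):
    """idf(w) = log(N / df(w)); assumed definition, other variants exist."""
    n_docs = len(docs)
    df = Counter()
    for words in docs.values():
        df.update(set(words))  # count each word once per document
    return {w: math.log(n_docs / d) for w, d in df.items()}


def tf_idf(words, idf):
    """Weight each word by (count in this document) * idf."""
    tf = Counter(words)
    return {w: tf[w] * idf[w] for w in tf}


idf = inverse_document_frequency(docs)
weights = {doc_id: tf_idf(words, idf) for doc_id, words in docs.items()}

for doc_id, w in weights.items():
    print(doc_id, w)
```

A word that appears in every document, such as "the", gets weight 0, while a word unique to one document gets the largest IDF.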
• Too frequent data (stop words), too infrequent (errors?), erroneous data, missing data,
outliers
• How should we weight the different pieces of data?
• We should make our decisions clear since they affect our findings.
30 0.8 90
32 0.5 80
24 0.3 95
Normalization
• Subtract the minimum value and divide by the
difference of the maximum value and minimum
value for each attribute
• Brings everything in the [0,1] range, minimum is zero
30 0.8 90
32 0.5 80
24 0.3 95
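The min-max normalization described above can be sketched as follows, applied to the three rows of example values (one attribute per column). The function name is illustrative, and mapping a constant column to 0 is an assumption made here to avoid division by zero.

```python
# Example values: one object per row, one attribute per column.
data = [
    [30, 0.8, 90],
    [32, 0.5, 80],
    [24, 0.3, 95],
]


def min_max_normalize(rows):
    """Rescale every column to [0, 1]: (x - min) / (max - min)."""
    columns = list(zip(*rows))
    lows = [min(c) for c in columns]
    highs = [max(c) for c in columns]
    return [
        # A constant column (max == min) is mapped to 0.0 by assumption.
        [(x - lo) / (hi - lo) if hi > lo else 0.0
         for x, lo, hi in zip(row, lows, highs)]
        for row in rows
    ]


normalized = min_max_normalize(data)
for row in normalized:
    print(row)
```

After normalization each attribute has minimum 0 and maximum 1, so attributes measured on very different scales contribute comparably.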
Normalization
• Are these documents similar?
new value = (old value – mean row value) [/ (max row value – min row value)]
[Figure: a plot on linear axes (x from 0 to 35000, y from 0 to 7000), followed by plots on logarithmic axes (x from 1 to 100000)]
[Figure: three separate plots over x from 0 to 100, with y-axis maxima 0.8, 0.4 and 1]
The importance of correct representation
• Putting all three plots together makes the differences easier to see
[Figure: Series1, Series3 and Series5 on shared axes; x from 0 to 100, y from 0 to 1]
• Green falls more slowly. Blue and Red seem more or less
the same
The importance of correct representation
• Making the plot in log-log space makes the differences clearer
[Figure: Series1, Series3 and Series5 on log-log axes; x from 1 to 100, y from 1 down to 1E-30]
[Figures: plot of points with axes x and y; Points-by-Points plot with a Similarity scale from 0 to 1; plot labeled Words, shown before clustering and after clustering]
Heatmaps
A very popular way to visualize data
http://projects.oregonlive.com/ucc-shooting/gun-deaths.php
Statistical Significance
• When we extract knowledge from a large dataset we
need to make sure that what we found is not an artifact of
randomness
• E.g., we find that many people buy milk and toilet paper
together.
• But many (more) people buy milk and toilet paper independently
• Statistical tests compare the results of an experiment
with those generated by a null hypothesis
• E.g., a null hypothesis is that people select items independently.
• A result is interesting if it cannot be produced by
randomness.
• An important problem is to define the null hypothesis correctly:
What is random?
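As a small illustration of comparing an observation against a null hypothesis of independence, the sketch below uses the five market-basket transactions from earlier in the lecture and the pair Milk and Diaper (the slide's own example is milk and toilet paper). It assumes a simple null model, not given in the lecture, in which each basket contains both items with probability P(Milk) · P(Diaper), so the number of such baskets is binomial. The function names are illustrative.

```python
from math import comb

baskets = [
    {"Bread", "Coke", "Milk"},
    {"Beer", "Bread"},
    {"Beer", "Coke", "Diaper", "Milk"},
    {"Beer", "Bread", "Diaper", "Milk"},
    {"Coke", "Diaper", "Milk"},
]


def support(itemset, baskets):
    """Fraction of baskets that contain every item of the itemset."""
    return sum(1 for b in baskets if itemset <= b) / len(baskets)


def binomial_tail(n, k, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, j) * p**j * (1 - p)**(n - j) for j in range(k, n + 1))


observed = support({"Milk", "Diaper"}, baskets)
# Null hypothesis: items are selected independently.
expected = support({"Milk"}, baskets) * support({"Diaper"}, baskets)

n = len(baskets)
observed_count = sum(1 for b in baskets if {"Milk", "Diaper"} <= b)
# Probability of seeing at least this many co-occurrences under the null model.
p_value = binomial_tail(n, observed_count, expected)

print(observed, expected, p_value)
```

The pair co-occurs more often than independence predicts, but with only five baskets the tail probability is large, so this co-occurrence could easily be produced by randomness. A different definition of "random" would give a different null model and possibly a different conclusion.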
Meaningfulness of Answers