Approximate methods for scalable data mining
Andrew Clegg
Data Analytics & Visualization Team
Pearson Technology
Twitter: @andrew_clegg
Approximate methods for scalable data mining | 24/04/13
What are approximate methods?
Trading accuracy for scalability
• Often use probabilistic data structures
– a.k.a. sketches or signatures
• Mostly stream-friendly
– Allow you to query data you haven’t even kept!
• Generally simple to parallelize
• Predictable error rate (can be tuned)
What are approximate methods?
Trading accuracy for scalability
• Represent characteristics or summary of data
• Use much less space than full dataset (generally via hashing tricks)
– Can alleviate disk, memory, network bottlenecks
• Generally incur more CPU load than exact methods
– This may not be true overall in a distributed system
○ e.g. there is less data to [de]serialize
– Many data-centric systems have CPU to spare anyway
Why approximate methods?
A real-life example
Icons from Dropline Neu! http://findicons.com/pack/1714/dropline_neu
Counting unique terms in time buckets across ElasticSearch shards
[Diagram: cluster nodes → master node → client]
• Cluster nodes → master node: unique terms per bucket per shard
• Master node: globally unique terms per bucket
• Master node → client: number of globally unique terms per bucket
Why approximate methods?
A real-life example
But what if each bucket contains a LOT of terms?
… and what if there are too many to fit in memory?
[Diagram: costs incurred along the same pipeline]
• Memory cost
• CPU cost to serialize
• Network transfer cost
• CPU cost to deserialize
• CPU & memory cost to merge & count sets
Cardinality estimation
Approximate distinct counts
Intuitive explanation
Long runs of trailing 0s in random bit strings are rare. But the more bit strings you look at, the more likely you are to see a long one.
So “longest run of trailing 0s seen” can be used as an estimator of “number of unique bit strings seen”.
01110001
11101010
00100101
11001100
11110100
11101100
00010100
00000001
00000010
10001110
01110100
01101010
01111111
00100010
00110000
00001010
01000100
01111010
01011101
00000100
Cardinality estimation
Probabilistic counting: basic algorithm
Counting the items
• Let n = 0
• For each input item:
– Hash item into bit string
– Count trailing zeroes in bit string
– If this count > n:
○ Let n = count
Calculating the count
• n = longest run of trailing 0s seen
• Estimated cardinality (“count distinct”) = 2^n … that’s it!
This is an estimate, but not a great one. But…
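The basic algorithm above can be sketched in a few lines of Python. This is an illustrative sketch rather than code from the talk: the helper names (`trailing_zeros`, `hash32`, `estimate_cardinality`) and the choice of SHA-1 truncated to 32 bits as the hash are assumptions made here.

```python
import hashlib


def trailing_zeros(x: int, width: int = 32) -> int:
    """Count trailing zero bits in x (returns `width` if x is 0)."""
    if x == 0:
        return width
    # x & -x isolates the lowest set bit; its bit_length gives its position.
    return (x & -x).bit_length() - 1


def hash32(item: str) -> int:
    """Hash an item to a 32-bit integer (SHA-1 is an arbitrary choice here)."""
    digest = hashlib.sha1(item.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")


def estimate_cardinality(items) -> int:
    """Return 2^n, where n is the longest run of trailing 0s seen."""
    n = 0
    for item in items:
        count = trailing_zeros(hash32(item))
        if count > n:
            n = count
    return 2 ** n
```

Because only the maximum is kept, repeated items never change the result, and the estimate is always a power of two, which is one reason it is "not a great one" on its own.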
HyperLogLog algorithm
Billions of distinct values in 1.5KB of RAM with 2% relative error
Image: http://www.aggregateknowledge.com/science/blog/hll.html
Cool properties
• Stream-friendly: no need to keep data
• Error rates are predictable and tunable
• Size and speed stay constant
• Trivial to parallelize
– Combine two HLL counters by taking the max of each register
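To make the register idea concrete, here is a minimal HyperLogLog sketch in Python. It is a simplified illustration under stated assumptions, not the implementation from any of the libraries mentioned in this talk: the class and method names are made up for the example, SHA-1 truncated to 64 bits stands in for the hash, and it follows the HyperLogLog paper in using the position of the leftmost 1-bit (rather than trailing 0s) plus the paper's small-range correction. With m registers the standard error is roughly 1.04/√m.

```python
import hashlib
import math


class HyperLogLog:
    """Toy HyperLogLog with m = 2^p registers (illustrative only)."""

    def __init__(self, p: int = 10):
        self.p = p
        self.m = 1 << p
        self.registers = [0] * self.m

    def add(self, item: str) -> None:
        digest = hashlib.sha1(item.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")        # 64-bit hash
        idx = h >> (64 - self.p)                     # first p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)        # remaining 64 - p bits
        # Rank = 1-based position of the leftmost 1-bit in the remaining bits.
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def merge(self, other: "HyperLogLog") -> "HyperLogLog":
        """Combine two counters by taking the max of each register."""
        if self.p != other.p:
            raise ValueError("counters must use the same number of registers")
        merged = HyperLogLog(self.p)
        merged.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]
        return merged

    def count(self) -> float:
        alpha = 0.7213 / (1 + 1.079 / self.m)        # bias constant for m >= 128
        raw = alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            # Small-range correction: fall back to linear counting.
            return self.m * math.log(self.m / zeros)
        return raw
```

Merging two counters this way gives exactly the same registers as feeding the union of both streams into a single counter, which is what makes the per-shard use case in the earlier example cheap.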
Resources on cardinality estimation
HyperLogLog paper: http://algo.inria.fr/flajolet/Publications/FlFuGaMe07.pdf
Java implementations: https://github.com/clearspring/stream-lib/
Algebird implements HyperLogLog (and much more!) in Scalding: https://github.com/twitter/algebird
Simmer wraps Algebird in Hadoop Streaming command line: https://github.com/avibryant/simmer
Our ElasticSearch plugin: https://github.com/ptdavteam/elasticsearch-approx-plugin
MetaMarkets blog: http://metamarkets.com/2012/fast-cheap-and-98-right-cardinality-estimation-for-big-data/
Aggregate Knowledge blog, including JavaScript implementation and D3 visualization: http://blog.aggregateknowledge.com/2012/10/25/sketch-of-the-day-hyperloglog-cornerstone-of-a-big-data-infrastructure/
Bloom filters
Set membership test with chance of false positives
Image: http://en.wikipedia.org/wiki/Bloom_filter
At least one 0 means w definitely isn’t in set.
All 1s would mean w probably is in set.
Hash each item n times ⇒ indices into bit field.
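A Bloom filter along these lines might look as follows in Python. This is a minimal sketch, not a tuned implementation: the class name, the default sizes, and the trick of deriving n hash functions by salting SHA-1 with a seed are all assumptions made for illustration, and one byte is used per bit for readability.

```python
import hashlib


class BloomFilter:
    """Toy Bloom filter: num_hashes indices per item into a bit field."""

    def __init__(self, num_bits: int = 1024, num_hashes: int = 3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for clarity

    def _indices(self, item: str):
        # Derive each hash function by salting the item with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha1(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item: str) -> None:
        for i in self._indices(item):
            self.bits[i] = 1

    def __contains__(self, item: str) -> bool:
        # Any 0 means "definitely not in set"; all 1s means "probably in set".
        return all(self.bits[i] for i in self._indices(item))
```

Usage is `bf.add("w")` followed by `"w" in bf`. An item that was added is always reported as present; an item that was never added is usually, but not always, reported as absent.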
Count-min sketch
Frequency histogram estimation with chance of over-counting
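[Diagram: “foo” and “bar” are each hashed by h1, h2 and h3 into counter arrays A1, A2 and A3. In A1 and A2 they land in different cells (+1 each); in A3 they collide in the same cell (+2).]
count(“foo”) = min(1, 1, 2) = 1
More hashes / arrays ⇒ reduced chance of overcounting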
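The same idea in Python, as a rough sketch: one counter array per hash function, increment one cell per array on update, and take the minimum on query. The class name, the default width and depth, and the salted SHA-1 hashing are illustrative assumptions, not details from the talk.

```python
import hashlib


class CountMinSketch:
    """Toy count-min sketch: `depth` arrays of `width` counters each."""

    def __init__(self, width: int = 1000, depth: int = 3):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, row: int, item: str) -> int:
        # One hash function per row, derived by salting with the row number.
        digest = hashlib.sha1(f"{row}:{item}".encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % self.width

    def add(self, item: str, count: int = 1) -> None:
        for row in range(self.depth):
            self.table[row][self._index(row, item)] += count

    def estimate(self, item: str) -> int:
        # Collisions can only inflate a cell, so the minimum is the best guess.
        return min(self.table[row][self._index(row, item)]
                   for row in range(self.depth))
```

The estimate can never be lower than the true count, only higher. Shrinking `width` to 1 forces every item to collide and makes the over-counting easy to see.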
Random hyperplanes
Locality-sensitive hashing for approximate nearest neighbours
[Diagram: Item1 and Item2 drawn as vectors separated by angle θ, cut by random hyperplanes h1, h2 and h3]
Hash(Item1) = 011
Hash(Item2) = 001
As the cosine distance decreases, the probability of a hash match increases
Bitwise Hamming distance correlates with cosine distance
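A small Python sketch of the random-hyperplane hash, using only the standard library. The function names and the seeded Gaussian sampling are assumptions for illustration. Each hyperplane contributes one bit: which side of the plane the vector falls on. For a single random hyperplane, two vectors at angle θ get the same bit with probability 1 − θ/π, which is why the Hamming distance between signatures tracks the angle.

```python
import random


def make_hyperplanes(num_planes: int, dim: int, seed: int = 0):
    """Sample random hyperplanes as Gaussian normal vectors."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_planes)]


def lsh_signature(vector, hyperplanes):
    """One bit per hyperplane: 1 if the vector lies on its positive side."""
    return [
        1 if sum(w * x for w, x in zip(plane, vector)) >= 0 else 0
        for plane in hyperplanes
    ]


def hamming(a, b) -> int:
    """Number of positions where two signatures differ."""
    return sum(x != y for x, y in zip(a, b))
```

Scaling a vector by a positive constant leaves its signature unchanged, since only the direction matters, while a vector and its negation disagree on every bit.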
Feature hashing
High-dimensional machine learning without feature dictionary
[Diagram: each token of “reduce the size of your feature vector with this one weird old trick” is hashed to an index in a fixed-size vector, and the cell at that index is incremented (+1)]
h(“reduce”) = 9
h(“the”) = 3
h(“size”) = 1
. . .
Effect of collisions on overall classification accuracy is surprisingly small!
Multiple hashes, or a 1-bit “sign hash”, can reduce the effects of collisions if necessary
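As a rough Python illustration of the hashing trick: each token is hashed straight to a vector index, so no feature dictionary is ever built. The function names, the vector size of 16, and the salted MD5 hash are assumptions made for this sketch, so the indices it produces will not match the example values on the slide. The optional `signed` flag shows the 1-bit sign hash, where a second hash decides whether to add +1 or −1 so that colliding tokens tend to cancel rather than pile up.

```python
import hashlib


def _hash(token: str, salt: str) -> int:
    """Salted hash of a token (MD5 is an arbitrary, non-cryptographic use here)."""
    digest = hashlib.md5(f"{salt}:{token}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def hash_features(tokens, dim: int = 16, signed: bool = False):
    """Map a list of tokens to a fixed-size count vector via hashing."""
    vec = [0] * dim
    for token in tokens:
        idx = _hash(token, "index") % dim
        sign = 1
        if signed and _hash(token, "sign") & 1:
            sign = -1  # 1-bit sign hash
        vec[idx] += sign
    return vec


tokens = "reduce the size of your feature vector with this one weird old trick".split()
vector = hash_features(tokens, dim=16)
```

The output length depends only on `dim`, not on the vocabulary, which is what makes this usable on streams with an unbounded set of features.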
Thanks for listening
And some further reading…
Great ebook available free from: http://infolab.stanford.edu/~ullman/mmds.html