Intro to Apache Mahout
Grant Ingersoll
Lucid Imagination
http://www.lucidimagination.com

Anyone Here Use Machine Learning?
- Any users of:
  - Google?
    - Search?
    - Priority Inbox?
  - Facebook?
  - Twitter?
  - LinkedIn?

Topics
- Background and Use Cases
- What can you do in Mahout?
- Where’s the community at?
- Resources
- K-Means in Hadoop (time permitting)

Definition
- “Machine Learning is programming computers to optimize a performance criterion using example data or past experience”
  - Intro. To Machine Learning by E. Alpaydin
- Subset of Artificial Intelligence
- Lots of related fields:
  - Information Retrieval
  - Stats
  - Biology
  - Linear algebra
  - Many more

Common Use Cases
- Recommend friends/dates/products
- Classify content into predefined groups
- Find similar content
- Find associations/patterns in actions/behaviors
- Identify key topics/summarize text
  - Documents and Corpora
- Detect anomalies/fraud
- Ranking search results
- Others?

Apache Mahout
- An Apache Software Foundation project to create scalable machine learning libraries under the Apache Software License
  - http://mahout.apache.org
- Why Mahout?
  - Many Open Source ML libraries either:
    - Lack Community
    - Lack Documentation and Examples
    - Lack Scalability
    - Lack the Apache License
    - Or are research-oriented
- Definition: http://dictionary.reference.com/browse/mahout

What does scalable mean to us?
- Goal: Be as fast and efficient as possible given the intrinsic design of the algorithm
  - Some algorithms won’t scale to massive machine clusters
  - Others fit logically on a Map Reduce framework like Apache Hadoop
  - Still others will need different distributed programming models
  - Others are already fast (SGD)
- Be pragmatic

Sampling of Who uses Mahout?
- https://cwiki.apache.org/confluence/display/MAHOUT/Powered+By+Mahout

What Can I do with Mahout Right Now?
- 3C + FPM + O = Mahout

Collaborative Filtering
- Extensive framework for collaborative filtering (recommenders)
- Recommenders
  - User based
  - Item based
- Online and Offline support
  - Offline can utilize Hadoop
- Many different Similarity measures
  - Cosine, LLR, Tanimoto, Pearson, others

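To make three of the similarity measures named above concrete, here is a minimal sketch in plain Python. It is not Mahout's API: the function names are made up for illustration, and the inputs are assumed to be equal-length rating lists (cosine, Pearson) or collections of item IDs (Tanimoto).

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length rating vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_a * norm_b)


def tanimoto_similarity(items_a, items_b):
    """Size of the intersection divided by size of the union of two item sets."""
    a, b = set(items_a), set(items_b)
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


def pearson_similarity(a, b):
    """Pearson correlation between two equal-length rating vectors."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0  # correlation is undefined for constant vectors
    return cov / math.sqrt(var_a * var_b)
```

Tanimoto ignores rating values and only looks at which items two users share, which is why it suits data with no explicit preference scores.
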
Clustering
- Document level
  - Group documents based on a notion of similarity
  - K-Means, Fuzzy K-Means, Dirichlet, Canopy, Mean-Shift, EigenCuts (Spectral)
  - All Map/Reduce
  - Distance Measures
    - Manhattan, Euclidean, other
- Topic Modeling
  - Cluster words across documents to identify topics
  - Latent Dirichlet Allocation (M/R)

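The two distance measures named on this slide can be written in a few lines. This is an illustrative sketch in plain Python, not Mahout's DistanceMeasure classes; the function names are assumptions, and vectors are assumed to be dense sequences of equal length.

```python
import math


def manhattan_distance(a, b):
    """Sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def euclidean_distance(a, b):
    """Straight-line distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
```

For the points (0, 0) and (3, 4), the Manhattan distance is 7 and the Euclidean distance is 5, so the choice of measure changes which documents count as "close".
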
Categorization
- Place new items into predefined categories:
  - Sports, politics, entertainment
  - Recommenders
- Implementations
  - Naïve Bayes (M/R)
  - Compl. Naïve Bayes (M/R)
  - Decision Forests (M/R)
  - Linear Regression (Seq. but Fast!)
- See Chapter 17 of Mahout in Action for Shop It To Me use case:
  - http://awe.sm/5FyNe

Freq. Pattern Mining
- Identify frequently co-occurrent items
- Useful for:
  - Query Recommendations
    - Apple -> iPhone, orange, OS X
  - Related product placement
  - Basket Analysis
- Map/Reduce
- http://www.amazon.com

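The idea of "frequently co-occurrent items" can be illustrated with a naive pair counter. This is a sketch only: it counts item pairs directly in memory and is not the algorithm Mahout runs on Map/Reduce. The function name and the `min_support` parameter are assumptions made for the example.

```python
from collections import Counter
from itertools import combinations


def frequent_pairs(transactions, min_support):
    """Count how often each pair of items appears together in a transaction
    and keep the pairs seen at least min_support times."""
    counts = Counter()
    for transaction in transactions:
        # Deduplicate and sort so each pair has one canonical ordering.
        for pair in combinations(sorted(set(transaction)), 2):
            counts[pair] += 1
    return {pair: n for pair, n in counts.items() if n >= min_support}
```

Given the queries `["apple", "iphone"]`, `["apple", "iphone", "os x"]` and `["apple", "orange"]` with `min_support=2`, only the pair `("apple", "iphone")` survives, which is the kind of association a query recommender would surface.
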
Other
- Primitive Collections!
- Collocations (M/R)
- Math library
  - Vectors, Matrices, etc.
- Noise Reduction via Singular Value Decomp (M/R)

Prepare Data from Raw content
- Data Sources:
  - Lucene integration
    - bin/mahout lucene.vector…
  - Document Vectorizer
    - bin/mahout seqdirectory …
    - bin/mahout seq2sparse …
  - Programmatically
    - See the Utils module in Mahout and the Iterator<Vector> classes
  - Database
  - File system

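Conceptually, vectorizing documents means mapping each term to an index and each document to a sparse vector of term weights. The sketch below shows that idea with raw term counts in plain Python. It does not reproduce what the bin/mahout tools do internally; the function names, whitespace tokenization, and lowercasing are all assumptions for illustration.

```python
from collections import Counter


def build_dictionary(docs):
    """Assign a stable integer index to every distinct term in the corpus."""
    terms = sorted({term for doc in docs for term in doc.lower().split()})
    return {term: index for index, term in enumerate(terms)}


def to_sparse_tf(doc, dictionary):
    """Represent a document as {term index: term count}, skipping unknown terms."""
    counts = Counter(doc.lower().split())
    return {dictionary[t]: n for t, n in counts.items() if t in dictionary}
```

Only non-zero entries are stored, which matters because a real corpus has far more distinct terms than any single document contains.
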
How to: Command Line
- Most algorithms have a Driver program
  - $MAHOUT_HOME/bin/mahout.sh helps with most tasks
- Prepare the Data
  - Different algorithms require different setup
- Run the algorithm
  - Single Node
  - Hadoop
- Print out the results or incorporate into application
  - Several helper classes: LDAPrintTopics, ClusterDumper, etc.

What’s Happening Now?
- Unified Framework for Clustering and Classification
- 0.5 release on the horizon (May?)
- Working towards 1.0 release by focusing on:
  - Tests, examples, documentation
  - API cleanup and consistency
- Gearing up for Google Summer of Code
  - New M/R work for Hidden Markov Models

Summary
- Machine learning is all over the web today
- Mahout is about scalable machine learning
- Mahout has functionality for many of today’s common machine learning tasks
- Many Mahout implementations use Hadoop

Resources
- http://mahout.apache.org
- http://cwiki.apache.org/MAHOUT
- {user|dev}@mahout.apache.org
- http://svn.apache.org/repos/asf/mahout/trunk
- http://hadoop.apache.org

Resources
- “Mahout in Action” by Owen, Anil, Dunning and Friedman
  - http://awe.sm/5FyNe
- “Introducing Apache Mahout”
  - http://www.ibm.com/developerworks/java/library/j-mahout/
- “Taming Text” by Ingersoll, Morton, Farris
- “Programming Collective Intelligence” by Toby Segaran
- “Data Mining - Practical Machine Learning Tools and Techniques” by Ian H. Witten and Eibe Frank
- “Data-Intensive Text Processing with MapReduce” by Jimmy Lin and Chris Dyer

K-Means
- Clustering Algorithm
  - Nicely parallelizable!
- http://en.wikipedia.org/wiki/K-means_clustering

K-Means in Map-Reduce
- Input:
  - Mahout Vectors representing the original content
  - Either:
    - A predefined set of initial centroids (Can be from Canopy)
    - --k: The number of clusters to produce
- Iterate
  - Do the centroid calculation (more in a moment)
- Clustering Step (optional)
- Output
  - Centroids (as Mahout Vectors)
  - Points for each Centroid (if Clustering Step was taken)

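The input, iterate, output flow above can be sketched as a single in-memory loop. This is plain Python for illustration, not Mahout's driver: the function names, the `max_iterations` and `convergence_delta` parameters, and the use of squared Euclidean distance are assumptions, and the initial centroids are assumed to be supplied by the caller.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, centroids, max_iterations=10, convergence_delta=1e-6):
    """Iterate centroid calculation until the centroids stop moving."""
    centroids = [list(c) for c in centroids]
    for _ in range(max_iterations):
        dim = len(centroids[0])
        sums = [[0.0] * dim for _ in centroids]
        counts = [0] * len(centroids)
        # Assign each point to its nearest centroid and accumulate sums.
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: squared_distance(p, centroids[i]))
            counts[nearest] += 1
            for d, value in enumerate(p):
                sums[nearest][d] += value
        # New centroid is the mean of its points; empty clusters stay put.
        new_centroids = [
            [s / counts[i] for s in sums[i]] if counts[i] else centroids[i]
            for i in range(len(centroids))
        ]
        largest_shift = max(squared_distance(old, new)
                            for old, new in zip(centroids, new_centroids))
        centroids = new_centroids
        if largest_shift <= convergence_delta:
            break
    return centroids
```

In the Map-Reduce version, the inner loop over points is what gets distributed, while the outer loop stays in a driver that launches one job per iteration.
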
Map-Reduce Iteration
- Each Iteration calculates the Centroids using:
  - KMeansMapper
  - KMeansCombiner
  - KMeansReducer
- Clustering Step
  - Calculate the points for each Centroid using:
    - KMeansClusterMapper

KMeansMapper
- During Setup:
  - Load the initial Centroids (or the Centroids from the last iteration)
- Map Phase
  - For each input
    - Calculate its distance from each Centroid and output the closest one
- Distance Measures are pluggable
  - Manhattan, Euclidean, Squared Euclidean, Cosine, others

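The map phase described above, with a pluggable distance measure, might look like the following sketch. It is plain Python rather than Mahout's KMeansMapper: the function names are hypothetical, and the distance measure is passed in as an ordinary function.

```python
def squared_euclidean(a, b):
    """Default distance measure for this sketch."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_map(point, centroids, distance=squared_euclidean):
    """Emit (index of the closest centroid, point) for one input point."""
    closest = min(range(len(centroids)),
                  key=lambda i: distance(point, centroids[i]))
    return closest, point
```

Swapping the measure is a matter of passing a different function, for example `kmeans_map(p, centroids, distance=lambda a, b: sum(abs(x - y) for x, y in zip(a, b)))` for Manhattan distance.
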
KMeansReducer
- Setup:
  - Load up clusters
  - Convergence information
  - Partial sums from KMeansCombiner (more in a moment)
- Reduce Phase
  - Sum all the vectors in the cluster to produce a new Centroid
  - Check for Convergence
  - Output cluster

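A reduce step along these lines can be sketched as follows. This is illustrative Python, not Mahout's KMeansReducer: it assumes each partial result arrives as a `(vector_sum, point_count)` pair, divides the total by the count to get the new centroid, and treats the cluster as converged when the centroid moves no more than a hypothetical `convergence_delta`.

```python
import math


def kmeans_reduce(old_centroid, partials, convergence_delta=0.001):
    """Merge partial sums for one cluster into a new centroid.

    Returns (new_centroid, converged).
    """
    dim = len(old_centroid)
    total = [0.0] * dim
    count = 0
    for partial_sum, n in partials:
        count += n
        for d in range(dim):
            total[d] += partial_sum[d]
    # Keep the old centroid if no points were assigned to this cluster.
    new_centroid = [t / count for t in total] if count else list(old_centroid)
    shift = math.sqrt(sum((a - b) ** 2
                          for a, b in zip(old_centroid, new_centroid)))
    return new_centroid, shift <= convergence_delta
```
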
KMeansCombiner
- Just like KMeansReducer, but only produces a partial sum of the cluster based on the data local to the Mapper

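A combiner in this style only needs to add up the vectors it sees locally and remember how many there were. The sketch below is plain Python, not Mahout's KMeansCombiner; the function name and the `(vector_sum, point_count)` output shape are assumptions.

```python
def kmeans_combine(cluster_id, vectors):
    """Collapse the vectors one mapper emitted for a cluster into a
    single (vector_sum, point_count) pair."""
    total = [0.0] * len(vectors[0])
    for vector in vectors:
        for d, value in enumerate(vector):
            total[d] += value
    return cluster_id, (total, len(vectors))
```

Sending one pair per cluster instead of every point cuts the data shuffled to the reducers, and carrying the count along lets the reducer still compute a correct mean.
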
KMeansClusterMapper
- Some applications only care about what the Centroids are, so this step is optional
- Setup:
  - Load up the clusters and the DistanceMeasure used
- Map Phase
  - Calculate which Cluster the point belongs to
  - Output <ClusterId, Vector>
IEDM 2024 Tutorial2_Advances in CMOS Technologies and Future Directions for C...
organizerofv
 
Build Your Own Copilot & Agents For Devs
Build Your Own Copilot & Agents For DevsBuild Your Own Copilot & Agents For Devs
Build Your Own Copilot & Agents For Devs
Brian McKeiver
 
DevOpsDays Atlanta 2025 - Building 10x Development Organizations.pptx
DevOpsDays Atlanta 2025 - Building 10x Development Organizations.pptxDevOpsDays Atlanta 2025 - Building 10x Development Organizations.pptx
DevOpsDays Atlanta 2025 - Building 10x Development Organizations.pptx
Justin Reock
 
Mobile App Development Company in Saudi Arabia
Mobile App Development Company in Saudi ArabiaMobile App Development Company in Saudi Arabia
Mobile App Development Company in Saudi Arabia
Steve Jonas
 
tecnologias de las primeras civilizaciones.pdf
tecnologias de las primeras civilizaciones.pdftecnologias de las primeras civilizaciones.pdf
tecnologias de las primeras civilizaciones.pdf
fjgm517
 
Generative Artificial Intelligence (GenAI) in Business
Generative Artificial Intelligence (GenAI) in BusinessGenerative Artificial Intelligence (GenAI) in Business
Generative Artificial Intelligence (GenAI) in Business
Dr. Tathagat Varma
 
The Evolution of Meme Coins A New Era for Digital Currency ppt.pdf
The Evolution of Meme Coins A New Era for Digital Currency ppt.pdfThe Evolution of Meme Coins A New Era for Digital Currency ppt.pdf
The Evolution of Meme Coins A New Era for Digital Currency ppt.pdf
Abi john
 
Cyber Awareness overview for 2025 month of security
Cyber Awareness overview for 2025 month of securityCyber Awareness overview for 2025 month of security
Cyber Awareness overview for 2025 month of security
riccardosl1
 
Transcript: #StandardsGoals for 2025: Standards & certification roundup - Tec...
Transcript: #StandardsGoals for 2025: Standards & certification roundup - Tec...Transcript: #StandardsGoals for 2025: Standards & certification roundup - Tec...
Transcript: #StandardsGoals for 2025: Standards & certification roundup - Tec...
BookNet Canada
 
2025-05-Q4-2024-Investor-Presentation.pptx
2025-05-Q4-2024-Investor-Presentation.pptx2025-05-Q4-2024-Investor-Presentation.pptx
2025-05-Q4-2024-Investor-Presentation.pptx
Samuele Fogagnolo
 
Special Meetup Edition - TDX Bengaluru Meetup #52.pptx
Special Meetup Edition - TDX Bengaluru Meetup #52.pptxSpecial Meetup Edition - TDX Bengaluru Meetup #52.pptx
Special Meetup Edition - TDX Bengaluru Meetup #52.pptx
shyamraj55
 
AI and Data Privacy in 2025: Global Trends
AI and Data Privacy in 2025: Global TrendsAI and Data Privacy in 2025: Global Trends
AI and Data Privacy in 2025: Global Trends
InData Labs
 
Manifest Pre-Seed Update | A Humanoid OEM Deeptech In France
Manifest Pre-Seed Update | A Humanoid OEM Deeptech In FranceManifest Pre-Seed Update | A Humanoid OEM Deeptech In France
Manifest Pre-Seed Update | A Humanoid OEM Deeptech In France
chb3
 
Semantic Cultivators : The Critical Future Role to Enable AI
Semantic Cultivators : The Critical Future Role to Enable AISemantic Cultivators : The Critical Future Role to Enable AI
Semantic Cultivators : The Critical Future Role to Enable AI
artmondano
 
Quantum Computing Quick Research Guide by Arthur Morgan
Quantum Computing Quick Research Guide by Arthur MorganQuantum Computing Quick Research Guide by Arthur Morgan
Quantum Computing Quick Research Guide by Arthur Morgan
Arthur Morgan
 
Into The Box Conference Keynote Day 1 (ITB2025)
Into The Box Conference Keynote Day 1 (ITB2025)Into The Box Conference Keynote Day 1 (ITB2025)
Into The Box Conference Keynote Day 1 (ITB2025)
Ortus Solutions, Corp
 
TrustArc Webinar: Consumer Expectations vs Corporate Realities on Data Broker...
TrustArc Webinar: Consumer Expectations vs Corporate Realities on Data Broker...TrustArc Webinar: Consumer Expectations vs Corporate Realities on Data Broker...
TrustArc Webinar: Consumer Expectations vs Corporate Realities on Data Broker...
TrustArc
 
Greenhouse_Monitoring_Presentation.pptx.
Greenhouse_Monitoring_Presentation.pptx.Greenhouse_Monitoring_Presentation.pptx.
Greenhouse_Monitoring_Presentation.pptx.
hpbmnnxrvb
 
Electronic_Mail_Attacks-1-35.pdf by xploit
Electronic_Mail_Attacks-1-35.pdf by xploitElectronic_Mail_Attacks-1-35.pdf by xploit
Electronic_Mail_Attacks-1-35.pdf by xploit
niftliyevhuseyn
 
SAP Modernization: Maximizing the Value of Your SAP S/4HANA Migration.pdf
SAP Modernization: Maximizing the Value of Your SAP S/4HANA Migration.pdfSAP Modernization: Maximizing the Value of Your SAP S/4HANA Migration.pdf
SAP Modernization: Maximizing the Value of Your SAP S/4HANA Migration.pdf
Precisely
 

Intro to Mahout -- DC Hadoop

  • 1. Intro to Apache Mahout. Grant Ingersoll, Lucid Imagination. http://www.lucidimagination.com
  • 2. Anyone Here Use Machine Learning? Any users of: Google? Search? Priority Inbox? Facebook? Twitter? LinkedIn?
  • 3. Topics: Background and Use Cases; What can you do in Mahout?; Where’s the community at?; Resources; K-Means in Hadoop (time permitting).
  • 4. Definition: “Machine Learning is programming computers to optimize a performance criterion using example data or past experience” (Intro. to Machine Learning by E. Alpaydin). Subset of Artificial Intelligence. Lots of related fields: Information Retrieval, Stats, Biology, Linear algebra, many more.
  • 5. Common Use Cases: Recommend friends/dates/products; Classify content into predefined groups; Find similar content; Find associations/patterns in actions/behaviors; Identify key topics/summarize text (Documents and Corpora); Detect anomalies/fraud; Ranking search results; Others?
  • 6. Apache Mahout: An Apache Software Foundation project to create scalable machine learning libraries under the Apache Software License ( http://mahout.apache.org ). Why Mahout? Many Open Source ML libraries either: Lack Community; Lack Documentation and Examples; Lack Scalability; Lack the Apache License; Or are research-oriented. Definition: http://dictionary.reference.com/browse/mahout
  • 7. What does scalable mean to us? Goal: Be as fast and efficient as possible given the intrinsic design of the algorithm. Some algorithms won’t scale to massive machine clusters; Others fit logically on a Map Reduce framework like Apache Hadoop; Still others will need different distributed programming models; Others are already fast (SGD). Be pragmatic.
  • 8. Sampling of Who uses Mahout? https://cwiki.apache.org/confluence/display/MAHOUT/Powered+By+Mahout
  • 9. What Can I do with Mahout Right Now? 3C + FPM + O = Mahout
  • 10. Collaborative Filtering: Extensive framework for collaborative filtering (recommenders). Recommenders: User based, Item based. Online and Offline support; Offline can utilize Hadoop. Many different Similarity measures: Cosine, LLR, Tanimoto, Pearson, others.
  • 11. Clustering. Document level: Group documents based on a notion of similarity; K-Means, Fuzzy K-Means, Dirichlet, Canopy, Mean-Shift, EigenCuts (Spectral); All Map/Reduce; Distance Measures: Manhattan, Euclidean, other. Topic Modeling: Cluster words across documents to identify topics; Latent Dirichlet Allocation (M/R).
  • 12. Categorization: Place new items into predefined categories: Sports, politics, entertainment; Recommenders. Implementations: Naïve Bayes (M/R), Compl. Naïve Bayes (M/R), Decision Forests (M/R), Linear Regression (Seq. but Fast!). See Chapter 17 of Mahout in Action for Shop It To Me use case: http://awe.sm/5FyNe
  • 13. Freq. Pattern Mining: Identify frequently co-occurrent items. Useful for: Query Recommendations (Apple -> iPhone, orange, OS X); Related product placement; Basket Analysis. Map/Reduce. http://www.amazon.com
  • 14. Other: Primitive Collections!; Collocations (M/R); Math library: Vectors, Matrices, etc.; Noise Reduction via Singular Value Decomp (M/R).
  • 15. Prepare Data from Raw content. Data Sources: Lucene integration (bin/mahout lucene.vector …); Document Vectorizer (bin/mahout seqdirectory …, then bin/mahout seq2sparse …); Programmatically (see the Utils module in Mahout and the Iterator<Vector> classes); Database; File system.
  • 16. How to: Command Line. Most algorithms have a Driver program; $MAHOUT_HOME/bin/mahout.sh helps with most tasks. Prepare the Data: Different algorithms require different setup. Run the algorithm: Single Node; Hadoop. Print out the results or incorporate into application: Several helper classes: LDAPrintTopics, ClusterDumper, etc.
  • 17. What’s Happening Now? Unified Framework for Clustering and Classification; 0.5 release on the horizon (May?); Working towards 1.0 release by focusing on: Tests, examples, documentation; API cleanup and consistency. Gearing up for Google Summer of Code; New M/R work for Hidden Markov Models.
  • 18. Summary: Machine learning is all over the web today; Mahout is about scalable machine learning; Mahout has functionality for many of today’s common machine learning tasks; Many Mahout implementations use Hadoop.
  • 20. Resources: “Mahout in Action” by Owen, Anil, Dunning and Friedman ( http://awe.sm/5FyNe ); “Introducing Apache Mahout” ( http://www.ibm.com/developerworks/java/library/j-mahout/ ); “Taming Text” by Ingersoll, Morton, Farris; “Programming Collective Intelligence” by Toby Segaran; “Data Mining - Practical Machine Learning Tools and Techniques” by Ian H. Witten and Eibe Frank; “Data-Intensive Text Processing with MapReduce” by Jimmy Lin and Chris Dyer.
  • 22. K-Means in Map-Reduce. Input: Mahout Vectors representing the original content, and either: a predefined set of initial centroids (can be from Canopy), or --k, the number of clusters to produce. Iterate: Do the centroid calculation (more in a moment). Clustering Step (optional). Output: Centroids (as Mahout Vectors); Points for each Centroid (if Clustering Step was taken).
  • 23. Map-Reduce Iteration. Each Iteration calculates the Centroids using: KMeansMapper, KMeansCombiner, KMeansReducer. Clustering Step: Calculate the points for each Centroid using: KMeansClusterMapper.
  • 24. KMeansMapper. During Setup: Load the initial Centroids (or the Centroids from the last iteration). Map Phase: For each input, calculate its distance from each Centroid and output the closest one. Distance Measures are pluggable: Manhattan, Euclidean, Squared Euclidean, Cosine, others.
  • 25. KMeansReducer. Setup: Load up clusters; Convergence information; Partial sums from KMeansCombiner (more in a moment). Reduce Phase: Sum all the vectors in the cluster to produce a new Centroid; Check for Convergence; Output cluster.
  • 26. KMeansCombiner: Just like KMeansReducer, but only produces a partial sum of the cluster based on the data local to the Mapper.
  • 27. KMeansClusterMapper. Some applications only care about what the Centroids are, so this step is optional. Setup: Load up the clusters and the DistanceMeasure used. Map Phase: Calculate which Cluster the point belongs to; Output <ClusterId, Vector>.
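
The K-Means slides above (22 to 27) describe one iteration as a mapper, a combiner and a reducer, with an optional clustering step. The sketch below simulates that flow in plain Python on a single machine. It is not Mahout's code or API: every function name here (`kmeans_map`, `kmeans_combine`, `kmeans_reduce`, `kmeans_cluster_map`, `kmeans_iteration`) is hypothetical, input splits are modelled as nested lists, and the reducer is assumed to divide the summed vectors by the point count to obtain the mean, which is the usual K-Means update.

```python
import math
from collections import defaultdict


def euclidean(a, b):
    """Pluggable distance measure: any function (vector, vector) -> float works."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def kmeans_map(point, centroids, distance=euclidean):
    """Mapper: emit (id of the closest centroid, (point, 1))."""
    closest = min(range(len(centroids)),
                  key=lambda i: distance(point, centroids[i]))
    return closest, (point, 1)


def kmeans_combine(pairs):
    """Combiner: partial (vector sum, count) per cluster for one mapper's local data."""
    partial = {}
    for cluster_id, (vec, count) in pairs:
        if cluster_id in partial:
            total, n = partial[cluster_id]
            partial[cluster_id] = ([t + v for t, v in zip(total, vec)], n + count)
        else:
            partial[cluster_id] = (list(vec), count)
    return list(partial.items())


def kmeans_reduce(partials):
    """Reducer: merge the partial sums of one cluster and return its new centroid."""
    total = [0.0] * len(partials[0][0])
    n = 0
    for vec, count in partials:
        total = [t + v for t, v in zip(total, vec)]
        n += count
    return [t / n for t in total]  # assumed: the mean of the cluster's points


def kmeans_cluster_map(point, centroids, distance=euclidean):
    """Optional clustering step: emit <ClusterId, Vector> for one point."""
    cluster_id, _ = kmeans_map(point, centroids, distance)
    return cluster_id, point


def kmeans_iteration(splits, centroids, distance=euclidean):
    """One iteration; each entry of `splits` stands in for one mapper's input."""
    shuffled = defaultdict(list)
    for split in splits:
        mapped = [kmeans_map(p, centroids, distance) for p in split]
        for cluster_id, partial in kmeans_combine(mapped):
            shuffled[cluster_id].append(partial)
    new_centroids = list(centroids)  # a centroid with no points keeps its position
    for cluster_id, partials in shuffled.items():
        new_centroids[cluster_id] = kmeans_reduce(partials)
    return new_centroids


splits = [[(1.0, 1.0), (9.0, 9.0)], [(1.0, 3.0), (11.0, 9.0)]]
centroids = [(0.0, 0.0), (10.0, 10.0)]
new_centroids = kmeans_iteration(splits, centroids)
```

Because the combiner emits a (sum, count) pair rather than an average, the reducer can merge results from any number of mappers without losing accuracy, which is why the slides describe the combiner as producing only partial sums.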

Editor's Notes

  • #10: 3C: The three C’s: clustering, classification and collaborative filtering. FPM: Frequent patternset mining. O: Other (math, collections, etc.)
  • #26: Convergence just checks to see how far the centroid has moved from the previous centroid
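
The convergence note above can be illustrated with a short sketch. It assumes Euclidean distance and a hypothetical `threshold` parameter with an arbitrary default; the function name `has_converged` is also made up for illustration and is not taken from Mahout.

```python
import math


def has_converged(old_centroids, new_centroids, threshold=0.001):
    """Return True when no centroid moved farther than `threshold` (assumed value)."""
    for old, new in zip(old_centroids, new_centroids):
        # Distance the centroid travelled since the previous iteration.
        moved = math.sqrt(sum((o - n) ** 2 for o, n in zip(old, new)))
        if moved > threshold:
            return False
    return True


still_moving = has_converged([(0.0, 0.0)], [(1.0, 0.0)])
settled = has_converged([(0.0, 0.0)], [(0.0, 0.0005)])
```

A driver loop would typically repeat the iteration until such a check passes for every cluster or a maximum iteration count is reached.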