http://purdygoodengineering.com http://anant.us
Accumulo and Spark
With MLlib and GraphX
Introduction
● Section 1: Understanding the Technology
○ Big Picture
○ Accumulo
○ Spark
○ Example Code
● Section 2: Use Cases
○ Multi-Tenant Data Processing
○ Machine Learning / Graph Processing in Spark
○ Example ML + Graph on Business Data
● Questions and Answers
● Contact Information
Section 1: Understanding the Technology
● Section 1: Understanding the Technology
○ Big Picture
■ Why Accumulo
■ Why Spark
○ Accumulo
■ Key/Value Structure
■ Table Structure
■ Cell Level Security
■ Splits
■ Reads (scans)
■ Writes (upserts)
■ Deletes
○ Spark
■ Batch/Streaming
■ Machine Learning
■ Graph Processing
○ Example Code
■ Writing to Accumulo
■ Reading from Accumulo
■ Shell
Section 1: Big Picture
● Accumulo
○ Scalable, sorted, distributed key/value store with cell level security
● Spark
○ General compute engine for large-scale data processing
■ Batch Processing
■ Streaming
■ Machine Learning Library
■ Graph Processing
● Use Spark for compute and Accumulo for storage to get a secure, distributed,
scalable solution
Section 1: Accumulo: Key Structure
(image from accumulo.apache.org)
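The key layout shown in the figure can be sketched in plain Python. This is a toy model, not the Accumulo client API: the `Key` class and the `sort_key` helper are hypothetical names, and the sample entries are invented. The fields (row ID, column family, column qualifier, visibility, timestamp) and the sort order (ascending on the row and column fields, newest timestamp first) follow Accumulo's documented data model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Key:
    """Toy model of an Accumulo key (hypothetical class, not the client API)."""
    row: str
    family: str
    qualifier: str
    visibility: str
    timestamp: int


def sort_key(k: Key):
    # Ascending on row and column fields, descending on timestamp (newest first).
    return (k.row, k.family, k.qualifier, k.visibility, -k.timestamp)


# Invented sample entries: key -> value.
entries = {
    Key("row2", "contact", "email", "private", 10): "b@example.com",
    Key("row1", "contact", "email", "private", 10): "old@example.com",
    Key("row1", "contact", "email", "private", 20): "new@example.com",
}

ordered = sorted(entries, key=sort_key)
for k in ordered:
    print(k.row, k.family, k.qualifier, k.visibility, k.timestamp, entries[k])
```

Because the newest timestamp sorts first, a reader that takes the first entry per row and column sees the latest version of a cell.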
Section 1: Accumulo: Key Structure
(figure: Accumulo table design compared with RDBMS table design)
Section 1: Accumulo: Table Structure
● Each table has many tablets (distributed across nodes)
● Tablet data is stored in HDFS, which replicates it (default replication factor is 3)
● All entries for a given row reside on the same tablet
○ A Row ID design strategy needs to ensure binning is evenly distributed
○ Each table has “splits” which determine binning
○ If individual rows are still too large, a sharding strategy is required
Section 1: Accumulo: Cell Level Security
● Each cell (or field) has its own access control, determined by its visibility
● Each user has authorizations, which correspond to visibilities
● A user can retrieve only those fields whose visibilities the user's
authorizations satisfy
● Visibilities support limited logic: AND (&) and OR (|)
○ e.g. private|system, public&dna_partner
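How a visibility expression is checked against a user's authorizations can be sketched with a small evaluator. This is a toy in plain Python, not Accumulo's own parser: `tokenize` and `can_read` are hypothetical names, and the evaluator is more lenient than Accumulo (it tolerates spaces). Like Accumulo, it rejects expressions that mix & and | without parentheses.

```python
import re

TOKEN = re.compile(r"\s*([A-Za-z0-9_]+|[&|()])")


def tokenize(expr):
    """Split a visibility expression into labels, operators and parentheses."""
    tokens, pos = [], 0
    expr = expr.strip()
    while pos < len(expr):
        match = TOKEN.match(expr, pos)
        if not match:
            raise ValueError(f"bad character at {pos} in {expr!r}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


def can_read(expr, auths):
    """Return True if the authorization set satisfies the visibility expression."""
    tokens = tokenize(expr)
    if not tokens:
        return True  # an empty visibility is readable by everyone
    pos = 0

    def parse_expr():
        nonlocal pos
        value = parse_term()
        op = None
        while pos < len(tokens) and tokens[pos] in ("&", "|"):
            if op is not None and tokens[pos] != op:
                raise ValueError("mixing & and | requires parentheses")
            op = tokens[pos]
            pos += 1
            rhs = parse_term()
            value = (value and rhs) if op == "&" else (value or rhs)
        return value

    def parse_term():
        nonlocal pos
        if pos >= len(tokens):
            raise ValueError("unexpected end of expression")
        tok = tokens[pos]
        pos += 1
        if tok == "(":
            value = parse_expr()
            if pos >= len(tokens) or tokens[pos] != ")":
                raise ValueError("missing closing parenthesis")
            pos += 1
            return value
        if tok in ("&", "|", ")"):
            raise ValueError(f"unexpected {tok!r}")
        return tok in auths  # a label is satisfied if the user holds it

    result = parse_expr()
    if pos != len(tokens):
        raise ValueError("unexpected trailing tokens")
    return result


print(can_read("private|system", {"system"}))
print(can_read("public&dna_partner", {"public"}))
```

With the two expressions from the slide, a user holding only `system` can read a `private|system` cell, while a user holding only `public` cannot read a `public&dna_partner` cell.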
Section 1: Accumulo: Splits
● Each table starts as a single tablet (no split points) by default
● Splits can be added to tables
● Accumulo auto-splits tablets when they get too large
● Table splits and maximum tablet size are configurable
● Row IDs are generally hashed to support even distribution
● Example splits based on hashing
○ 0,1,2,3,4,5,6,7,8,9,a,b,c,d,e,f
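The hashed-row-ID idea above can be sketched as follows. This is an illustration in plain Python, not Accumulo code: `hashed_row_id` and `tablet_index` are hypothetical helpers, the SHA-256 prefix and the `customer-N` keys are assumptions chosen for the example, and the split points are the sixteen hex characters from the slide. A tablet is modeled as holding the rows in (previous split, split].

```python
import bisect
import hashlib
from collections import Counter

SPLITS = list("0123456789abcdef")  # split points from the slide


def hashed_row_id(natural_key: str) -> str:
    # Prefix the natural key with a few hex digest characters so that
    # lexicographically close keys land in different tablets.
    digest = hashlib.sha256(natural_key.encode("utf-8")).hexdigest()
    return f"{digest[:4]}_{natural_key}"


def tablet_index(row_id: str, splits=SPLITS) -> int:
    # Each tablet covers (previous split, split]; bisect_left finds the
    # first split point that is >= row_id, which is that tablet's end row.
    return bisect.bisect_left(splits, row_id)


# Distribute 1000 invented keys and count rows per tablet.
counts = Counter(tablet_index(hashed_row_id(f"customer-{i}")) for i in range(1000))
for index in sorted(counts):
    print(index, counts[index])
```

Sixteen split points give seventeen tablets; with hex-prefixed row IDs the first tablet (rows up to and including "0") stays empty and the remaining sixteen share the load.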
Section 1: Accumulo: Reads
● Reads are scans
○ Scanner
○ BatchScanner (parallelizes over ranges)
● MapReduce/Spark
○ AccumuloInputFormat (one field at a time)
○ AccumuloRowInputFormat (one row at a time)
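The difference between scanning one range and scanning many ranges in parallel can be sketched with an in-memory sorted table. This is a toy model in plain Python, not the Accumulo Scanner or BatchScanner API: `scan` and `batch_scan` are hypothetical helpers and the table contents are invented. One difference to keep in mind: this sketch returns chunks in the order the ranges were given, whereas a real BatchScanner makes no ordering guarantee.

```python
import bisect
from concurrent.futures import ThreadPoolExecutor

# Invented sample data, kept sorted by row ID like an Accumulo table.
TABLE = sorted([("a1", "x"), ("a2", "y"), ("b1", "z"), ("c1", "w")])
ROWS = [row for row, _ in TABLE]


def scan(start, end):
    """Return entries with start <= row < end, in sorted order."""
    lo = bisect.bisect_left(ROWS, start)
    hi = bisect.bisect_left(ROWS, end)
    return TABLE[lo:hi]


def batch_scan(ranges, threads=2):
    """Scan several ranges in parallel and concatenate the results."""
    with ThreadPoolExecutor(max_workers=threads) as pool:
        chunks = pool.map(lambda r: scan(*r), ranges)
        return [entry for chunk in chunks for entry in chunk]


print(scan("a", "b"))
print(batch_scan([("c", "d"), ("a", "b")]))
```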
Section 1: Accumulo: Writes
● Writes
○ Writer
○ BatchWriter (parallelizes over tablets)
● MapReduce/Spark
○ AccumuloOutputFormat
○ AccumuloFileOutputFormat (bulk ingest)
● Both use Mutations to write to Accumulo
Section 1: Accumulo: Mutations (write and delete)
● Mutations are used to write and delete
● Mutation.put (to write)
● Mutation.putDelete (to delete)
● Writes are upserts (inserts or updates)
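The upsert and delete semantics above can be sketched with a toy in-memory table. `ToyMutation` and `ToyTable` are hypothetical classes in plain Python, not the Accumulo API; they only mirror the idea that a mutation groups puts and deletes for one row. A real Accumulo delete writes a delete marker rather than removing the cell immediately; the toy simply drops the entry.

```python
class ToyMutation:
    """Collects changes for a single row (hypothetical stand-in for a Mutation)."""

    def __init__(self, row):
        self.row = row
        self.updates = []

    def put(self, family, qualifier, value):
        self.updates.append((family, qualifier, value))

    def put_delete(self, family, qualifier):
        self.updates.append((family, qualifier, None))  # None marks a delete


class ToyTable:
    """In-memory table keyed by (row, family, qualifier)."""

    def __init__(self):
        self.cells = {}

    def apply(self, mutation):
        for family, qualifier, value in mutation.updates:
            key = (mutation.row, family, qualifier)
            if value is None:
                self.cells.pop(key, None)  # delete the cell if present
            else:
                self.cells[key] = value    # insert or overwrite: an upsert


table = ToyTable()

first = ToyMutation("acct-1")
first.put("ledger", "balance", "100")  # insert
first.put("ledger", "owner", "alice")
table.apply(first)

second = ToyMutation("acct-1")
second.put("ledger", "balance", "250")  # same key: update in place
second.put_delete("ledger", "owner")    # delete
table.apply(second)

print(table.cells)
```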
Section 1: Accumulo
● accumulo.apache.org
● Download accumulo
● Examples
● Documentation
Concerned about scaling? How about a graph with 4T nodes and 70T edges?
See: http://www.pdl.cmu.edu/SDI/2013/slides/big_graph_nsa_rd_2013_56002v1.pdf
Section 1: Spark: MapReduce first
● Hadoop MapReduce (batch processing)
○ Mapping
○ Reducing
○ Chain jobs
○ 95% IO (each job must read/write to disk)
○ scalable
Section 1: Spark
● Batch Processing - MapReduce-style, with many more functions
● Streaming - micro-batch processing
● Machine Learning - MLlib
● Graph Processing - GraphX
● Many Languages - (Java, Scala, Python, R)
Section 1: Spark
● spark.apache.org
● Download spark
● Example code
● Documentation
Section 1: Example Code
Simple examples for bookkeeping with Spark and Accumulo
https://github.com/matthewpurdy/purdy-good/tree/master/purdy-good-spark/purdy-good-spark-accumulo
Section 2: Use Case(s) Machine Learning and
Graph Processing
● Multi-Tenant Data Processing
● Machine Learning / Graph Processing in Spark
● Example Use Case of ML + Graph on Business Data
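Section 2: Multi-Tenant Data Processing Needs
Customer (C), Provider (P). Data is Customer Private (C), Customer Data shared
w/ Provider (P) & (C), or Private Provider Data for Economy of Scale (P).
● IBM
○ Teams: Sales, Marketing
○ Customer Private: Indicators
○ Customer Data shared w/ Provider: Relationships, Classification
○ Private Provider Data: Classification Model, Relationship Graph
● Apple
○ Teams: Marketing, Finance
○ Customer Private: Indicators
○ Customer Data shared w/ Provider: Correlation, Prediction
○ Private Provider Data: Correlation Model, Prediction Model
● Microsoft
○ Teams: Sales, Marketing, Finance
○ Customer Private: Indicators
○ Customer Data shared w/ Provider: Relationships, Correlation, Prediction
○ Private Provider Data: Correlation Model, Prediction Model, Relationship Graph
● Google
○ Teams: Finance
○ Customer Private: Indicators
○ Customer Data shared w/ Provider: Correlation, Prediction
○ Private Provider Data: Correlation Model, Prediction Model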
Section 2: Multi-Tenant Data Processing Needs
Customer (C) (P) & (C) Provider (P)
C User C Team C Management C Management
P Analytics
P Analytics
P Support
CU Manager
CU Employee
CT Sales CM Executive CM Executive
CU Manager
PA * / PS *
PA * / PS *
CU Manager
CU Employee
CT Marketing CM Executive CM Executive
CU Manager
PA * / PS *
PA * / PS *
CU Employee CT Research CM Executive CM Executive
CU Manager
PA * / PS *
PA * / PS *
CU Employee CT Finance CM Executive CM Executive
CU Manager
PA * / PS *
PA * / PS *
Section 2: Multi-Tenant Data Processing Needs
● Analyze Sales Team successes (Closed Accounts) to recommend companies
to target for Marketing campaigns.
● Analyze Sales Team users' social accounts against social network users at
recommended companies to create Call Lists.
● Correlate historic Marketing data (Traffic & Conversions) with historic Sales
data (Leads & Closed Accounts) and historic Finance data (Revenue & Profit) to
predict Sales from current Marketing & Sales activities.
Section 2: Out of the Box: MLlib in Spark
● Classification
● Regression
● Decision Trees
● Recommendation
● Clustering
● Topic Modeling
● Feature Transformations
● ML Pipelining / Persistence
● “Based on past performance in the companies in the CRM, the most successful
sales have come from these categories, so go after these companies.”
Section 2: Out of the Box: MLlib in Spark
● Load Data
● Extract Features
● Train Model
● Find Best Model
● Use Model to Predict
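The five steps above can be sketched end to end without Spark. This is a minimal illustration in plain Python, not MLlib code: the (marketing traffic, sales) numbers are invented, all function names are hypothetical, and the model is a one-variable ridge regression chosen only because it fits in a few lines. In MLlib the same shape would typically be expressed with feature transformers, an estimator and a model-selection step.

```python
def load_data():
    # Step 1: load data. Invented (marketing_traffic, sales) pairs.
    return [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 7.8), (5.0, 10.1), (6.0, 12.2)]


def extract_features(rows):
    # Step 2: extract features (x) and labels (y).
    return [r[0] for r in rows], [r[1] for r in rows]


def train(xs, ys, reg):
    # Step 3: train a one-variable ridge regression in closed form.
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / (sxx + reg)
    return slope, mean_y - slope * mean_x


def predict(model, x):
    slope, intercept = model
    return slope * x + intercept


def mse(model, xs, ys):
    return sum((predict(model, x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def find_best_model(train_rows, valid_rows, regs):
    # Step 4: pick the regularization value with the lowest validation error.
    train_x, train_y = extract_features(train_rows)
    valid_x, valid_y = extract_features(valid_rows)
    candidates = [(reg, train(train_x, train_y, reg)) for reg in regs]
    return min(candidates, key=lambda c: mse(c[1], valid_x, valid_y))


rows = load_data()
best_reg, best_model = find_best_model(rows[:4], rows[4:], regs=[0.0, 1.0, 10.0])

# Step 5: use the selected model to predict.
print(best_reg, best_model, predict(best_model, 7.0))
```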
Section 2: KeystoneML - End to End ML
http://keystone-ml.org/
Section 2: Out of the Box: GraphX in Spark
● PageRank
● Connected components
● Label propagation
● SVD++
● Strongly connected components
● Triangle count
● “Based on the social graph of sales team members and the companies in your
CRM, talk to the companies you are “closest” to.”
Section 2: Out of the Box: GraphX in Spark
● Load Vertices RDD
● Load Edges RDD
● Create Graph from Vertices & Edges RDDs
● Run Graph Process / Query
● Get Data
http://ampcamp.berkeley.edu/big-data-mini-course/graph-analytics-with-graphx.html
Section 2: Out of the Box: GraphX in Spark
● Load Edges into Graph
● Run PageRank
● Load Nodes into RDD
● Join Users RDD with Rank
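The four steps above can be sketched without Spark. This is a textbook, normalized PageRank in plain Python, not GraphX's implementation (GraphX would run it distributed over RDDs); `page_rank` is a hypothetical helper, and the edge list and user names are invented for illustration.

```python
def page_rank(edges, damping=0.85, iterations=50):
    """Iterative PageRank over a list of (source, destination) edges."""
    nodes = sorted({node for edge in edges for node in edge})
    out_links = {node: [] for node in nodes}
    for src, dst in edges:
        out_links[src].append(dst)

    rank = {node: 1.0 / len(nodes) for node in nodes}
    for _ in range(iterations):
        new_rank = {node: (1.0 - damping) / len(nodes) for node in nodes}
        for src in nodes:
            # A node without outgoing links spreads its rank over all nodes.
            targets = out_links[src] or nodes
            share = damping * rank[src] / len(targets)
            for dst in targets:
                new_rank[dst] += share
        rank = new_rank
    return rank


# Load edges into the graph (invented "follows" relationships).
edges = [(1, 2), (2, 3), (3, 1), (4, 1)]

# Run PageRank.
rank = page_rank(edges)

# Load nodes (users) and join them with their rank.
users = {1: "alice", 2: "bob", 3: "carol", 4: "dave"}
ranked_users = sorted(
    ((users[node], score) for node, score in rank.items()),
    key=lambda pair: -pair[1],
)
for name, score in ranked_users:
    print(name, round(score, 4))
```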
Questions and Answers
?
Contact Information
Matthew Purdy
● matthew.purdy@purdygoodengineering.com
● http://www.purdygoodengineering.com
● https://www.linkedin.com/in/matthewpurdy
● https://github.com/matthewpurdy
Rahul Singh
● rahul.singh@anant.us
● http://www.anant.us
● http://www.linkedin.com/in/xingh
● https://github.com/xingh
Ad

More Related Content

Viewers also liked (14)

HBase and Accumulo | Washington DC Hadoop User Group
HBase and Accumulo | Washington DC Hadoop User GroupHBase and Accumulo | Washington DC Hadoop User Group
HBase and Accumulo | Washington DC Hadoop User Group
Cloudera, Inc.
 
Integrate Solr with real-time stream processing applications
Integrate Solr with real-time stream processing applicationsIntegrate Solr with real-time stream processing applications
Integrate Solr with real-time stream processing applications
lucenerevolution
 
BioSense 2.0
BioSense 2.0BioSense 2.0
BioSense 2.0
Taha Kass-Hout, MD, MS
 
Social Media for the Meta-Leader
Social Media for the Meta-LeaderSocial Media for the Meta-Leader
Social Media for the Meta-Leader
Taha Kass-Hout, MD, MS
 
BioSense Program Going Forward: HIMSS10 Conference
BioSense Program Going Forward: HIMSS10 ConferenceBioSense Program Going Forward: HIMSS10 Conference
BioSense Program Going Forward: HIMSS10 Conference
Taha Kass-Hout, MD, MS
 
Evolve: InSTEDD's Global Early Warning and Response System
Evolve: InSTEDD's Global Early Warning and Response SystemEvolve: InSTEDD's Global Early Warning and Response System
Evolve: InSTEDD's Global Early Warning and Response System
Taha Kass-Hout, MD, MS
 
Public Health Surveillance Through Collaboration
Public Health Surveillance Through CollaborationPublic Health Surveillance Through Collaboration
Public Health Surveillance Through Collaboration
Taha Kass-Hout, MD, MS
 
Big Data in Public Health
Big Data in Public HealthBig Data in Public Health
Big Data in Public Health
Taha Kass-Hout, MD, MS
 
precisionFDA
precisionFDAprecisionFDA
precisionFDA
Taha Kass-Hout, MD, MS
 
Geohash: Integration of Disparate Geospatial Data
Geohash: Integration of Disparate Geospatial DataGeohash: Integration of Disparate Geospatial Data
Geohash: Integration of Disparate Geospatial Data
DataCards
 
Latest Advances in Megapixel Surveillance
Latest Advances in Megapixel SurveillanceLatest Advances in Megapixel Surveillance
Latest Advances in Megapixel Surveillance
Steve Ma
 
GeoSpatially enabling your Spark and Accumulo clusters with LocationTech
GeoSpatially enabling your Spark and Accumulo clusters with LocationTechGeoSpatially enabling your Spark and Accumulo clusters with LocationTech
GeoSpatially enabling your Spark and Accumulo clusters with LocationTech
Rob Emanuele
 
Matchinguu droidcon presentation
Matchinguu droidcon presentationMatchinguu droidcon presentation
Matchinguu droidcon presentation
Droidcon Berlin
 
Riff: A Social Network and Collaborative Platform for Public Health Disease S...
Riff: A Social Network and Collaborative Platform for Public Health Disease S...Riff: A Social Network and Collaborative Platform for Public Health Disease S...
Riff: A Social Network and Collaborative Platform for Public Health Disease S...
Taha Kass-Hout, MD, MS
 
HBase and Accumulo | Washington DC Hadoop User Group
HBase and Accumulo | Washington DC Hadoop User GroupHBase and Accumulo | Washington DC Hadoop User Group
HBase and Accumulo | Washington DC Hadoop User Group
Cloudera, Inc.
 
Integrate Solr with real-time stream processing applications
Integrate Solr with real-time stream processing applicationsIntegrate Solr with real-time stream processing applications
Integrate Solr with real-time stream processing applications
lucenerevolution
 
BioSense Program Going Forward: HIMSS10 Conference
BioSense Program Going Forward: HIMSS10 ConferenceBioSense Program Going Forward: HIMSS10 Conference
BioSense Program Going Forward: HIMSS10 Conference
Taha Kass-Hout, MD, MS
 
Evolve: InSTEDD's Global Early Warning and Response System
Evolve: InSTEDD's Global Early Warning and Response SystemEvolve: InSTEDD's Global Early Warning and Response System
Evolve: InSTEDD's Global Early Warning and Response System
Taha Kass-Hout, MD, MS
 
Public Health Surveillance Through Collaboration
Public Health Surveillance Through CollaborationPublic Health Surveillance Through Collaboration
Public Health Surveillance Through Collaboration
Taha Kass-Hout, MD, MS
 
Geohash: Integration of Disparate Geospatial Data
Geohash: Integration of Disparate Geospatial DataGeohash: Integration of Disparate Geospatial Data
Geohash: Integration of Disparate Geospatial Data
DataCards
 
Latest Advances in Megapixel Surveillance
Latest Advances in Megapixel SurveillanceLatest Advances in Megapixel Surveillance
Latest Advances in Megapixel Surveillance
Steve Ma
 
GeoSpatially enabling your Spark and Accumulo clusters with LocationTech
GeoSpatially enabling your Spark and Accumulo clusters with LocationTechGeoSpatially enabling your Spark and Accumulo clusters with LocationTech
GeoSpatially enabling your Spark and Accumulo clusters with LocationTech
Rob Emanuele
 
Matchinguu droidcon presentation
Matchinguu droidcon presentationMatchinguu droidcon presentation
Matchinguu droidcon presentation
Droidcon Berlin
 
Riff: A Social Network and Collaborative Platform for Public Health Disease S...
Riff: A Social Network and Collaborative Platform for Public Health Disease S...Riff: A Social Network and Collaborative Platform for Public Health Disease S...
Riff: A Social Network and Collaborative Platform for Public Health Disease S...
Taha Kass-Hout, MD, MS
 

Similar to Machine Learning & Graph Processing w/ Spark and Accumulo (20)

Databricks: What We Have Learned by Eating Our Dog Food
Databricks: What We Have Learned by Eating Our Dog FoodDatabricks: What We Have Learned by Eating Our Dog Food
Databricks: What We Have Learned by Eating Our Dog Food
Databricks
 
Scalable Monitoring Using Prometheus with Apache Spark Clusters with Diane F...
 Scalable Monitoring Using Prometheus with Apache Spark Clusters with Diane F... Scalable Monitoring Using Prometheus with Apache Spark Clusters with Diane F...
Scalable Monitoring Using Prometheus with Apache Spark Clusters with Diane F...
Databricks
 
[scala.by] Launching new application fast
[scala.by] Launching new application fast[scala.by] Launching new application fast
[scala.by] Launching new application fast
Denis Karpenko
 
The Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization OpportunitiesThe Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization Opportunities
Databricks
 
OpenCL Kernel Optimization Tips
OpenCL Kernel Optimization TipsOpenCL Kernel Optimization Tips
OpenCL Kernel Optimization Tips
Champ Yen
 
Optimizing spark based data pipelines - are you up for it?
Optimizing spark based data pipelines - are you up for it?Optimizing spark based data pipelines - are you up for it?
Optimizing spark based data pipelines - are you up for it?
Etti Gur
 
Spark Driven Big Data Analytics
Spark Driven Big Data AnalyticsSpark Driven Big Data Analytics
Spark Driven Big Data Analytics
inoshg
 
How a BEAM runner executes a pipeline. Apache BEAM Summit London 2018
How a BEAM runner executes a pipeline. Apache BEAM Summit London 2018How a BEAM runner executes a pipeline. Apache BEAM Summit London 2018
How a BEAM runner executes a pipeline. Apache BEAM Summit London 2018
javier ramirez
 
Benchmarks, performance, scalability, and capacity what's behind the numbers
Benchmarks, performance, scalability, and capacity what's behind the numbersBenchmarks, performance, scalability, and capacity what's behind the numbers
Benchmarks, performance, scalability, and capacity what's behind the numbers
Justin Dorfman
 
Benchmarks, performance, scalability, and capacity what s behind the numbers...
Benchmarks, performance, scalability, and capacity  what s behind the numbers...Benchmarks, performance, scalability, and capacity  what s behind the numbers...
Benchmarks, performance, scalability, and capacity what s behind the numbers...
james tong
 
AWS Big Data Demystified #1.2 | Big Data architecture lessons learned
AWS Big Data Demystified #1.2 | Big Data architecture lessons learned AWS Big Data Demystified #1.2 | Big Data architecture lessons learned
AWS Big Data Demystified #1.2 | Big Data architecture lessons learned
Omid Vahdaty
 
Dfrws eu 2014 rekall workshop
Dfrws eu 2014 rekall workshopDfrws eu 2014 rekall workshop
Dfrws eu 2014 rekall workshop
Tamas K Lengyel
 
Anurag Awasthi - Machine Learning applications for CloudStack
Anurag Awasthi - Machine Learning applications for CloudStackAnurag Awasthi - Machine Learning applications for CloudStack
Anurag Awasthi - Machine Learning applications for CloudStack
ShapeBlue
 
Boosting spark performance: An Overview of Techniques
Boosting spark performance: An Overview of TechniquesBoosting spark performance: An Overview of Techniques
Boosting spark performance: An Overview of Techniques
Ahsan Javed Awan
 
SCALABLE MONITORING USING PROMETHEUS WITH APACHE SPARK
SCALABLE MONITORING USING PROMETHEUS WITH APACHE SPARKSCALABLE MONITORING USING PROMETHEUS WITH APACHE SPARK
SCALABLE MONITORING USING PROMETHEUS WITH APACHE SPARK
zmhassan
 
Big Data in 200 km/h | AWS Big Data Demystified #1.3
Big Data in 200 km/h | AWS Big Data Demystified #1.3  Big Data in 200 km/h | AWS Big Data Demystified #1.3
Big Data in 200 km/h | AWS Big Data Demystified #1.3
Omid Vahdaty
 
Boosting big data with apache spark
Boosting big data with apache sparkBoosting big data with apache spark
Boosting big data with apache spark
InfoFarm
 
Software Design Practices for Large-Scale Automation
Software Design Practices for Large-Scale AutomationSoftware Design Practices for Large-Scale Automation
Software Design Practices for Large-Scale Automation
Hao Xu
 
Apache Spark MLlib - Random Foreset and Desicion Trees
Apache Spark MLlib - Random Foreset and Desicion TreesApache Spark MLlib - Random Foreset and Desicion Trees
Apache Spark MLlib - Random Foreset and Desicion Trees
Tuhin Mahmud
 
Data Analysis with TensorFlow in PostgreSQL
Data Analysis with TensorFlow in PostgreSQLData Analysis with TensorFlow in PostgreSQL
Data Analysis with TensorFlow in PostgreSQL
EDB
 
Databricks: What We Have Learned by Eating Our Dog Food
Databricks: What We Have Learned by Eating Our Dog FoodDatabricks: What We Have Learned by Eating Our Dog Food
Databricks: What We Have Learned by Eating Our Dog Food
Databricks
 
Scalable Monitoring Using Prometheus with Apache Spark Clusters with Diane F...
 Scalable Monitoring Using Prometheus with Apache Spark Clusters with Diane F... Scalable Monitoring Using Prometheus with Apache Spark Clusters with Diane F...
Scalable Monitoring Using Prometheus with Apache Spark Clusters with Diane F...
Databricks
 
[scala.by] Launching new application fast
[scala.by] Launching new application fast[scala.by] Launching new application fast
[scala.by] Launching new application fast
Denis Karpenko
 
The Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization OpportunitiesThe Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization Opportunities
Databricks
 
OpenCL Kernel Optimization Tips
OpenCL Kernel Optimization TipsOpenCL Kernel Optimization Tips
OpenCL Kernel Optimization Tips
Champ Yen
 
Optimizing spark based data pipelines - are you up for it?
Optimizing spark based data pipelines - are you up for it?Optimizing spark based data pipelines - are you up for it?
Optimizing spark based data pipelines - are you up for it?
Etti Gur
 
Spark Driven Big Data Analytics
Spark Driven Big Data AnalyticsSpark Driven Big Data Analytics
Spark Driven Big Data Analytics
inoshg
 
How a BEAM runner executes a pipeline. Apache BEAM Summit London 2018
How a BEAM runner executes a pipeline. Apache BEAM Summit London 2018How a BEAM runner executes a pipeline. Apache BEAM Summit London 2018
How a BEAM runner executes a pipeline. Apache BEAM Summit London 2018
javier ramirez
 
Benchmarks, performance, scalability, and capacity what's behind the numbers
Benchmarks, performance, scalability, and capacity what's behind the numbersBenchmarks, performance, scalability, and capacity what's behind the numbers
Benchmarks, performance, scalability, and capacity what's behind the numbers
Justin Dorfman
 
Benchmarks, performance, scalability, and capacity what s behind the numbers...
Benchmarks, performance, scalability, and capacity  what s behind the numbers...Benchmarks, performance, scalability, and capacity  what s behind the numbers...
Benchmarks, performance, scalability, and capacity what s behind the numbers...
james tong
 
AWS Big Data Demystified #1.2 | Big Data architecture lessons learned
AWS Big Data Demystified #1.2 | Big Data architecture lessons learned AWS Big Data Demystified #1.2 | Big Data architecture lessons learned
AWS Big Data Demystified #1.2 | Big Data architecture lessons learned
Omid Vahdaty
 
Dfrws eu 2014 rekall workshop
Dfrws eu 2014 rekall workshopDfrws eu 2014 rekall workshop
Dfrws eu 2014 rekall workshop
Tamas K Lengyel
 
Anurag Awasthi - Machine Learning applications for CloudStack
Anurag Awasthi - Machine Learning applications for CloudStackAnurag Awasthi - Machine Learning applications for CloudStack
Anurag Awasthi - Machine Learning applications for CloudStack
ShapeBlue
 
Boosting spark performance: An Overview of Techniques
Boosting spark performance: An Overview of TechniquesBoosting spark performance: An Overview of Techniques
Boosting spark performance: An Overview of Techniques
Ahsan Javed Awan
 
SCALABLE MONITORING USING PROMETHEUS WITH APACHE SPARK
SCALABLE MONITORING USING PROMETHEUS WITH APACHE SPARKSCALABLE MONITORING USING PROMETHEUS WITH APACHE SPARK
SCALABLE MONITORING USING PROMETHEUS WITH APACHE SPARK
zmhassan
 
Big Data in 200 km/h | AWS Big Data Demystified #1.3
Big Data in 200 km/h | AWS Big Data Demystified #1.3  Big Data in 200 km/h | AWS Big Data Demystified #1.3
Big Data in 200 km/h | AWS Big Data Demystified #1.3
Omid Vahdaty
 
Boosting big data with apache spark
Boosting big data with apache sparkBoosting big data with apache spark
Boosting big data with apache spark
InfoFarm
 
Software Design Practices for Large-Scale Automation
Software Design Practices for Large-Scale AutomationSoftware Design Practices for Large-Scale Automation
Software Design Practices for Large-Scale Automation
Hao Xu
 
Apache Spark MLlib - Random Foreset and Desicion Trees
Apache Spark MLlib - Random Foreset and Desicion TreesApache Spark MLlib - Random Foreset and Desicion Trees
Apache Spark MLlib - Random Foreset and Desicion Trees
Tuhin Mahmud
 
Data Analysis with TensorFlow in PostgreSQL
Data Analysis with TensorFlow in PostgreSQLData Analysis with TensorFlow in PostgreSQL
Data Analysis with TensorFlow in PostgreSQL
EDB
 
Ad

More from Rahul Singh (15)

Unifying Business Information with Dashboards
Unifying Business Information with Dashboards Unifying Business Information with Dashboards
Unifying Business Information with Dashboards
Rahul Singh
 
Get Your Shit Together
Get Your Shit TogetherGet Your Shit Together
Get Your Shit Together
Rahul Singh
 
Building Online Business Software 101 (B2B)
Building Online Business Software 101 (B2B) Building Online Business Software 101 (B2B)
Building Online Business Software 101 (B2B)
Rahul Singh
 
Asynchronous Data Processing
Asynchronous Data ProcessingAsynchronous Data Processing
Asynchronous Data Processing
Rahul Singh
 
Building Enterprise Search Engines using Open Source Technologies
Building Enterprise Search Engines using Open Source TechnologiesBuilding Enterprise Search Engines using Open Source Technologies
Building Enterprise Search Engines using Open Source Technologies
Rahul Singh
 
Deliver Excellent Service to your Customers
Deliver Excellent Service to your CustomersDeliver Excellent Service to your Customers
Deliver Excellent Service to your Customers
Rahul Singh
 
Building Search Engines - Lucene, SolR and Elasticsearch
Building Search Engines - Lucene, SolR and ElasticsearchBuilding Search Engines - Lucene, SolR and Elasticsearch
Building Search Engines - Lucene, SolR and Elasticsearch
Rahul Singh
 
Building Smart Indexes for Drupal Sites
Building Smart Indexes for Drupal SitesBuilding Smart Indexes for Drupal Sites
Building Smart Indexes for Drupal Sites
Rahul Singh
 
Building People First - Lessons in Team Effectiveness & Happiness
Building People First - Lessons in Team Effectiveness & HappinessBuilding People First - Lessons in Team Effectiveness & Happiness
Building People First - Lessons in Team Effectiveness & Happiness
Rahul Singh
 
Select * From Internet - Integrating the Web
Select * From Internet - Integrating the WebSelect * From Internet - Integrating the Web
Select * From Internet - Integrating the Web
Rahul Singh
 
Bill Drayton - Father of Social Entrepreneurship, Leading Leader of Social Ch...
Bill Drayton - Father of Social Entrepreneurship, Leading Leader of Social Ch...Bill Drayton - Father of Social Entrepreneurship, Leading Leader of Social Ch...
Bill Drayton - Father of Social Entrepreneurship, Leading Leader of Social Ch...
Rahul Singh
 
The Future of the Internet - The Next 30 Years
The Future of the Internet - The Next 30 YearsThe Future of the Internet - The Next 30 Years
The Future of the Internet - The Next 30 Years
Rahul Singh
 
Modern Presidential Communications - Communicating Presidential Rhetorical Vi...
Modern Presidential Communications - Communicating Presidential Rhetorical Vi...Modern Presidential Communications - Communicating Presidential Rhetorical Vi...
Modern Presidential Communications - Communicating Presidential Rhetorical Vi...
Rahul Singh
 
Rahul.singh.speech presentation
Rahul.singh.speech presentationRahul.singh.speech presentation
Rahul.singh.speech presentation
Rahul Singh
 
Anant - Micro Enterprise - The Future, Today
Anant - Micro Enterprise - The Future, TodayAnant - Micro Enterprise - The Future, Today
Anant - Micro Enterprise - The Future, Today
Rahul Singh
 
Unifying Business Information with Dashboards
Unifying Business Information with Dashboards Unifying Business Information with Dashboards
Unifying Business Information with Dashboards
Rahul Singh
 
Get Your Shit Together
Get Your Shit TogetherGet Your Shit Together
Get Your Shit Together
Rahul Singh
 
Building Online Business Software 101 (B2B)
Building Online Business Software 101 (B2B) Building Online Business Software 101 (B2B)
Machine Learning & Graph Processing w/ Spark and Accumulo