Polyglot Persistence
Big Data in the Cloud


Andrei Savu / andrei.savu@cloudsoftcorp.com
Overview

• Introduction
• Databases
• Search
• Processing
• Deployment
Polyglot Persistence

“Polyglot Persistence, like polyglot
programming, is all about choosing the right
persistence option for the task at hand”

http://www.nearinfinity.com/blogs/scott_leberknight/polyglot_persistence.html
http://martinfowler.com/bliki/PolyglotPersistence.html
Polyglot Persistence & Big Data in the Cloud
It all started from ...
a set of papers released by Google & Amazon
• Google File System (2003)
  http://research.google.com/archive/gfs.html

• Google MapReduce (2004)
  http://research.google.com/archive/mapreduce.html

• Google BigTable (2006)
  http://research.google.com/archive/bigtable.html

• Amazon Dynamo (2007)
  http://s3.amazonaws.com/AllThingsDistributed/sosp/amazon-dynamo-sosp2007.pdf
Databases
Apache HBase
• Java
• designed to be able to store massive amounts of data
• speaks HTTP / REST, Thrift, Avro
• based on Google BigTable
• persistence through HDFS (Hadoop)
• Map/Reduce with Hadoop
• designed for real time workloads
• https://hbase.apache.org/
Apache Cassandra
• Java
• inspired by Google BigTable and Amazon Dynamo
• tunable trade-offs
• query by column and range of keys
• really fast writes
• excellent for a large number of high speed counters
• Map/Reduce possible with Hadoop
• http://cassandra.apache.org/
MongoDB
• C++
• document database (BSON) with rich indexing
• master / slave replication
• built-in sharding
• auto failover with replica sets
• map/reduce with JavaScript
• server side JavaScript
• journaling
• fast in-place updates
• http://www.mongodb.org/
Apache CouchDB
• Erlang
• document database (JSON)
• bi-directional replication
• advanced conflict resolution
• MVCC - writes do not block reads
• exposes a stream of realtime updates
• needs compacting
• indexing via views (JS)
• attachment handling
• https://couchdb.apache.org/
Riak (Basho)
• Erlang, C, JavaScript
• key, value store
• focus on fault tolerance and cross datacenter replication
• speaks HTTP/REST or a custom binary protocol
• tunable trade-offs (N, R, W)
• MapReduce in JS or Erlang
• full-text indexing with Riak Search
• http://wiki.basho.com/
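The N, R, W trade-off listed above can be illustrated with the usual quorum arithmetic: with N replicas per key, a read that waits for R replies is guaranteed to see the latest acknowledged write (acknowledged by W replicas) whenever R + W > N. The sketch below is plain Python rather than Riak client code. It assumes strict quorums, and the function names are hypothetical.

```python
def check_nrw(n: int, r: int, w: int) -> None:
    """Reject settings where R or W falls outside 1..N."""
    if not (1 <= r <= n and 1 <= w <= n):
        raise ValueError("R and W must be between 1 and N")


def quorums_overlap(n: int, r: int, w: int) -> bool:
    """True when every read quorum shares at least one replica with every write quorum."""
    check_nrw(n, r, w)
    return r + w > n


def tolerated_failures(n: int, r: int, w: int) -> dict:
    """How many replicas may be unavailable while reads / writes can still succeed."""
    check_nrw(n, r, w)
    return {"reads": n - r, "writes": n - w}


if __name__ == "__main__":
    # Compare a few settings for N = 3 replicas.
    for r, w in [(1, 1), (2, 2), (1, 3), (3, 1)]:
        print(f"N=3 R={r} W={w}",
              "overlap" if quorums_overlap(3, r, w) else "no overlap",
              tolerated_failures(3, r, w))
```

Lower R and W favour latency and availability; higher values favour consistency. That is the trade-off being tuned.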
Neo4j
• Java
• graph database
• speaks HTTP/REST
• standalone or embeddable in Java apps
• full ACID
• web admin interface
• nodes & relationships can have metadata
• indexing
• http://neo4j.org/
Redis
• C
• disk-backed data structure server
• master-slave replication
• supports: strings, lists, sets, hashes, sorted sets
• batch operations
• values can be expired
• Pub/Sub for messaging
• ideal for rapidly changing data that fits in memory
• http://redis.io/
Search
elasticsearch
• Java
• based on Apache Lucene
• distributed by design
• cloud aware (Amazon)
• understands JSON objects
• no schema required
• simple multi-tenancy
• real-time search
• scale to 100s of machines
• http://www.elasticsearch.org/
Apache SolrCloud
• Java
• based on Apache Lucene (share the same repo)
• adds distributed capabilities to Solr
• based on ZooKeeper for coordination & config
• automatic management of multiple shards
• automatic fail-over
• durable writes
• https://wiki.apache.org/solr/SolrCloud
Processing
Apache Hadoop
• Java, C/C++
• set of distributed systems (HDFS, MapReduce etc.)
• framework for distributed data processing
• simple programming model (map / reduce)
• can scale to 1000s of machines
• designed to be highly available at the application level
• https://hadoop.apache.org/
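The map / reduce programming model named above can be sketched in a few lines of single-process Python. This is only an illustration of the model (map each record to key/value pairs, group by key, reduce each group). It does not use the Hadoop API, and `map_words`, `reduce_counts` and `run_job` are hypothetical names.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, Iterator, List, Tuple


def map_words(line: str) -> Iterator[Tuple[str, int]]:
    """Map step: emit (word, 1) for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def reduce_counts(word: str, counts: List[int]) -> Tuple[str, int]:
    """Reduce step: sum all values emitted for one key."""
    return word, sum(counts)


def run_job(records: Iterable[str],
            mapper: Callable[[str], Iterable[Tuple[str, int]]],
            reducer: Callable[[str, List[int]], Tuple[str, int]]) -> Dict[str, int]:
    """Run map, group-by-key (the 'shuffle'), then reduce, all in memory."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return dict(reducer(key, values) for key, values in sorted(groups.items()))


if __name__ == "__main__":
    print(run_job(["big data", "Big cloud"], map_words, reduce_counts))
```

In a real cluster the map and reduce calls run on different machines and the grouping step moves data across the network. The user-supplied functions keep the same shape.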
Hadoop Ecosystem
• HDFS (Storage)
• MapReduce (Processing)
• Hive, Pig (high level languages)
• HBase (database)
• ZooKeeper (coordination)
• Oozie (workflow)
• Mahout (machine learning)
• Flume (log streaming)
• Sqoop (data import)
• Whirr (deployment)
Deployment
on Cloud Infrastructure (using jclouds)
Apache Whirr
        https://whirr.apache.org/

 * disclaimer: I am a member of the PMC
First Steps
• Download
  $ curl -O http://www.apache.org/dist/whirr/whirr-0.7.1/whirr-0.7.1.tar.gz
  $ tar zxf whirr-0.7.1.tar.gz; cd whirr-0.7.1

• Use
  # export credentials
  $ bin/whirr launch-cluster --config ...
  $ bin/whirr destroy-cluster --config ...

  https://whirr.apache.org/docs/latest/whirr-in-5-minutes.html
Deploy Hadoop

whirr.instance-templates=1 hadoop-namenode+hadoop-jobtracker,\
  10 hadoop-datanode+hadoop-tasktracker

https://whirr.apache.org/docs/0.7.1/quick-start-guide.html
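The value of `whirr.instance-templates` follows a simple pattern: comma-separated groups, each made of an instance count followed by the roles to co-locate on those instances, joined with `+`. The sketch below reads that pattern in Python to make the structure explicit. It is not Whirr's own parser, and `parse_instance_templates` and `total_instances` are hypothetical helper names.

```python
from typing import List, Tuple

Template = Tuple[int, List[str]]


def parse_instance_templates(value: str) -> List[Template]:
    """Split '1 role-a+role-b,10 role-c' into [(1, [role-a, role-b]), (10, [role-c])]."""
    templates: List[Template] = []
    for group in value.split(","):
        group = group.strip()
        if not group:
            continue  # tolerate a trailing comma
        count_text, roles_text = group.split(None, 1)
        roles = [role.strip() for role in roles_text.split("+")]
        templates.append((int(count_text), roles))
    return templates


def total_instances(templates: List[Template]) -> int:
    """Total number of machines the templates ask for."""
    return sum(count for count, _ in templates)


if __name__ == "__main__":
    value = ("1 hadoop-namenode+hadoop-jobtracker,"
             "10 hadoop-datanode+hadoop-tasktracker")
    for count, roles in parse_instance_templates(value):
        print(count, "x", " + ".join(roles))
    print("machines:", total_instances(parse_instance_templates(value)))
```

Read this way, the Hadoop template asks for 11 machines: one master running both the namenode and jobtracker roles, and ten workers each running a datanode and a tasktracker.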
With Mahout

whirr.instance-templates=1 hadoop-namenode+hadoop-jobtracker+mahout-client,\
  10 hadoop-datanode+hadoop-tasktracker
Or with HBase

whirr.instance-templates=1 hadoop-namenode+hadoop-jobtracker+hbase-master+zookeeper,\
  10 hadoop-datanode+hadoop-tasktracker+hbase-regionserver
Or Cassandra


whirr.instance-templates=10 cassandra
And elasticsearch


whirr.instance-templates=10 elasticsearch
Thanks!
andrei.savu@cloudsoftcorp.com
Transcript: #StandardsGoals for 2025: Standards & certification roundup - Tec...Transcript: #StandardsGoals for 2025: Standards & certification roundup - Tec...
Transcript: #StandardsGoals for 2025: Standards & certification roundup - Tec...
BookNet Canada
 
Manifest Pre-Seed Update | A Humanoid OEM Deeptech In France
Manifest Pre-Seed Update | A Humanoid OEM Deeptech In FranceManifest Pre-Seed Update | A Humanoid OEM Deeptech In France
Manifest Pre-Seed Update | A Humanoid OEM Deeptech In France
chb3
 
TrsLabs - Fintech Product & Business Consulting
TrsLabs - Fintech Product & Business ConsultingTrsLabs - Fintech Product & Business Consulting
TrsLabs - Fintech Product & Business Consulting
Trs Labs
 
ThousandEyes Partner Innovation Updates for May 2025
ThousandEyes Partner Innovation Updates for May 2025ThousandEyes Partner Innovation Updates for May 2025
ThousandEyes Partner Innovation Updates for May 2025
ThousandEyes
 
Mobile App Development Company in Saudi Arabia
Mobile App Development Company in Saudi ArabiaMobile App Development Company in Saudi Arabia
Mobile App Development Company in Saudi Arabia
Steve Jonas
 
Enhancing ICU Intelligence: How Our Functional Testing Enabled a Healthcare I...
Enhancing ICU Intelligence: How Our Functional Testing Enabled a Healthcare I...Enhancing ICU Intelligence: How Our Functional Testing Enabled a Healthcare I...
Enhancing ICU Intelligence: How Our Functional Testing Enabled a Healthcare I...
Impelsys Inc.
 
UiPath Community Berlin: Orchestrator API, Swagger, and Test Manager API
UiPath Community Berlin: Orchestrator API, Swagger, and Test Manager APIUiPath Community Berlin: Orchestrator API, Swagger, and Test Manager API
UiPath Community Berlin: Orchestrator API, Swagger, and Test Manager API
UiPathCommunity
 
AI and Data Privacy in 2025: Global Trends
AI and Data Privacy in 2025: Global TrendsAI and Data Privacy in 2025: Global Trends
AI and Data Privacy in 2025: Global Trends
InData Labs
 
#StandardsGoals for 2025: Standards & certification roundup - Tech Forum 2025
#StandardsGoals for 2025: Standards & certification roundup - Tech Forum 2025#StandardsGoals for 2025: Standards & certification roundup - Tech Forum 2025
#StandardsGoals for 2025: Standards & certification roundup - Tech Forum 2025
BookNet Canada
 
Massive Power Outage Hits Spain, Portugal, and France: Causes, Impact, and On...
Massive Power Outage Hits Spain, Portugal, and France: Causes, Impact, and On...Massive Power Outage Hits Spain, Portugal, and France: Causes, Impact, and On...
Massive Power Outage Hits Spain, Portugal, and France: Causes, Impact, and On...
Aqusag Technologies
 
Dev Dives: Automate and orchestrate your processes with UiPath Maestro
Dev Dives: Automate and orchestrate your processes with UiPath MaestroDev Dives: Automate and orchestrate your processes with UiPath Maestro
Dev Dives: Automate and orchestrate your processes with UiPath Maestro
UiPathCommunity
 
AI EngineHost Review: Revolutionary USA Datacenter-Based Hosting with NVIDIA ...
AI EngineHost Review: Revolutionary USA Datacenter-Based Hosting with NVIDIA ...AI EngineHost Review: Revolutionary USA Datacenter-Based Hosting with NVIDIA ...
AI EngineHost Review: Revolutionary USA Datacenter-Based Hosting with NVIDIA ...
SOFTTECHHUB
 
Designing Low-Latency Systems with Rust and ScyllaDB: An Architectural Deep Dive
Designing Low-Latency Systems with Rust and ScyllaDB: An Architectural Deep DiveDesigning Low-Latency Systems with Rust and ScyllaDB: An Architectural Deep Dive
Designing Low-Latency Systems with Rust and ScyllaDB: An Architectural Deep Dive
ScyllaDB
 
Semantic Cultivators : The Critical Future Role to Enable AI
Semantic Cultivators : The Critical Future Role to Enable AISemantic Cultivators : The Critical Future Role to Enable AI
Semantic Cultivators : The Critical Future Role to Enable AI
artmondano
 
Cybersecurity Identity and Access Solutions using Azure AD
Cybersecurity Identity and Access Solutions using Azure ADCybersecurity Identity and Access Solutions using Azure AD
Cybersecurity Identity and Access Solutions using Azure AD
VICTOR MAESTRE RAMIREZ
 
Increasing Retail Store Efficiency How can Planograms Save Time and Money.pptx
Increasing Retail Store Efficiency How can Planograms Save Time and Money.pptxIncreasing Retail Store Efficiency How can Planograms Save Time and Money.pptx
Increasing Retail Store Efficiency How can Planograms Save Time and Money.pptx
Anoop Ashok
 

Polyglot Persistence & Big Data in the Cloud

  • 2. Overview • Introduction • Databases • Search • Processing • Deployment
  • 3. Polyglot Persistence “Polyglot Persistence, like polyglot programming, is all about choosing the right persistence option for the task at hand” http://www.nearinfinity.com/blogs/scott_leberknight/polyglot_persistence.html http://martinfowler.com/bliki/PolyglotPersistence.html
  • 5. It all started from ... a set of papers released by Google & Amazon
  • 6. • Google Filesystem (2003) http://research.google.com/archive/gfs.html • Google MapReduce (2004) http://research.google.com/archive/mapreduce.html • Google BigTable (2006) http://research.google.com/archive/bigtable.html • Amazon Dynamo (2007) http://s3.amazonaws.com/AllThingsDistributed/sosp/amazon-dynamo-sosp2007.pdf
  • 8. Apache HBase • Java • designed to be able to store massive amounts of data • speaks HTTP / REST, Thrift, Avro • based on Google BigTable • persistence through HDFS (Hadoop) • Map/Reduce with Hadoop • designed for real time workloads • https://hbase.apache.org/
  • 9. Apache Cassandra • Java • inspired by Google BigTable and Amazon Dynamo • tunable trade-offs • query by column and range of keys • really fast writes • excellent for a large number of high speed counters • Map/Reduce possible with Hadoop • http://cassandra.apache.org/
  • 10. MongoDB • C++ • document database (bson) with rich indexing • master / slave replication • built-in sharding • auto failover with replica sets • map/reduce with javascript • server side javascript • journaling • fast in-place updates • http://www.mongodb.org/
  • 11. Apache CouchDB • Erlang • document database (json) • bi-directional replication • advanced conflict resolution • MVCC - writes do not block reads • exposes a stream of realtime updates • needs compacting • indexing via views (JS) • attachment handling • https://couchdb.apache.org/
  • 12. Riak (Basho) • Erlang, C, Javascript • key, value store • focus on fault tolerance and cross datacenter replication • speaks HTTP/REST or custom binary • tunable trade-offs (N, R, W) • mapreduce in JS or Erlang • full-text indexing with riak search • http://wiki.basho.com/
  • 13. Neo4j • Java • graph database • speaks HTTP/REST • standalone or embeddable in Java apps • full ACID • web admin interface • nodes & relationships can have metadata • indexing • http://neo4j.org/
  • 14. Redis • C/C++ • disk-backed data structure server • master-slave replication • supports: strings, lists, sets, hashes, sorted sets • batch operations • values can be expired • Pub/Sub for messaging • ideal for rapidly changing data that fits in memory • http://redis.io/
  • 16. elasticsearch • Java • based on Apache Lucene • distributed by design • cloud aware (Amazon) • understands JSON objects • no schema required • simple multi-tenancy • real-time search • scale to 100s of machines • http://www.elasticsearch.org/
  • 17. Apache SolrCloud • Java • based on Apache Lucene (share the same repo) • adds distributed capabilities to Solr • based on ZooKeeper for coordination & config • automatic management of multiple shards • automatic fail-over • durable writes • https://wiki.apache.org/solr/SolrCloud
  • 19. Apache Hadoop • Java, C/C++ • set of distributed systems (hdfs, mr etc.) • framework for distributed data processing • simple programming model (map / reduce) • can scale to 1000s of machines • designed to be highly available at the application level • https://hadoop.apache.org/
  • 20. Hadoop Ecosystem • HDFS (Storage) • MapReduce (Processing) • Hive, Pig (high level languages) • HBase (database) • ZooKeeper (coordination) • Oozie (workflow) • Mahout (machine learning) • Flume (log streaming) • Sqoop (data import) • Whirr (deployment)
  • 22. Apache Whirr https://whirr.apache.org/ * disclaimer: I am a member of the PMC
  • 23. First Steps • Download $ curl -O http://www.apache.org/dist/whirr/whirr-0.7.1/whirr-0.7.1.tar.gz $ tar zxf whirr-0.7.1.tar.gz; cd whirr-0.7.1 • Use # export credentials $ bin/whirr launch-cluster --config ... $ bin/whirr destroy-cluster --config ... https://whirr.apache.org/docs/latest/whirr-in-5-minutes.html
  • 24. Deploy Hadoop whirr.instance-templates= 1 hadoop-namenode+hadoop-jobtracker, 10 hadoop-datanode+hadoop-tasktracker https://whirr.apache.org/docs/0.7.1/quick-start-guide.html
  • 25. With Mahout whirr.instance-templates= 1 hadoop-namenode+hadoop-jobtracker+mahout-client, 10 hadoop-datanode+hadoop-tasktracker
  • 26. Or with HBase whirr.instance-templates= 1 hadoop-namenode+hadoop-jobtracker+hbase-master+zookeeper, 10 hadoop-datanode+hadoop-tasktracker+hbase-regionserver
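
The `whirr.instance-templates` values on the Deploy Hadoop, With Mahout and Or with HBase slides share one shape: comma-separated groups, each a machine count followed by the roles to install on those machines, joined with `+`. The sketch below only illustrates how to read such a value. `parse_instance_templates` is a hypothetical helper written for this example, not part of Whirr, and it assumes the value is well formed.

```python
from typing import List, Tuple


def parse_instance_templates(value: str) -> List[Tuple[int, List[str]]]:
    """Split an instance-templates value into (count, roles) pairs.

    Hypothetical helper for illustration only; Whirr does its own parsing.
    """
    templates = []
    for group in value.split(","):
        # Each group looks like "<count> <role>+<role>+..."
        count, roles = group.strip().split(None, 1)
        role_names = [role.strip() for role in roles.split("+") if role.strip()]
        templates.append((int(count), role_names))
    return templates


# The value from the "Or with HBase" slide.
hbase_cluster = (
    "1 hadoop-namenode+hadoop-jobtracker+hbase-master+zookeeper, "
    "10 hadoop-datanode+hadoop-tasktracker+hbase-regionserver"
)

templates = parse_instance_templates(hbase_cluster)
total_instances = sum(count for count, _ in templates)

for count, roles in templates:
    print(count, "x", ", ".join(roles))
print("total instances:", total_instances)
```

Read this way, the HBase layout is one master machine that also runs ZooKeeper, plus ten workers that each combine an HDFS datanode, a MapReduce tasktracker and an HBase regionserver.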

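
The "simple programming model (map / reduce)" mentioned on the Apache Hadoop slide can be sketched in a few lines of plain Python. This is a single-process illustration of the idea, not Hadoop's API: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the sample documents are assumptions made for this example.

```python
from collections import defaultdict


def map_phase(document):
    """Emit a (word, 1) pair for every word in one input document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group emitted values by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Collapse all values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


counts = word_count(["big data in the cloud", "data in hbase"])
print(counts)
```

In a real cluster the map and reduce calls run on many machines and the shuffle moves data over the network. The programmer still writes only the two small functions, which is what keeps the model simple.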
Editor's Notes