Data Lake and the rise of the Microservices
About Me
• IT Operations Manager @BigStepInc
• Tech Support 2007 — 2008
• Systems Administrator 2008 — late 2014 (from Junior to Senior)
• IT Operations Manager late 2014 — Present
• Passionate about improvement and systems in general
• Totally dislike repetitive tasks
@mboeru
@bigstepinc
marius@bigstep.com
About Bigstep
• High performance, bare metal cloud purpose built for big data
• Automatically deployed (managed and unmanaged) big data software stacks
• HDFS as a Service Offering
• Managed Docker platform (coming soon)
• Spark clusters as a service (coming soon)
• Purely on-demand: bare metal instances get deployed in 2 minutes, can be deleted anytime
• Locally attached drives support
• SDN controlled Layer 2 networking (40Gbps per instance, cut through)
• Distributed SSD based storage fabric
Big Data technologies for mainstream and vice-versa
• Due to the cap on CPU frequency, horizontal scaling is the only dimension left to grow into.
• Client-server architecture is outdated.
• All components of an application must be as independent as possible and as scalable as possible.
• Big data technologies increasingly used in general purpose applications
• In-memory technologies are orders of magnitude faster than the others.
• Docker promotes and simplifies large scale application management using low-overhead
containers instead of VMs
• Mesos used with Docker and some additional services creates a Distributed OS
Source: Tori Randall, Ph.D. prepares a 550-year old Peruvian child mummy for a CT scan
Data as artefacts
• Just like archeological artefacts, old data can yield new insights if correlated in a novel way or analysed with a new technology.
• Throwing away data because it is of no use today might cripple the business tomorrow.
The Data Lake - A paradigm, not a technology
• Store unstructured data in its original format
• Store structured data along with the structure (schema) so it can be distributed onto multiple
machines
• Ingest massive amounts of data - go to petabyte scale if needed
• Stream in or batch import data from any source
• Perform new, deeper analytics by focusing on correlations between diverse data sources:
clickstream, social media, machine data, documents, audio/video, etc.
• Store anonymised data (keep IDs and not names or other personally identifiable information)
• Promote data exploration
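The anonymisation bullet above can be sketched in a few lines. This is a minimal sketch, assuming records arrive as plain dictionaries; the field names (`user_id`, `name`, `email`, `phone`, `page`) are hypothetical and not taken from the slides. The ID is kept so that records from different sources can still be correlated, while the personally identifiable fields are dropped before the record lands in the lake.

```python
from typing import Any, Dict, Iterable

# Hypothetical field names; a real pipeline would take these from its own schema.
PII_FIELDS = frozenset({"name", "email", "phone"})


def anonymise(record: Dict[str, Any], pii_fields: Iterable[str] = PII_FIELDS) -> Dict[str, Any]:
    """Return a copy of the record without personally identifiable fields.

    The identifier is kept so that records can still be correlated across sources.
    """
    blocked = set(pii_fields)
    return {key: value for key, value in record.items() if key not in blocked}


event = {"user_id": 42, "name": "Jane Doe", "email": "jane@example.com", "page": "/pricing"}
clean = anonymise(event)
```

The original record is left untouched, so the same function can sit in front of both streaming and batch imports.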
Data Services
• A data service provides data to other services
[Diagram: example data services (Timetable service, Service center load predictor, Driver's path optimiser) running on service clusters, each backed by a Datalake]
Data Services - It’s about the teams and not the technology
• Conway's law: “[…] organizations which design systems ... are constrained to produce designs
which are copies of the communication structures of these organizations”
[Diagram: one Application split into View, Controller and Model; UX specialists in charge of the View, Backend specialists in charge of the Controller, DB specialists in charge of the Model; poor communication between the groups]
Per data microservice teams
• Data teams are independent
• A data service has its own release cycle
• Ultra-specialisation is reduced
• Communication among members of the same team is better
[Diagram: three teams, each owning a group of Apps behind its own API; better communication inside each team]
Monolith vs. Microservices approach
[Diagram: Apps placed on Servers under the Monolithic approach versus the Microservices approach]
Polyglot persistence
• The data does not have to reside in the same place (e.g.: same HDFS cluster)
• But it has to be always available for any team, microservice, or data application authorized to
use it
[Diagram: a Single DB (with a slave) holding pieces 1 to 4, contrasted with separate, replicated databases DB 1 to DB 4]
Microservices-oriented architecture
• Components, not layers.
• Each component can scale horizontally and is masterless
• Each component can be unit tested independently
• Each component can be deployed independently to production
• Multiple versions of the same component can coexist for a short amount of time
• Use APIs to integrate components as opposed to direct method calls
• Use natively backward compatible API designs and implementations
• Use distributed locking (e.g.: Zookeeper) instead of file based locking
• Use queuing with evolving schemas (e.g.: Kafka with Avro serialiser) instead of blocking calls
• Use distributed databases (e.g.: Couchbase) instead of master-slave oriented ones. Avoid
immutable schemas.
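The queuing and schema-evolution points above can be illustrated with a small sketch. It is only an illustration under stated assumptions: an in-process `queue.Queue` stands in for a broker such as Kafka, and a dictionary of reader-side defaults stands in for the default values an Avro schema would declare. The field names (`order_id`, `amount`, `currency`) are hypothetical.

```python
import queue
import threading
from typing import Any, Dict, List

# Reader-side defaults for fields added in later message versions (hypothetical fields).
FIELD_DEFAULTS: Dict[str, Any] = {"currency": "EUR"}


def decode(message: Dict[str, Any]) -> Dict[str, Any]:
    """Fill in defaults so old and new message versions decode to the same shape."""
    return {**FIELD_DEFAULTS, **message}


def consume(inbox: "queue.Queue", results: List[Dict[str, Any]]) -> None:
    """Read messages until the None sentinel arrives."""
    while True:
        message = inbox.get()
        if message is None:
            break
        results.append(decode(message))


inbox: "queue.Queue" = queue.Queue()
results: List[Dict[str, Any]] = []
worker = threading.Thread(target=consume, args=(inbox, results))
worker.start()

# The producer only enqueues; it never waits for the consumer to finish its work.
inbox.put({"order_id": 1, "amount": 10})                    # written by an old producer
inbox.put({"order_id": 2, "amount": 5, "currency": "USD"})  # written by a new producer
inbox.put(None)
worker.join()
```

Because the reader supplies a default for the newer `currency` field, old and new producer versions can coexist for a while, which is what makes independent deployment of components practical.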
Docker
• A Docker container is neither a VM nor a VPS
• Application level virtualisation
• Same kernel
• No performance overhead
• Instant deployment
• Usually a single app per container
• Uses libcontainer (previously used LXC) engine (network namespaces and cgroups)
• Git-like deployment method with branches and repositories.
[Diagram: four Containers sharing one Kernel, each with a LAN vNIC and a WAN vNIC]
Docker Persistency
• Docker is designed for services that do not need persistency, but it does support it
• By default, every container gets a unique clone of the filesystem in the image
• All changes to this clone are stored in a per-container directory that does not get
garbage collected
• A new container gets a new tree, so re-creating a container without an explicit mapping
appears to have destroyed the data.
• Docker achieves persistency by mapping directories from the host machine to the container.
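The host-to-container mapping described above corresponds to the `-v host_dir:container_dir` option of `docker run`. The sketch below only builds the command line, so it can be inspected without Docker installed. The image name and both paths are hypothetical examples, not values from the slides.

```python
from typing import List


def docker_run_command(image: str, host_dir: str, container_dir: str) -> List[str]:
    """Build a `docker run` command that maps a host directory into the container.

    Data written under container_dir lands in host_dir and therefore survives
    the container being removed and re-created.
    """
    return ["docker", "run", "-d", "-v", f"{host_dir}:{container_dir}", image]


# Hypothetical image name and paths, for illustration only.
command = docker_run_command("cassandra", "/data/cassandra", "/var/lib/cassandra")
```

On a host with Docker installed, the list could be handed to `subprocess.run(command, check=True)`.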
Mesos & Marathon
• Allows an app’s environment to be software
defined.
• Docker (currently) knows only about 1 host
• Orchestration layer for Docker containers
• Out of the box load-balancing
• Monitors and restarts containers if failed
• API driven
• Useful for creating high performance,
distributed, fault tolerant architectures.
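As an illustration of the "software defined" and "API driven" points, the sketch below builds the kind of JSON application definition that Marathon accepts on its `/v2/apps` REST endpoint. The field names follow the Marathon app definition for Docker containers as commonly documented, and should be checked against the Marathon version in use. The application ID, image and resource figures are hypothetical.

```python
import json
from typing import Any, Dict


def marathon_app(app_id: str, image: str, instances: int, cpus: float, mem_mb: int) -> Dict[str, Any]:
    """Describe a Docker-based application for Marathon as a plain dictionary."""
    return {
        "id": app_id,
        "instances": instances,  # Marathon keeps this many containers running
        "cpus": cpus,
        "mem": mem_mb,
        "container": {
            "type": "DOCKER",
            "docker": {"image": image, "network": "BRIDGE"},
        },
    }


# Hypothetical application: three nginx containers.
payload = json.dumps(marathon_app("/web", "nginx", 3, 0.5, 256), indent=2)
```

Scaling the service then amounts to changing `instances` and submitting the definition again, rather than logging in to individual hosts.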
Docker Networking in Mesos
[Diagram: a client on the internet reaches Instancearray01.bigstep.io through DNS load balancing over the WAN addresses 31.00.62.211:80, 31.00.62.212:80 and 31.00.62.213:80. Each instance (instance-1001.bigstep.io, instance-1002.bigstep.io ... instance-100n.bigstep.io) runs haproxy and connects through eno1 to the layer 2 LAN, where its containers listen on private addresses such as 172.167.1.2:80, 172.167.1.3:80 ... 172.167.1.200:80]
• Uses network namespaces
• Needs Layer 2 or software overlay network
• Each container gets a private IP
• Bigstep Automatic DNS load-balancing
• Automatic HAProxy load-balancing
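The load-balancing bullets above rest on a simple idea: successive requests are spread over a list of equivalent backends. The sketch below shows plain round-robin selection only. It is not HAProxy's or Bigstep's implementation, and the backend addresses are simply the private container addresses shown on the slide.

```python
import itertools
from typing import Iterator, List


def round_robin(backends: List[str]) -> Iterator[str]:
    """Yield backends in turn, forever, starting again after the last one."""
    if not backends:
        raise ValueError("at least one backend is required")
    return itertools.cycle(backends)


backends = ["172.167.1.2:80", "172.167.1.3:80", "172.167.1.200:80"]
picker = round_robin(backends)
first_four = [next(picker) for _ in range(4)]
```

A production balancer would also run health checks and stop handing out backends that have failed.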
Docker vs Native - Latency
[Chart: Average response time, smaller is better. Axis labelled ms, series labelled us. INSERT / SELECT / UPDATE averages: 11 / 19 / 21 and 10 / 18 / 19. Legend: 1 node native, 1 Docker container]
Source: Bigstep’s Cassandra Benchmark 2015
Docker vs Native - Throughput
[Chart: Throughput in KReq/s, bigger is better. INSERT / SELECT / UPDATE throughput (k): 149 / 92 / 82 and 168 / 96 / 90. Legend: 1 node native, 1 Docker container]
Source: Bigstep’s Cassandra Benchmark 2015
Streaming versus batch
• Resource usage patterns for streaming resemble those of web-centric systems, and need
consolidation for efficiency as well as high availability
[Charts: resource usage (%) over time, marked at 25% for the resource usage pattern of a production system and at 100% for the typical resource usage pattern of a big data analytics system]
Spark with Mesos
• Spark & Spark Streaming are great candidates for building data microservices as they are very
fast and easy to use
• Spark can use Mesos as a resource manager
• Spark needs YARN to access secure HDFS. YARN on Mesos: Myriad
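Pointing Spark at Mesos is mostly a matter of the master URL, which takes the form `mesos://host:port`. The sketch below builds a `spark-submit` command line for that case. The host name and application file are hypothetical, and the optional `spark.mesos.executor.docker.image` setting is included on the assumption that executors should run inside Docker containers.

```python
from typing import List, Optional


def spark_submit_command(mesos_master: str, app: str, docker_image: Optional[str] = None) -> List[str]:
    """Build a `spark-submit` command that uses Mesos as the resource manager."""
    command = ["spark-submit", "--master", f"mesos://{mesos_master}"]
    if docker_image is not None:
        # Ask Spark to launch its executors inside the given Docker image.
        command += ["--conf", f"spark.mesos.executor.docker.image={docker_image}"]
    command.append(app)
    return command


# Hypothetical master address and application file.
command = spark_submit_command("mesos-master.example.com:5050", "job.py")
```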
Is it hard to build a Data Lake?
• Use flexible infrastructure: workloads are very difficult to predict, as data volumes and types of
analysis change all the time
• Polyglot Persistency promotes the idea that data must be always available, but that it can be
stored in any technology that fits, e.g. Hadoop, NoSQL.
• Polyglot Programming advocates the use of the right tool for the right job. Docker-based
deployment makes environment setup more or less irrelevant.
• Mesos is more complicated to set up on-premise. Mesosphere offers a commercial product
for this. Bigstep also automates a scalable Mesos (with Docker) deployment on bare metal.
• Data import services can be tricky to set up. The problem is the organisation structure and
security. Anonymisation is required.
• A service discovery solution is required: use mesos-dns
Conclusions
• Data (micro-)services allow building a data ecosystem within your organisation. A team is a
provider of data to other teams.
• An agile data environment enables an agile business. New tools must be inserted quickly into
the mix. (E.g.: found out about Looker today, why not try it on the data?)
• There are methods to improve consolidation ratios by 40% while preserving the performance of
data services
[Diagram: Data analysis + Business modelling + Business understanding; Production Systems, machine data, prediction model, Visualization & Reports]
Ad

More Related Content

What's hot (20)

Lambda-less Stream Processing @Scale in LinkedIn
Lambda-less Stream Processing @Scale in LinkedIn Lambda-less Stream Processing @Scale in LinkedIn
Lambda-less Stream Processing @Scale in LinkedIn
DataWorks Summit/Hadoop Summit
 
Hdfs 2016-hadoop-summit-dublin-v1
Hdfs 2016-hadoop-summit-dublin-v1Hdfs 2016-hadoop-summit-dublin-v1
Hdfs 2016-hadoop-summit-dublin-v1
Chris Nauroth
 
Presentation on Large Scale Data Management
Presentation on Large Scale Data ManagementPresentation on Large Scale Data Management
Presentation on Large Scale Data Management
Chris Bunch
 
Intro to databricks delta lake
 Intro to databricks delta lake Intro to databricks delta lake
Intro to databricks delta lake
Mykola Zerniuk
 
C*ollege Credit: Is My App a Good Fit for Cassandra?
C*ollege Credit: Is My App a Good Fit for Cassandra?C*ollege Credit: Is My App a Good Fit for Cassandra?
C*ollege Credit: Is My App a Good Fit for Cassandra?
DataStax
 
Trend Micro Big Data Platform and Apache Bigtop
Trend Micro Big Data Platform and Apache BigtopTrend Micro Big Data Platform and Apache Bigtop
Trend Micro Big Data Platform and Apache Bigtop
Evans Ye
 
Owning Your Own (Data) Lake House
Owning Your Own (Data) Lake HouseOwning Your Own (Data) Lake House
Owning Your Own (Data) Lake House
Data Con LA
 
Monitoring MySQL at scale
Monitoring MySQL at scaleMonitoring MySQL at scale
Monitoring MySQL at scale
Ovais Tariq
 
Combining Machine Learning frameworks with Apache Spark
Combining Machine Learning frameworks with Apache SparkCombining Machine Learning frameworks with Apache Spark
Combining Machine Learning frameworks with Apache Spark
DataWorks Summit/Hadoop Summit
 
Exploring microservices in a Microsoft landscape
Exploring microservices in a Microsoft landscapeExploring microservices in a Microsoft landscape
Exploring microservices in a Microsoft landscape
Alex Thissen
 
Real time analytics
Real time analyticsReal time analytics
Real time analytics
Leandro Totino Pereira
 
Big Data Day LA 2016/ Use Case Driven track - Hydrator: Open Source, Code-Fre...
Big Data Day LA 2016/ Use Case Driven track - Hydrator: Open Source, Code-Fre...Big Data Day LA 2016/ Use Case Driven track - Hydrator: Open Source, Code-Fre...
Big Data Day LA 2016/ Use Case Driven track - Hydrator: Open Source, Code-Fre...
Data Con LA
 
Seamless, Real-Time Data Integration with Connect
Seamless, Real-Time Data Integration with ConnectSeamless, Real-Time Data Integration with Connect
Seamless, Real-Time Data Integration with Connect
Precisely
 
The Future of Hadoop by Arun Murthy, PMC Apache Hadoop & Cofounder Hortonworks
The Future of Hadoop by Arun Murthy, PMC Apache Hadoop & Cofounder HortonworksThe Future of Hadoop by Arun Murthy, PMC Apache Hadoop & Cofounder Hortonworks
The Future of Hadoop by Arun Murthy, PMC Apache Hadoop & Cofounder Hortonworks
Data Con LA
 
Making Every Drop Count: How i20 Addresses the Water Crisis with the IoT and ...
Making Every Drop Count: How i20 Addresses the Water Crisis with the IoT and ...Making Every Drop Count: How i20 Addresses the Water Crisis with the IoT and ...
Making Every Drop Count: How i20 Addresses the Water Crisis with the IoT and ...
DataStax
 
Cassandra Lunch #88: Cadence
Cassandra Lunch #88: CadenceCassandra Lunch #88: Cadence
Cassandra Lunch #88: Cadence
Anant Corporation
 
Spark Streaming with Azure Databricks
Spark Streaming with Azure DatabricksSpark Streaming with Azure Databricks
Spark Streaming with Azure Databricks
Dustin Vannoy
 
Sidecars and a Microservices Mesh
Sidecars and a Microservices MeshSidecars and a Microservices Mesh
Sidecars and a Microservices Mesh
Red Hat Developers
 
Cloud Computing in the Cloud (Hadoop.tw Meetup @ 2015/11/23)
Cloud Computing in the Cloud (Hadoop.tw Meetup @ 2015/11/23)Cloud Computing in the Cloud (Hadoop.tw Meetup @ 2015/11/23)
Cloud Computing in the Cloud (Hadoop.tw Meetup @ 2015/11/23)
Jeff Hung
 
20150716 introduction to apache spark v3
20150716 introduction to apache spark v3 20150716 introduction to apache spark v3
20150716 introduction to apache spark v3
Andrey Vykhodtsev
 
Hdfs 2016-hadoop-summit-dublin-v1
Hdfs 2016-hadoop-summit-dublin-v1Hdfs 2016-hadoop-summit-dublin-v1
Hdfs 2016-hadoop-summit-dublin-v1
Chris Nauroth
 
Presentation on Large Scale Data Management
Presentation on Large Scale Data ManagementPresentation on Large Scale Data Management
Presentation on Large Scale Data Management
Chris Bunch
 
Intro to databricks delta lake
 Intro to databricks delta lake Intro to databricks delta lake
Intro to databricks delta lake
Mykola Zerniuk
 
C*ollege Credit: Is My App a Good Fit for Cassandra?
C*ollege Credit: Is My App a Good Fit for Cassandra?C*ollege Credit: Is My App a Good Fit for Cassandra?
C*ollege Credit: Is My App a Good Fit for Cassandra?
DataStax
 
Trend Micro Big Data Platform and Apache Bigtop
Trend Micro Big Data Platform and Apache BigtopTrend Micro Big Data Platform and Apache Bigtop
Trend Micro Big Data Platform and Apache Bigtop
Evans Ye
 
Owning Your Own (Data) Lake House
Owning Your Own (Data) Lake HouseOwning Your Own (Data) Lake House
Owning Your Own (Data) Lake House
Data Con LA
 
Monitoring MySQL at scale
Monitoring MySQL at scaleMonitoring MySQL at scale
Monitoring MySQL at scale
Ovais Tariq
 
Combining Machine Learning frameworks with Apache Spark
Combining Machine Learning frameworks with Apache SparkCombining Machine Learning frameworks with Apache Spark
Combining Machine Learning frameworks with Apache Spark
DataWorks Summit/Hadoop Summit
 
Exploring microservices in a Microsoft landscape
Exploring microservices in a Microsoft landscapeExploring microservices in a Microsoft landscape
Exploring microservices in a Microsoft landscape
Alex Thissen
 
Big Data Day LA 2016/ Use Case Driven track - Hydrator: Open Source, Code-Fre...
Big Data Day LA 2016/ Use Case Driven track - Hydrator: Open Source, Code-Fre...Big Data Day LA 2016/ Use Case Driven track - Hydrator: Open Source, Code-Fre...
Big Data Day LA 2016/ Use Case Driven track - Hydrator: Open Source, Code-Fre...
Data Con LA
 
Seamless, Real-Time Data Integration with Connect
Seamless, Real-Time Data Integration with ConnectSeamless, Real-Time Data Integration with Connect
Seamless, Real-Time Data Integration with Connect
Precisely
 
The Future of Hadoop by Arun Murthy, PMC Apache Hadoop & Cofounder Hortonworks
The Future of Hadoop by Arun Murthy, PMC Apache Hadoop & Cofounder HortonworksThe Future of Hadoop by Arun Murthy, PMC Apache Hadoop & Cofounder Hortonworks
The Future of Hadoop by Arun Murthy, PMC Apache Hadoop & Cofounder Hortonworks
Data Con LA
 
Making Every Drop Count: How i20 Addresses the Water Crisis with the IoT and ...
Making Every Drop Count: How i20 Addresses the Water Crisis with the IoT and ...Making Every Drop Count: How i20 Addresses the Water Crisis with the IoT and ...
Making Every Drop Count: How i20 Addresses the Water Crisis with the IoT and ...
DataStax
 
Cassandra Lunch #88: Cadence
Cassandra Lunch #88: CadenceCassandra Lunch #88: Cadence
Cassandra Lunch #88: Cadence
Anant Corporation
 
Spark Streaming with Azure Databricks
Spark Streaming with Azure DatabricksSpark Streaming with Azure Databricks
Spark Streaming with Azure Databricks
Dustin Vannoy
 
Sidecars and a Microservices Mesh
Sidecars and a Microservices MeshSidecars and a Microservices Mesh
Sidecars and a Microservices Mesh
Red Hat Developers
 
Cloud Computing in the Cloud (Hadoop.tw Meetup @ 2015/11/23)
Cloud Computing in the Cloud (Hadoop.tw Meetup @ 2015/11/23)Cloud Computing in the Cloud (Hadoop.tw Meetup @ 2015/11/23)
Cloud Computing in the Cloud (Hadoop.tw Meetup @ 2015/11/23)
Jeff Hung
 
20150716 introduction to apache spark v3
20150716 introduction to apache spark v3 20150716 introduction to apache spark v3
20150716 introduction to apache spark v3
Andrey Vykhodtsev
 

Viewers also liked (20)

Lessons Learned: Using Spark and Microservices
Lessons Learned: Using Spark and MicroservicesLessons Learned: Using Spark and Microservices
Lessons Learned: Using Spark and Microservices
Alexis Seigneurin
 
Une infrastructure de stockage et sa suite analytique : Le duo gagnant du Dat...
Une infrastructure de stockage et sa suite analytique : Le duo gagnant du Dat...Une infrastructure de stockage et sa suite analytique : Le duo gagnant du Dat...
Une infrastructure de stockage et sa suite analytique : Le duo gagnant du Dat...
RSD
 
Destroying Data Silos
Destroying Data SilosDestroying Data Silos
Destroying Data Silos
DataWorks Summit
 
A Microservice Architecture for Big Data Pipelines
A Microservice Architecture for Big Data PipelinesA Microservice Architecture for Big Data Pipelines
A Microservice Architecture for Big Data Pipelines
Daniel Mescheder
 
Apache Olingo - from Incubation to a real Olingo (Apache TLP)
Apache Olingo - from Incubation to a real Olingo (Apache TLP)Apache Olingo - from Incubation to a real Olingo (Apache TLP)
Apache Olingo - from Incubation to a real Olingo (Apache TLP)
mirbo
 
Stream and Batch Processing in the Cloud with Data Microservices
Stream and Batch Processing in the Cloud with Data MicroservicesStream and Batch Processing in the Cloud with Data Microservices
Stream and Batch Processing in the Cloud with Data Microservices
marius_bogoevici
 
Datalake Architecture
Datalake ArchitectureDatalake Architecture
Datalake Architecture
TechYugadi IT Solutions & Consulting
 
Apply Machine Learning to Microservices
Apply Machine Learning to MicroservicesApply Machine Learning to Microservices
Apply Machine Learning to Microservices
Kai Wähner
 
Big data real time architectures
Big data real time architecturesBig data real time architectures
Big data real time architectures
Daniel Marcous
 
Integrating Apache NiFi and Apache Flink
Integrating Apache NiFi and Apache FlinkIntegrating Apache NiFi and Apache Flink
Integrating Apache NiFi and Apache Flink
Hortonworks
 
Azure Data Lake and U-SQL
Azure Data Lake and U-SQLAzure Data Lake and U-SQL
Azure Data Lake and U-SQL
Michael Rys
 
BigData_Chp3: Data Processing
BigData_Chp3: Data ProcessingBigData_Chp3: Data Processing
BigData_Chp3: Data Processing
Lilia Sfaxi
 
Filling the Data Lake
Filling the Data LakeFilling the Data Lake
Filling the Data Lake
DataWorks Summit/Hadoop Summit
 
From SOA to MSA
From SOA to MSAFrom SOA to MSA
From SOA to MSA
William Yang
 
Micro Service Architecture
Micro Service ArchitectureMicro Service Architecture
Micro Service Architecture
Eduards Sizovs
 
Building a Data Warehouse for Business Analytics using Spark SQL-(Blagoy Kalo...
Building a Data Warehouse for Business Analytics using Spark SQL-(Blagoy Kalo...Building a Data Warehouse for Business Analytics using Spark SQL-(Blagoy Kalo...
Building a Data Warehouse for Business Analytics using Spark SQL-(Blagoy Kalo...
Spark Summit
 
MicroService Architecture
MicroService ArchitectureMicroService Architecture
MicroService Architecture
Fred George
 
Big Data and Fast Data - Lambda Architecture in Action
Big Data and Fast Data - Lambda Architecture in ActionBig Data and Fast Data - Lambda Architecture in Action
Big Data and Fast Data - Lambda Architecture in Action
Guido Schmutz
 
Big data architectures and the data lake
Big data architectures and the data lakeBig data architectures and the data lake
Big data architectures and the data lake
James Serra
 
Developing applications with a microservice architecture (SVforum, microservi...
Developing applications with a microservice architecture (SVforum, microservi...Developing applications with a microservice architecture (SVforum, microservi...
Developing applications with a microservice architecture (SVforum, microservi...
Chris Richardson
 
Lessons Learned: Using Spark and Microservices
Lessons Learned: Using Spark and MicroservicesLessons Learned: Using Spark and Microservices
Lessons Learned: Using Spark and Microservices
Alexis Seigneurin
 
Une infrastructure de stockage et sa suite analytique : Le duo gagnant du Dat...
Une infrastructure de stockage et sa suite analytique : Le duo gagnant du Dat...Une infrastructure de stockage et sa suite analytique : Le duo gagnant du Dat...
Une infrastructure de stockage et sa suite analytique : Le duo gagnant du Dat...
RSD
 
A Microservice Architecture for Big Data Pipelines
A Microservice Architecture for Big Data PipelinesA Microservice Architecture for Big Data Pipelines
A Microservice Architecture for Big Data Pipelines
Daniel Mescheder
 
Apache Olingo - from Incubation to a real Olingo (Apache TLP)
Apache Olingo - from Incubation to a real Olingo (Apache TLP)Apache Olingo - from Incubation to a real Olingo (Apache TLP)
Apache Olingo - from Incubation to a real Olingo (Apache TLP)
mirbo
 
Stream and Batch Processing in the Cloud with Data Microservices
Stream and Batch Processing in the Cloud with Data MicroservicesStream and Batch Processing in the Cloud with Data Microservices
Stream and Batch Processing in the Cloud with Data Microservices
marius_bogoevici
 
Apply Machine Learning to Microservices
Apply Machine Learning to MicroservicesApply Machine Learning to Microservices
Apply Machine Learning to Microservices
Kai Wähner
 
Big data real time architectures
Big data real time architecturesBig data real time architectures
Big data real time architectures
Daniel Marcous
 
Integrating Apache NiFi and Apache Flink
Integrating Apache NiFi and Apache FlinkIntegrating Apache NiFi and Apache Flink
Integrating Apache NiFi and Apache Flink
Hortonworks
 
Azure Data Lake and U-SQL
Azure Data Lake and U-SQLAzure Data Lake and U-SQL
Azure Data Lake and U-SQL
Michael Rys
 
BigData_Chp3: Data Processing
BigData_Chp3: Data ProcessingBigData_Chp3: Data Processing
BigData_Chp3: Data Processing
Lilia Sfaxi
 
Micro Service Architecture
Micro Service ArchitectureMicro Service Architecture
Micro Service Architecture
Eduards Sizovs
 
Building a Data Warehouse for Business Analytics using Spark SQL-(Blagoy Kalo...
Building a Data Warehouse for Business Analytics using Spark SQL-(Blagoy Kalo...Building a Data Warehouse for Business Analytics using Spark SQL-(Blagoy Kalo...
Building a Data Warehouse for Business Analytics using Spark SQL-(Blagoy Kalo...
Spark Summit
 
MicroService Architecture
MicroService ArchitectureMicroService Architecture
MicroService Architecture
Fred George
 
Big Data and Fast Data - Lambda Architecture in Action
Big Data and Fast Data - Lambda Architecture in ActionBig Data and Fast Data - Lambda Architecture in Action
Big Data and Fast Data - Lambda Architecture in Action
Guido Schmutz
 
Big data architectures and the data lake
Big data architectures and the data lakeBig data architectures and the data lake
Big data architectures and the data lake
James Serra
 
Developing applications with a microservice architecture (SVforum, microservi...
Developing applications with a microservice architecture (SVforum, microservi...Developing applications with a microservice architecture (SVforum, microservi...
Developing applications with a microservice architecture (SVforum, microservi...
Chris Richardson
 
Ad

Similar to Data Lake and the rise of the microservices (20)

SpringPeople - Introduction to Cloud Computing
SpringPeople - Introduction to Cloud ComputingSpringPeople - Introduction to Cloud Computing
SpringPeople - Introduction to Cloud Computing
SpringPeople
 
Big data and hadoop
Big data and hadoopBig data and hadoop
Big data and hadoop
Mohit Tare
 
HPC and cloud distributed computing, as a journey
HPC and cloud distributed computing, as a journeyHPC and cloud distributed computing, as a journey
HPC and cloud distributed computing, as a journey
Peter Clapham
 
Data Analytics Using Container Persistence Through SMACK - Manny Rodriguez-Pe...
Data Analytics Using Container Persistence Through SMACK - Manny Rodriguez-Pe...Data Analytics Using Container Persistence Through SMACK - Manny Rodriguez-Pe...
Data Analytics Using Container Persistence Through SMACK - Manny Rodriguez-Pe...
{code} by Dell EMC
 
Hpc lunch and learn
Hpc lunch and learnHpc lunch and learn
Hpc lunch and learn
John D Almon
 
State of the Container Ecosystem
State of the Container EcosystemState of the Container Ecosystem
State of the Container Ecosystem
Vinay Rao
 
Big Data Storage Concepts from the "Big Data concepts Technology and Architec...
Big Data Storage Concepts from the "Big Data concepts Technology and Architec...Big Data Storage Concepts from the "Big Data concepts Technology and Architec...
Big Data Storage Concepts from the "Big Data concepts Technology and Architec...
raghdooosh
 
HDFS_architecture.ppt
HDFS_architecture.pptHDFS_architecture.ppt
HDFS_architecture.ppt
vijayapraba1
 
Ankus, bigdata deployment and orchestration framework
Ankus, bigdata deployment and orchestration frameworkAnkus, bigdata deployment and orchestration framework
Ankus, bigdata deployment and orchestration framework
Ashrith Mekala
 
DataOps with Project Amaterasu
DataOps with Project AmaterasuDataOps with Project Amaterasu
DataOps with Project Amaterasu
DataWorks Summit/Hadoop Summit
 
Lecture 5- Data Collection and Storage.pptx
Lecture 5- Data Collection and Storage.pptxLecture 5- Data Collection and Storage.pptx
Lecture 5- Data Collection and Storage.pptx
Brianc34
 
Data management in cloud computing trainee
Data management in cloud computing  traineeData management in cloud computing  trainee
Data management in cloud computing trainee
Damilola Mosaku
 
NoSQL Architecture Overview
NoSQL Architecture OverviewNoSQL Architecture Overview
NoSQL Architecture Overview
Christopher Foot
 
Hadoop introduction
Hadoop introductionHadoop introduction
Hadoop introduction
musrath mohammad
 
Azure IaaS Tanıtım - Kısa Anlatım
Azure IaaS Tanıtım - Kısa Anlatım Azure IaaS Tanıtım - Kısa Anlatım
Azure IaaS Tanıtım - Kısa Anlatım
Mustafa
 
Big Data on Cloud Native Platform
Big Data on Cloud Native PlatformBig Data on Cloud Native Platform
Big Data on Cloud Native Platform
Sunil Govindan
 
Big Data on Cloud Native Platform
Big Data on Cloud Native PlatformBig Data on Cloud Native Platform
Big Data on Cloud Native Platform
Sunil Govindan
 
Chaptor 2- Big Data Processing in big data technologies
Chaptor 2- Big Data Processing in big data technologiesChaptor 2- Big Data Processing in big data technologies
Chaptor 2- Big Data Processing in big data technologies
GulbakshiDharmale
 
Sa introduction to big data pipelining with cassandra & spark west mins...
Sa introduction to big data pipelining with cassandra & spark   west mins...Sa introduction to big data pipelining with cassandra & spark   west mins...
Sa introduction to big data pipelining with cassandra & spark west mins...
Simon Ambridge
 
Database Administration & Management - 01
Database Administration & Management - 01Database Administration & Management - 01
Database Administration & Management - 01
FaisalMashood
 
SpringPeople - Introduction to Cloud Computing
SpringPeople - Introduction to Cloud ComputingSpringPeople - Introduction to Cloud Computing
SpringPeople - Introduction to Cloud Computing
SpringPeople
 
Big data and hadoop
Big data and hadoopBig data and hadoop
Big data and hadoop
Mohit Tare
 
HPC and cloud distributed computing, as a journey
HPC and cloud distributed computing, as a journeyHPC and cloud distributed computing, as a journey
HPC and cloud distributed computing, as a journey
Peter Clapham
 
Data Analytics Using Container Persistence Through SMACK - Manny Rodriguez-Pe...
Data Analytics Using Container Persistence Through SMACK - Manny Rodriguez-Pe...Data Analytics Using Container Persistence Through SMACK - Manny Rodriguez-Pe...
Data Analytics Using Container Persistence Through SMACK - Manny Rodriguez-Pe...
{code} by Dell EMC
 
Hpc lunch and learn
Hpc lunch and learnHpc lunch and learn
Hpc lunch and learn
John D Almon
 
State of the Container Ecosystem
State of the Container EcosystemState of the Container Ecosystem
State of the Container Ecosystem
Vinay Rao
 
Big Data Storage Concepts from the "Big Data concepts Technology and Architec...
Big Data Storage Concepts from the "Big Data concepts Technology and Architec...Big Data Storage Concepts from the "Big Data concepts Technology and Architec...
Big Data Storage Concepts from the "Big Data concepts Technology and Architec...
raghdooosh
 
HDFS_architecture.ppt
HDFS_architecture.pptHDFS_architecture.ppt
HDFS_architecture.ppt
vijayapraba1
 
Ankus, bigdata deployment and orchestration framework
Ankus, bigdata deployment and orchestration frameworkAnkus, bigdata deployment and orchestration framework
Ankus, bigdata deployment and orchestration framework
Ashrith Mekala
 
Lecture 5- Data Collection and Storage.pptx
Lecture 5- Data Collection and Storage.pptxLecture 5- Data Collection and Storage.pptx
Lecture 5- Data Collection and Storage.pptx
Brianc34
 
Data management in cloud computing trainee
Data management in cloud computing  traineeData management in cloud computing  trainee
Data management in cloud computing trainee
Damilola Mosaku
 
NoSQL Architecture Overview
NoSQL Architecture OverviewNoSQL Architecture Overview
NoSQL Architecture Overview
Christopher Foot
 
Azure IaaS Tanıtım - Kısa Anlatım
Azure IaaS Tanıtım - Kısa Anlatım Azure IaaS Tanıtım - Kısa Anlatım
Azure IaaS Tanıtım - Kısa Anlatım
Mustafa
 
Big Data on Cloud Native Platform
Big Data on Cloud Native PlatformBig Data on Cloud Native Platform
Big Data on Cloud Native Platform
Sunil Govindan
 
Big Data on Cloud Native Platform
Big Data on Cloud Native PlatformBig Data on Cloud Native Platform
Big Data on Cloud Native Platform
Sunil Govindan
 
Chaptor 2- Big Data Processing in big data technologies
Chaptor 2- Big Data Processing in big data technologiesChaptor 2- Big Data Processing in big data technologies
Chaptor 2- Big Data Processing in big data technologies
GulbakshiDharmale
 
Sa introduction to big data pipelining with cassandra & spark west mins...
Sa introduction to big data pipelining with cassandra & spark   west mins...Sa introduction to big data pipelining with cassandra & spark   west mins...
Sa introduction to big data pipelining with cassandra & spark west mins...
Simon Ambridge
 
Database Administration & Management - 01
Database Administration & Management - 01Database Administration & Management - 01
Database Administration & Management - 01
FaisalMashood
 


Data Lake and the rise of the microservices

  • 1. Data Lake and the rise of the Microservices
  • 2. About Me • IT Operations Manager @BigStepInc • Tech Support 2007 - 2008 • Systems Administrator 2008 - late 2014 (from Junior to Senior) • IT Operations Manager late 2014 - Present • Passionate about improvement and systems in general • Totally dislike repetitive tasks • @mboeru • @bigstepinc • marius@bigstep.com
  • 3. About Bigstep • High performance, bare metal cloud purpose built for big data • Automatically deployed (managed and unmanaged) big data software stacks • HDFS as a Service Offering • Managed Docker platform (coming soon) • Spark clusters as a service (coming soon) • Purely on-demand: bare metal instances get deployed in 2 minutes, can be deleted anytime • Locally attached drives support • SDN controlled Layer 2 networking (40Gbps per instance, cut through) • Distributed SSD based storage fabric
  • 4. Big Data technologies for mainstream and vice-versa • Due to the cap on CPU frequency, the horizontal is the only dimension left to grow into. • Client-server architecture outdated. • All components of an application must be as independent as possible and as scalable as possible. • Big data technologies increasingly used in general purpose applications • In-memory technologies are orders of magnitude faster than the others. • Docker promotes and simplifies large scale application management using low-overhead containers instead of VMs • Mesos used with Docker and some additional services creates a Distributed OS
  • 5. Data as artefacts • Just like archeological artefacts, old data can yield new insights if correlated in a novel way or analysed with a new technology. • Throwing away data because it is of no use today might cripple the business tomorrow. [Image source: Tori Randall, Ph.D. prepares a 550-year old Peruvian child mummy for a CT scan]
  • 6. The Data Lake - A paradigm, not a technology • Store unstructured data in its original format • Store structured data along with the structure (schema) so it can be distributed onto multiple machines • Ingest massive amounts of data - go to petabyte scale if needed • Stream in or batch import data from any source • Perform new, deeper analytics by focusing on correlations between diverse data sources: clickstream, social media, machine data, documents, audio/video, etc. • Store anonymised data (keep IDs and not names or other personally identifiable information) • Promote data exploration
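The anonymisation bullet on slide 6 can be illustrated with a minimal Python sketch. The field names (`user_id`, `name`, `email`, `phone`) and the `anonymise` helper are hypothetical and do not come from the talk; the sketch only shows the idea of keeping the ID while dropping personally identifiable fields before a record is stored.

```python
# Hypothetical set of fields treated as personally identifiable (an assumption).
PII_FIELDS = {"name", "email", "phone"}


def anonymise(record: dict) -> dict:
    """Return a copy of the record without PII fields; IDs are kept."""
    return {key: value for key, value in record.items() if key not in PII_FIELDS}


# Example clickstream event with made-up values.
event = {"user_id": 42, "name": "Jane Doe", "email": "jane@example.com", "page": "/pricing"}
clean = anonymise(event)
print(clean)
```

Because the ID survives, anonymised records from different sources can still be correlated with each other, which is what the data lake needs.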
  • 7. Data Services • A data service provides data to other services [Diagram labels: Clusters Service, Cluster Timetable, Service center load predictor, Driver's path optimiser, Datalake]
  • 8. Data Services - It’s about the teams and not the technology • Conway's law: “[…] organizations which design systems ... are constrained to produce designs which are copies of the communication structures of these organizations” [Diagram: an application split into View, Controller and Model layers, each with its own group in charge (UX specialists, backend specialists, DB specialists) and poor communication between the groups]
  • 9. Per data microservice teams • Data teams are independent • A data service has its own release cycle • Ultra-specialisation is reduced • Communication among members of the same team is better [Diagram: three groups of apps, each exposed through an API and each labelled "better communication"]
  • 10. Monolith vs. Microservices approach [Diagram comparing how apps are placed on servers in the monolithic approach and in the microservices approach]
  • 11. Polyglot persistence • The data does not have to reside in the same place (e.g. the same HDFS cluster) • But it must always be available to any team, microservice, or data application authorized to use it [Diagram labels: Single DB, Single DB (slave), pieces 1 to 4, DB 1 to DB 4]
  • 13. Microservices orientated architecture • Components, not layers. • Each component can scale horizontally and is masterless • Each component can be unit tested independently • Each component can be deployed independently to production • Multiple versions of same component can coexist for a short amount of time • Using APIs to integrate components as opposed to direct method call • Use natively backward compatible API designs and implementations • Use distributed locking (e.g.: Zookeeper) instead of file based locking • Use queuing instead of blocking calls with evolving schemas (e.g.: Kafka with Avro serialiser) • Using distributed databases (e.g.: Couchbase) instead of master-slave oriented ones. Avoid immutable schemas.
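The bullets on evolving schemas and backward compatible APIs can be sketched without any serialisation library. This is not Avro itself; it only mimics the idea that a newer reader fills in defaults for fields that older producers did not send. The schema, the field names and the `read_message` helper are hypothetical.

```python
# Reader-side schema, version 2: "currency" was added later with a default,
# so messages written by version-1 producers remain readable.
READER_SCHEMA_V2 = {
    "order_id": None,   # None marks a required field (no default)
    "amount": None,
    "currency": "EUR",  # new in v2; the default keeps old messages valid
}


def read_message(message: dict, schema: dict = READER_SCHEMA_V2) -> dict:
    """Project a message onto the reader schema, applying defaults."""
    result = {}
    for field, default in schema.items():
        if field in message:
            result[field] = message[field]
        elif default is not None:
            result[field] = default
        else:
            raise ValueError(f"missing required field: {field}")
    return result


old_message = {"order_id": 1, "amount": 9.5}                     # v1 producer
new_message = {"order_id": 2, "amount": 3.0, "currency": "USD"}  # v2 producer
print(read_message(old_message))
print(read_message(new_message))
```

Fields the reader does not know about are simply ignored, so for a short time producers and consumers of different versions can coexist, as the slide requires.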
  • 14. Docker • A Docker container is neither a VM nor a VPS • Application level virtualisation • Same kernel • No performance overhead • Instant deployment • Usually a single app per container • Uses libcontainer (previously used LXC) engine (network namespaces and cgroups) • Git-like deployment method with branches and repositories. [Diagram: several containers on one shared kernel, each with a LAN vNIC and a WAN vNIC]
  • 15. Docker Persistency • Docker is designed for services that do not need persistency, but it does support it • By default, every container has a unique clone of the filesystem in the image • All changes to this clone are stored in per-container directories that do not get garbage collected • A new container has a new tree, so restarting a container without an explicit mapping looks as if the data had been destroyed • Docker achieves persistency by mapping directories from the host machine to the container.
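The host-directory mapping in the last bullet corresponds to the `-v host_path:container_path` option of `docker run`. The Python sketch below only builds the command line and does not run it; the image name and the paths are placeholders chosen for illustration.

```python
import shlex


def docker_run_command(image: str, host_dir: str, container_dir: str) -> list:
    """Build a `docker run` argv that bind-mounts host_dir into the container."""
    return ["docker", "run", "-d", "-v", f"{host_dir}:{container_dir}", image]


# Placeholder image and paths (assumptions, not taken from the slides).
cmd = docker_run_command("cassandra", "/data/cassandra", "/var/lib/cassandra")
print(shlex.join(cmd))
```

With such a mapping, a replacement container started with the same `-v` option sees the data written by its predecessor, because the files live on the host rather than in the container's own filesystem tree.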
  • 16. Mesos & Marathon • Allows an app’s environment to be software defined. • Docker (currently) knows only about 1 host • Orchestration layer for Docker containers • Out of the box load-balancing • Monitors and restarts containers if failed • API driven • Useful for creating high performance, distributed, fault tolerant architectures. [Diagram: a grid of containers (C) spread across hosts]
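As a rough illustration of "API driven", the sketch below builds the kind of JSON app definition that Marathon accepts on its `/v2/apps` REST endpoint. The app id, image and resource figures are made up, and the field names follow the Marathon app format of that period, so they should be checked against the Marathon version in use. Nothing is sent over the network here.

```python
import json


def marathon_app(app_id: str, image: str, instances: int) -> dict:
    """Build a Marathon-style app definition for a Docker container."""
    return {
        "id": app_id,
        "cpus": 0.5,             # illustrative resource request
        "mem": 256,              # MiB, illustrative
        "instances": instances,  # Marathon keeps this many copies running
        "container": {
            "type": "DOCKER",
            "docker": {
                "image": image,
                "network": "BRIDGE",
                # hostPort 0 asks for a dynamically assigned host port
                "portMappings": [{"containerPort": 80, "hostPort": 0}],
            },
        },
    }


app = marathon_app("/web", "nginx", 3)
payload = json.dumps(app, indent=2)
print(payload)  # body for an HTTP POST to <marathon-host>/v2/apps
```

Scaling then becomes a matter of changing `instances` and submitting the definition again, rather than logging in to individual hosts.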
  • 17. Docker Networking in Mesos • Uses network namespaces • Needs Layer 2 or software overlay network • Each container gets a private IP • Bigstep Automatic DNS load-balancing • Automatic HAProxy load-balancing [Diagram: a client on the internet reaches Instancearray01.bigstep.io through DNS load balancing over the WAN; instances instance-1001.bigstep.io to instance-100n.bigstep.io each run haproxy on a public address (31.00.62.211:80 to 31.00.62.213:80) in front of containers with private addresses (172.167.x.x:80), and are connected through eno1 to a layer 2 LAN]
  • 18. Docker vs Native - Latency [Chart: average response time, smaller is better, for INSERT, SELECT and UPDATE; series: 1 node native and 1 Docker container; data labels 11, 19, 21 and 10, 18, 19; the axis is labelled ms and the series labels us] Source: Bigstep’s Cassandra Benchmark 2015
  • 19. Docker vs Native - Throughput [Chart: throughput in thousands of requests per second, bigger is better, for INSERT, SELECT and UPDATE; series: 1 node native and 1 Docker container; data labels 149, 92, 82 and 168, 96, 90] Source: Bigstep’s Cassandra Benchmark 2015
  • 20. Streaming versus batch • Resource usage patterns for streaming resemble those of web-centric systems, and need consolidation for efficiency as well as high availability [Charts of resource usage (%) over time: the resource usage pattern of a production system, marked at 25%, and the typical resource usage pattern of a big data analytics system, marked at 100%]
  • 21. Spark with Mesos • Spark & Spark Streaming are great candidates for building data microservices as they are very fast and easy to use • Spark can use Mesos as a resource manager • Spark needs YARN to access Secure HDFS • YARN on Mesos: Myriad
  • 22. Is it hard to build a Data Lake? • Use flexible infrastructure: workloads are very difficult to predict, as data volumes and types of analysis change all the time • Polyglot Persistency promotes the idea that data must always be available, but that it can be stored in any technology that fits, e.g. Hadoop or NoSQL. • Polyglot Programming advocates the use of the right tool for the right job. Docker-based deployment makes environment setup more or less irrelevant. • Mesos is more complicated to set up on-premise. Mesosphere offers a commercial product for this. Bigstep also automates a scalable Mesos (with Docker) deployment on bare metal. • Data import services can be tricky to set up. The problem is the organisation structure and security. Anonymisation is required. • A service discovery solution is required: use mesos-dns
  • 23. Conclusions • Data (micro-)services allow building a data ecosystem within your organisation. A team is a provider of data to other teams. • An agile data environment enables an agile business. New tools must be inserted quickly into the mix (e.g. found out about Looker today, why not try it on the data?). • There are methods to improve consolidation ratios by 40% while preserving the performance of data services [Diagram labels: Data analysis, Business modelling, Business understanding; Production Systems, machine data, prediction model, Visualization & Reports]