SlideShare a Scribd company logo
Case Study
Elasticsearch Ingest @ Cisco Intercloud
Agenda
• Express Overview of StreamSets Data Collector
Kirit Basu, Product Management, StreamSets
• Introduction to Elastic
CatherineJohnson, Solutions Architect, Elastic
• Implementing Shipped Analytics Using StreamSets and Elasticsearch
Dmitri Chtchourov, Innovation Architect, Cloud Solutions CTO Group
Group
Performance Management
for Data Flows
© 2015 StreamSets, Inc. All rights reserved. May not be copied, modified, or distributed in whole or part without written consent of StreamSets, Inc.
History Founded by Informatica and Cloudera veterans.
Mission Bring operational excellence to managing data in motion.
Challenge Move data efficiently and with quality in the face of change.
Solution Open source software enabling performance management of
data flows.
Use cases Hadoop Ingest, Search Ingest, Message Broker Enablement,
Log Shipping, Cloud Migration, IoT, ...
Momentum Thousands of downloads, hundreds of companies using.
StreamSets At a Glance
© 2015 StreamSets, Inc. All rights reserved. May not be copied, modified, or distributed in whole or part without written consent of StreamSets, Inc.
StreamSets Data Collector
Adaptable Flows for Efficiency
Design ingest pipelines with minimal coding and
maximum flexibility.
Data Flow KPIs for Control
Monitor and act on data flow performance and
data quality.
Containerized Architecture for Agility
Operate continuously in the face of constant
change.
Open source software for the rapid
development and reliably operation of
complex data flows.
Get Started with StreamSets
http://streamsets.com/opensource
https://github.com/streamsets/datacollector/
#streamsets
March 2016
Introduction to Elastic
Software that makes massive amounts of
structured and unstructured data usable for
search, logging, analytics, and more in mission
critical systems and applications
Examples: Elastic Stack Use Cases
Logging
IT Operations
Application Management
Security Analytics
Analytics Search
Marketing Insights
Business Development
Customer Sentiment
Website Search
Internal/Intranet Search
URL Search
Internal Systems/Applications External Systems/Applications
Developers IT/Ops Business Users
Elastic Solves Many Developer Use Cases
Social
Location
User-
Activity
Machine
(Log files)
Documents
Handles Complex
& Diverse Data
Meets Today’s Core
Developer Requirements
Developer requirements
Many users / use cases
Fast data processing
Large data volumes
Data quality & integrity
Cross-source insights
Solves Critical
Use Cases
Application
Search
Embedded
Search
Logging
Security
Analytics
Operational
Analytics
More …
The Elastic Stack
Ingest
Store, Index,
& Analyze
User Interface
Plugins Monitoring Security Alerting
Elastic Cloud: Hosted Elasticsearch
Thank you!
www.elastic.co
Implementing Shipped Analytics Using
Streamsets and Elasticsearch
Dmitri Chtchourov, Innovation Architect, Cloud Solutions CTO Group
Tymofii Polekhin, Software Engineer
Agenda
MANTL & Shipped
Shipped Analytics for Shipped
Why we need Shipped Analytics?
Archtecture and Data Flow
Streamsets Pipelines
End to end dataflow and performance with Elasticsearch
Benefits of Streamsets
Demo
Microservices managed and scaled separately
Microservices managed by Mesos in a single platform
Microservices architecture for Mesos frameworks and other components
CIS/AWS/Metastack/vSphere/UCS…
Terraform
Spark
Executor N
Spark
Executor 1
Spark
Scheduler
Kafka
Broker N
Kafka
Broker 1
Kafka
Scheduler
Docker Docker
TraefikMicroservices …
REST API
REST API
Scripted provisioning
Direct provisioning
Policy, Auto-scaling
VM1
or
BM1
VM2
or
BM2
VM3
or
BM3
VM4
or
BM4
VM5
or
BM5
Case Study: Elasticsearch Ingest Using StreamSets @ Cisco Intercloud
Shipped Analytics Cluster
Probe
Probe
Probe
• Both Shipped and Shipped Analytics running on MANTL
• Shipped Analytics – infra and app logs and metrics analysis
mesos-master
mesos-slave
marathon
zookeeper
consul
syslog
frameworks
collectd
cpu
memory
interface
disk
df
load
docker
zookeeper
marathon
mesos-slave
mesos-master
CollectD and Filebeat processes
running on every node in the
cluster.
Infrastructure Layer
Zookeeper Cluster Consul Cluster
Mesos Cluster
Marathon Framework
Kafka Cluster
topbeat filebeat
journalbeat dockerbeat
• Experimenting with Elastic Beats (unified arch., closer to micro-services model)
• Elastic Beats to replace collectd plugins and cAdvisor for containers
<file | top | *>beat collectd
logstash
DNS SRV beats.logstash.service.consul
Data normalization
Tagging
Cluster name decoration
Logstash is a single process per
cluster, discoverable with
standard inter-cluster
discovery mechanism, which
will get metrics from collectd
on every slave and logs from
filebeat on every slave,
normalize data and send to
desired output
DNS SRV collectd.logstash.service.consul
NOTE: currently Logstash is running in Docker container on every node, will be moving to Filebeat and Logstash mesos framework soon
logstash
Kafka 0.9.0.0 supports SSL
authentication and data
encryption for producers.
This is must-have security
when sending data to external
destination through WAN.
Sending data to central SA
cluster for long-term analytics
SSL encryption
WAN
kafka
SSL authentication
Shipped cluster
Shipped Analytics
StreamSets running in Mesos
Spark Cluster mode processing
data from multiple source
Shipped clusters and storing it
in Elasticsearch cluster.
kafka
elasticsearch
Streamsets Spark Streaming Cluster
Spark Job
Master instance
Spark Job Spark Job Spark Job
Lambda Reference Architecture
Monitoring / Analytics Cluster (local, Texas-3)
Global Monitoring / Analytics Cluster (global, Texas-1)
Monitoring / Analytics Cluster (local, Ams. -1 )
Monitoring / Analytics Cluster (local, Lon.-1)
Local components and deployment is the same as global, just smaller
Real-time and batch processing (Lambda), anomaly detection, visualization
SSL
Kafka
SSL
SSL
MQTT
Divide nodes by role for more
stable cluster operation and
ease of scalability
3 master/search nodes
5 live data nodes
3 archive data nodes
master/
search
master/
search
master/
search
live/
data
live/
data
live/
data
live/
data
live/
data
archive
/data
archive
/data
archive
/data
Shards=5 Replicas=4 Shards=5 Replicas=1
archive
/data
archive
/data
CPU=4
RAM=30GB
HDD=4TB
CPU=4
RAM=30GB
HDD=4TB
CPU=4
RAM=30GB
HDD=4TB
Streamsets pipelines process
incoming messages and
transform them according to
business logic requirements,
normalizing metrics and
parsing log lines; popping up
important information using
GROK filters or scripts.
Cluster Name
Decorator
Fields Type
Normalization
Metrics/Logs
Stream Splitter
ES Logs Output
General GROK
Filters
Float Value
Truncate
ES Metrics
Output
Shipped GROK
Logic
Marathon
• Streamsets instances running in docker containers in Marathon
o Easy deployment and scaling
o Fast upgrade to newer version
• Issues we faced with this approach:
o Containers were killed by marathon
o Needed to re-import pipeline every time we launch container
Marathon
• Working with Streamsets trying to resolve the OOM issue we increased
container memory and SDC heap size
• At first, all looked normal and we thought that it was just
starving on resources, but several days later we had SDC killed again
• We increased MEM and HEAP even more – to 16G, but we bought just
another day or two before is was killed again
• Looked like SDC heap were constantly filling with data
that don’t go away and eventually it kills the container
• Also GC was working hard and sometimes we got freezes
up to 60 seconds
• Decided to move out from Docker
Marathon
• Streamsets reading JSON messages from Kafka cluster and output
to Elasticsearch cluster
o De-serializing and serializing JSON was very slow with single
threaded process
o Consuming from Kafka performance test showed:
 JSON format: 5k records/sec avg
 Text format: 50k records/sec avg
 Binary format: 250k records/sec avg
• Streamsets team were very proactive with this issues
and in 2 days we received a fix for multi-threaded JSON parsing
o New testing showed:
 JSON format: 66k records/sec avg
Marathon
• Streamsets has never failed because of any internal logic bugs
but we kept seeing this oom-killer popping up and recovering was
not automated
• We decided to leave docker and run SDC natively on host,
still using Marathon for scaling and failover
• Without docker, we now can upload our pipeline on SDC startup,
and it will start working as soon as instance has loaded
We can freely scale up/down whenever we need
Also, we got rid of oom-killer issue as well
Each one of our 3 SDC instances already processes ~3B messages, with no issues!
• Streamsets pipeline consume metrics gathered by collectd
and logs gathered by logstash from 4 different clusters
(including self), transform and decorate them and send to
Elasticsearch for storage and analytics.
• First of all we consume messages from Kafka topic at
average of 5,000 messages per second. The consumer
itself parses JSON-format and sends further.
• Next stage is a JavaScript script that decorates messages
with cluster name, based on a instance hostname in that
message
• Finally, we exclude Marathon events from stream sending
them directly to ES
• Next stage will splits stream into 2 parts: logs and metrics
• Metrics are send straight to ES without any transformation
• Logs are the most interesting part:
o We pop docker container logs from stream and
delete “time” field that’s duplicate timstamp and
sending them to ES
o We separate logs from specific clusters, because we
need to apply special logic for them
o Separation is done though mapping IP’s to clusters in
the pipeline realtime
• Collecting data from several Mesos clusters and need to
correlate container metrics with it’s logs
• Use appID taskID and runID to identify specific containers
logs
• Container logs itself have all three of this, while mesos-
master and mesos-agent logs lacks runID
• All unidentified data is discarded
Current ShippedAnalytics prod cluster configuration:
Kafka Cluster: 7 brokers with 4CPU and 16GB RAM each
Logstash topic for all incoming messages with 7 partitions and 2 replicas
Current data flow is avg 5000 messages/sec to Kafka
Current data size is avg 1,2MB/sec to Kafka
Streamsets: 3 instances with identical pipeline configuration reading from Kafka cluster
7 partitions are split between 3 instances like 3/2/2
All 3 instances running natively on host (non-docker) with Marathon
Marathon restarts failed instance with automatic pipeline upload and start
Elasticsearch: 7 nodes with 4CPU, 16GB RAM and 2TB storage each
Each metrics is written to its own index, total of 15 indexes
Each index has 5 primary shards and 5 replica shards
Total Doc count: 17,5B Total Doc size: 3.84TB
1 Day rate count: ~500M 1 Day rate size: ~120GB
Streamsets is a great product to work with, also team is super helpful and works fast
• Lots of input and output connectors, huge processing capabilities
• Very intuitive and rich User Interface
• Easy to create pipelines visually, instead of writing code
• Clear data flow paths
• Small resource consumption compared to performance
• Easily can handle up to 10k records/sec to Elasticsearch with 1CPU 2GB RAM
• Simple configuration and deployment process
• Opensource(!)
• Fast logic changes with minimum downtime
• Preview mode(!) – check every stage before throwing all your data it
• Rich data transformation possibilities
• GROK filters – easy to migrate from Logstash
• Smart Errors handling
• Reliable: not once did Streamets crashed by itself – only Docker, Marathon, Mesos issues
Thank You!
Ad

Recommended

Streamsets and spark
Streamsets and spark
Hari Shreedharan
 
Streamsets and spark at SF Hadoop User Group
Streamsets and spark at SF Hadoop User Group
Hari Shreedharan
 
Building a data pipeline to ingest data into Hadoop in minutes using Streamse...
Building a data pipeline to ingest data into Hadoop in minutes using Streamse...
Guglielmo Iozzia
 
Building Data Pipelines with Spark and StreamSets
Building Data Pipelines with Spark and StreamSets
Pat Patterson
 
Streamsets and spark in Retail
Streamsets and spark in Retail
Hari Shreedharan
 
Telco analytics at scale
Telco analytics at scale
datamantra
 
A Big Data Lake Based on Spark for BBVA Bank-(Oscar Mendez, STRATIO)
A Big Data Lake Based on Spark for BBVA Bank-(Oscar Mendez, STRATIO)
Spark Summit
 
Data Policies for the Kafka-API with WebAssembly | Alexander Gallego, Vectorized
Data Policies for the Kafka-API with WebAssembly | Alexander Gallego, Vectorized
HostedbyConfluent
 
Modern ETL Pipelines with Change Data Capture
Modern ETL Pipelines with Change Data Capture
Databricks
 
From Batch to Streaming ET(L) with Apache Apex
From Batch to Streaming ET(L) with Apache Apex
DataWorks Summit
 
Kafka Summit SF 2017 - Keynote - Go Against the Flow: Databases and Stream Pr...
Kafka Summit SF 2017 - Keynote - Go Against the Flow: Databases and Stream Pr...
confluent
 
Meetup070416 Presentations
Meetup070416 Presentations
Ana Rebelo
 
Big Telco - Yousun Jeong
Big Telco - Yousun Jeong
Spark Summit
 
More Data, More Problems: Scaling Kafka-Mirroring Pipelines at LinkedIn
More Data, More Problems: Scaling Kafka-Mirroring Pipelines at LinkedIn
confluent
 
Lambda architecture: from zero to One
Lambda architecture: from zero to One
Serg Masyutin
 
Low-latency data applications with Kafka and Agg indexes | Tino Tereshko, Fir...
Low-latency data applications with Kafka and Agg indexes | Tino Tereshko, Fir...
HostedbyConfluent
 
Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...
Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...
HostedbyConfluent
 
Insights Without Tradeoffs Using Structured Streaming keynote by Michael Armb...
Insights Without Tradeoffs Using Structured Streaming keynote by Michael Armb...
Spark Summit
 
August 2016 HUG: Open Source Big Data Ingest with StreamSets Data Collector
August 2016 HUG: Open Source Big Data Ingest with StreamSets Data Collector
Yahoo Developer Network
 
The delta architecture
The delta architecture
Prakash Chockalingam
 
Data platform architecture principles - ieee infrastructure 2020
Data platform architecture principles - ieee infrastructure 2020
Julien Le Dem
 
Real-Time Data Pipelines with Kafka, Spark, and Operational Databases
Real-Time Data Pipelines with Kafka, Spark, and Operational Databases
SingleStore
 
Tangram: Distributed Scheduling Framework for Apache Spark at Facebook
Tangram: Distributed Scheduling Framework for Apache Spark at Facebook
Databricks
 
Kappa Architecture on Apache Kafka and Querona: datamass.io
Kappa Architecture on Apache Kafka and Querona: datamass.io
Piotr Czarnas
 
Interactive Visualization of Streaming Data Powered by Spark by Ruhollah Farc...
Interactive Visualization of Streaming Data Powered by Spark by Ruhollah Farc...
Spark Summit
 
Databricks Delta Lake and Its Benefits
Databricks Delta Lake and Its Benefits
Databricks
 
What to Expect for Big Data and Apache Spark in 2017
What to Expect for Big Data and Apache Spark in 2017
Databricks
 
ASPgems - kappa architecture
ASPgems - kappa architecture
Juantomás García Molina
 
Monitoring Kafka w/ Prometheus
Monitoring Kafka w/ Prometheus
kawamuray
 
ElasticSearch Basic Introduction
ElasticSearch Basic Introduction
Mayur Rathod
 

More Related Content

What's hot (20)

Modern ETL Pipelines with Change Data Capture
Modern ETL Pipelines with Change Data Capture
Databricks
 
From Batch to Streaming ET(L) with Apache Apex
From Batch to Streaming ET(L) with Apache Apex
DataWorks Summit
 
Kafka Summit SF 2017 - Keynote - Go Against the Flow: Databases and Stream Pr...
Kafka Summit SF 2017 - Keynote - Go Against the Flow: Databases and Stream Pr...
confluent
 
Meetup070416 Presentations
Meetup070416 Presentations
Ana Rebelo
 
Big Telco - Yousun Jeong
Big Telco - Yousun Jeong
Spark Summit
 
More Data, More Problems: Scaling Kafka-Mirroring Pipelines at LinkedIn
More Data, More Problems: Scaling Kafka-Mirroring Pipelines at LinkedIn
confluent
 
Lambda architecture: from zero to One
Lambda architecture: from zero to One
Serg Masyutin
 
Low-latency data applications with Kafka and Agg indexes | Tino Tereshko, Fir...
Low-latency data applications with Kafka and Agg indexes | Tino Tereshko, Fir...
HostedbyConfluent
 
Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...
Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...
HostedbyConfluent
 
Insights Without Tradeoffs Using Structured Streaming keynote by Michael Armb...
Insights Without Tradeoffs Using Structured Streaming keynote by Michael Armb...
Spark Summit
 
August 2016 HUG: Open Source Big Data Ingest with StreamSets Data Collector
August 2016 HUG: Open Source Big Data Ingest with StreamSets Data Collector
Yahoo Developer Network
 
The delta architecture
The delta architecture
Prakash Chockalingam
 
Data platform architecture principles - ieee infrastructure 2020
Data platform architecture principles - ieee infrastructure 2020
Julien Le Dem
 
Real-Time Data Pipelines with Kafka, Spark, and Operational Databases
Real-Time Data Pipelines with Kafka, Spark, and Operational Databases
SingleStore
 
Tangram: Distributed Scheduling Framework for Apache Spark at Facebook
Tangram: Distributed Scheduling Framework for Apache Spark at Facebook
Databricks
 
Kappa Architecture on Apache Kafka and Querona: datamass.io
Kappa Architecture on Apache Kafka and Querona: datamass.io
Piotr Czarnas
 
Interactive Visualization of Streaming Data Powered by Spark by Ruhollah Farc...
Interactive Visualization of Streaming Data Powered by Spark by Ruhollah Farc...
Spark Summit
 
Databricks Delta Lake and Its Benefits
Databricks Delta Lake and Its Benefits
Databricks
 
What to Expect for Big Data and Apache Spark in 2017
What to Expect for Big Data and Apache Spark in 2017
Databricks
 
ASPgems - kappa architecture
ASPgems - kappa architecture
Juantomás García Molina
 
Modern ETL Pipelines with Change Data Capture
Modern ETL Pipelines with Change Data Capture
Databricks
 
From Batch to Streaming ET(L) with Apache Apex
From Batch to Streaming ET(L) with Apache Apex
DataWorks Summit
 
Kafka Summit SF 2017 - Keynote - Go Against the Flow: Databases and Stream Pr...
Kafka Summit SF 2017 - Keynote - Go Against the Flow: Databases and Stream Pr...
confluent
 
Meetup070416 Presentations
Meetup070416 Presentations
Ana Rebelo
 
Big Telco - Yousun Jeong
Big Telco - Yousun Jeong
Spark Summit
 
More Data, More Problems: Scaling Kafka-Mirroring Pipelines at LinkedIn
More Data, More Problems: Scaling Kafka-Mirroring Pipelines at LinkedIn
confluent
 
Lambda architecture: from zero to One
Lambda architecture: from zero to One
Serg Masyutin
 
Low-latency data applications with Kafka and Agg indexes | Tino Tereshko, Fir...
Low-latency data applications with Kafka and Agg indexes | Tino Tereshko, Fir...
HostedbyConfluent
 
Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...
Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...
HostedbyConfluent
 
Insights Without Tradeoffs Using Structured Streaming keynote by Michael Armb...
Insights Without Tradeoffs Using Structured Streaming keynote by Michael Armb...
Spark Summit
 
August 2016 HUG: Open Source Big Data Ingest with StreamSets Data Collector
August 2016 HUG: Open Source Big Data Ingest with StreamSets Data Collector
Yahoo Developer Network
 
Data platform architecture principles - ieee infrastructure 2020
Data platform architecture principles - ieee infrastructure 2020
Julien Le Dem
 
Real-Time Data Pipelines with Kafka, Spark, and Operational Databases
Real-Time Data Pipelines with Kafka, Spark, and Operational Databases
SingleStore
 
Tangram: Distributed Scheduling Framework for Apache Spark at Facebook
Tangram: Distributed Scheduling Framework for Apache Spark at Facebook
Databricks
 
Kappa Architecture on Apache Kafka and Querona: datamass.io
Kappa Architecture on Apache Kafka and Querona: datamass.io
Piotr Czarnas
 
Interactive Visualization of Streaming Data Powered by Spark by Ruhollah Farc...
Interactive Visualization of Streaming Data Powered by Spark by Ruhollah Farc...
Spark Summit
 
Databricks Delta Lake and Its Benefits
Databricks Delta Lake and Its Benefits
Databricks
 
What to Expect for Big Data and Apache Spark in 2017
What to Expect for Big Data and Apache Spark in 2017
Databricks
 

Viewers also liked (20)

Monitoring Kafka w/ Prometheus
Monitoring Kafka w/ Prometheus
kawamuray
 
ElasticSearch Basic Introduction
ElasticSearch Basic Introduction
Mayur Rathod
 
Case Study: Elasticsearch Ingest Using StreamSets at Cisco Intercloud
Case Study: Elasticsearch Ingest Using StreamSets at Cisco Intercloud
Rick Bilodeau
 
Bad Data is Polluting Big Data
Bad Data is Polluting Big Data
Streamsets Inc.
 
Logging infrastructure for Microservices using StreamSets Data Collector
Logging infrastructure for Microservices using StreamSets Data Collector
Cask Data
 
ElasticSearch on AWS - Real Estate portal case study (Spitogatos.gr)
ElasticSearch on AWS - Real Estate portal case study (Spitogatos.gr)
Andreas Chatzakis
 
Adaptive Data Cleansing with StreamSets and Cassandra (Pat Patterson, StreamS...
Adaptive Data Cleansing with StreamSets and Cassandra (Pat Patterson, StreamS...
DataStax
 
Real-time personal trainer on the SMACK stack
Real-time personal trainer on the SMACK stack
Anirvan Chakraborty
 
Demystifying salesforce for developers
Demystifying salesforce for developers
Heitor Souza
 
Kafka Lambda architecture with mirroring
Kafka Lambda architecture with mirroring
Anant Rustagi
 
Extreme Salesforce Data Volumes Webinar
Extreme Salesforce Data Volumes Webinar
Salesforce Developers
 
How Apache Kafka is transforming Hadoop, Spark and Storm
How Apache Kafka is transforming Hadoop, Spark and Storm
Edureka!
 
Large volume data analysis on the Typesafe Reactive Platform - Big Data Scala...
Large volume data analysis on the Typesafe Reactive Platform - Big Data Scala...
Martin Zapletal
 
Cassandra Day Atlanta 2016 - Monitoring Cassandra
Cassandra Day Atlanta 2016 - Monitoring Cassandra
aaronmorton
 
Towards Benchmaking Modern Distruibuted Systems-(Grace Huang, Intel)
Towards Benchmaking Modern Distruibuted Systems-(Grace Huang, Intel)
Spark Summit
 
Prometheus lightning talk (Devops Dublin March 2015)
Prometheus lightning talk (Devops Dublin March 2015)
Brian Brazil
 
Handling of Large Data by Salesforce
Handling of Large Data by Salesforce
Thinqloud
 
Big Data Day LA 2015 - Event Driven Architecture for Web Analytics by Peyman ...
Big Data Day LA 2015 - Event Driven Architecture for Web Analytics by Peyman ...
Data Con LA
 
Machine learning at Scale with Apache Spark
Machine learning at Scale with Apache Spark
Martin Zapletal
 
Salesforce REST API
Salesforce REST API
Bohdan Dovhań
 
Monitoring Kafka w/ Prometheus
Monitoring Kafka w/ Prometheus
kawamuray
 
ElasticSearch Basic Introduction
ElasticSearch Basic Introduction
Mayur Rathod
 
Case Study: Elasticsearch Ingest Using StreamSets at Cisco Intercloud
Case Study: Elasticsearch Ingest Using StreamSets at Cisco Intercloud
Rick Bilodeau
 
Bad Data is Polluting Big Data
Bad Data is Polluting Big Data
Streamsets Inc.
 
Logging infrastructure for Microservices using StreamSets Data Collector
Logging infrastructure for Microservices using StreamSets Data Collector
Cask Data
 
ElasticSearch on AWS - Real Estate portal case study (Spitogatos.gr)
ElasticSearch on AWS - Real Estate portal case study (Spitogatos.gr)
Andreas Chatzakis
 
Adaptive Data Cleansing with StreamSets and Cassandra (Pat Patterson, StreamS...
Adaptive Data Cleansing with StreamSets and Cassandra (Pat Patterson, StreamS...
DataStax
 
Real-time personal trainer on the SMACK stack
Real-time personal trainer on the SMACK stack
Anirvan Chakraborty
 
Demystifying salesforce for developers
Demystifying salesforce for developers
Heitor Souza
 
Kafka Lambda architecture with mirroring
Kafka Lambda architecture with mirroring
Anant Rustagi
 
Extreme Salesforce Data Volumes Webinar
Extreme Salesforce Data Volumes Webinar
Salesforce Developers
 
How Apache Kafka is transforming Hadoop, Spark and Storm
How Apache Kafka is transforming Hadoop, Spark and Storm
Edureka!
 
Large volume data analysis on the Typesafe Reactive Platform - Big Data Scala...
Large volume data analysis on the Typesafe Reactive Platform - Big Data Scala...
Martin Zapletal
 
Cassandra Day Atlanta 2016 - Monitoring Cassandra
Cassandra Day Atlanta 2016 - Monitoring Cassandra
aaronmorton
 
Towards Benchmaking Modern Distruibuted Systems-(Grace Huang, Intel)
Towards Benchmaking Modern Distruibuted Systems-(Grace Huang, Intel)
Spark Summit
 
Prometheus lightning talk (Devops Dublin March 2015)
Prometheus lightning talk (Devops Dublin March 2015)
Brian Brazil
 
Handling of Large Data by Salesforce
Handling of Large Data by Salesforce
Thinqloud
 
Big Data Day LA 2015 - Event Driven Architecture for Web Analytics by Peyman ...
Big Data Day LA 2015 - Event Driven Architecture for Web Analytics by Peyman ...
Data Con LA
 
Machine learning at Scale with Apache Spark
Machine learning at Scale with Apache Spark
Martin Zapletal
 
Ad

Similar to Case Study: Elasticsearch Ingest Using StreamSets @ Cisco Intercloud (20)

Centralized log-management-with-elastic-stack
Centralized log-management-with-elastic-stack
Rich Lee
 
IBM Cloud Native Day April 2021: Serverless Data Lake
IBM Cloud Native Day April 2021: Serverless Data Lake
Torsten Steinbach
 
Kafka streams decoupling with stores
Kafka streams decoupling with stores
Yoni Farin
 
Serverless SQL
Serverless SQL
Torsten Steinbach
 
Centralized Logging System Using ELK Stack
Centralized Logging System Using ELK Stack
Rohit Sharma
 
IBM Cloud Day January 2021 Data Lake Deep Dive
IBM Cloud Day January 2021 Data Lake Deep Dive
Torsten Steinbach
 
ELK Ruminating on Logs (Zendcon 2016)
ELK Ruminating on Logs (Zendcon 2016)
Mathew Beane
 
Otimizações de Projetos de Big Data, Dw e AI no Microsoft Azure
Otimizações de Projetos de Big Data, Dw e AI no Microsoft Azure
Luan Moreno Medeiros Maciel
 
Databricks Platform.pptx
Databricks Platform.pptx
Alex Ivy
 
AWS re:Invent presentation: Unmeltable Infrastructure at Scale by Loggly
AWS re:Invent presentation: Unmeltable Infrastructure at Scale by Loggly
SolarWinds Loggly
 
DEVNET-1140 InterCloud Mapreduce and Spark Workload Migration and Sharing: Fi...
DEVNET-1140 InterCloud Mapreduce and Spark Workload Migration and Sharing: Fi...
Cisco DevNet
 
Instrumenting and Scaling Databases with Envoy
Instrumenting and Scaling Databases with Envoy
Daniel Hochman
 
Typesafe spark- Zalando meetup
Typesafe spark- Zalando meetup
Stavros Kontopoulos
 
Enabling Microservices Frameworks to Solve Business Problems
Enabling Microservices Frameworks to Solve Business Problems
Ken Owens
 
Logging, Metrics, and APM: The Operations Trifecta
Logging, Metrics, and APM: The Operations Trifecta
Elasticsearch
 
Fluentd at Bay Area Kubernetes Meetup
Fluentd at Bay Area Kubernetes Meetup
Sadayuki Furuhashi
 
Kafka & Hadoop in Rakuten
Kafka & Hadoop in Rakuten
Rakuten Group, Inc.
 
Cloud Security Monitoring and Spark Analytics
Cloud Security Monitoring and Spark Analytics
amesar0
 
Big Data_Architecture.pptx
Big Data_Architecture.pptx
betalab
 
Choose Right Stream Storage: Amazon Kinesis Data Streams vs MSK
Choose Right Stream Storage: Amazon Kinesis Data Streams vs MSK
Sungmin Kim
 
Centralized log-management-with-elastic-stack
Centralized log-management-with-elastic-stack
Rich Lee
 
IBM Cloud Native Day April 2021: Serverless Data Lake
IBM Cloud Native Day April 2021: Serverless Data Lake
Torsten Steinbach
 
Kafka streams decoupling with stores
Kafka streams decoupling with stores
Yoni Farin
 
Centralized Logging System Using ELK Stack
Centralized Logging System Using ELK Stack
Rohit Sharma
 
IBM Cloud Day January 2021 Data Lake Deep Dive
IBM Cloud Day January 2021 Data Lake Deep Dive
Torsten Steinbach
 
ELK Ruminating on Logs (Zendcon 2016)
ELK Ruminating on Logs (Zendcon 2016)
Mathew Beane
 
Otimizações de Projetos de Big Data, Dw e AI no Microsoft Azure
Otimizações de Projetos de Big Data, Dw e AI no Microsoft Azure
Luan Moreno Medeiros Maciel
 
Databricks Platform.pptx
Databricks Platform.pptx
Alex Ivy
 
AWS re:Invent presentation: Unmeltable Infrastructure at Scale by Loggly
AWS re:Invent presentation: Unmeltable Infrastructure at Scale by Loggly
SolarWinds Loggly
 
DEVNET-1140 InterCloud Mapreduce and Spark Workload Migration and Sharing: Fi...
DEVNET-1140 InterCloud Mapreduce and Spark Workload Migration and Sharing: Fi...
Cisco DevNet
 
Instrumenting and Scaling Databases with Envoy
Instrumenting and Scaling Databases with Envoy
Daniel Hochman
 
Enabling Microservices Frameworks to Solve Business Problems
Enabling Microservices Frameworks to Solve Business Problems
Ken Owens
 
Logging, Metrics, and APM: The Operations Trifecta
Logging, Metrics, and APM: The Operations Trifecta
Elasticsearch
 
Fluentd at Bay Area Kubernetes Meetup
Fluentd at Bay Area Kubernetes Meetup
Sadayuki Furuhashi
 
Cloud Security Monitoring and Spark Analytics
Cloud Security Monitoring and Spark Analytics
amesar0
 
Big Data_Architecture.pptx
Big Data_Architecture.pptx
betalab
 
Choose Right Stream Storage: Amazon Kinesis Data Streams vs MSK
Choose Right Stream Storage: Amazon Kinesis Data Streams vs MSK
Sungmin Kim
 
Ad

Recently uploaded (20)

Indigo dyeing Presentation (2).pptx as dye
Indigo dyeing Presentation (2).pptx as dye
shreeroop1335
 
最新版美国加利福尼亚大学旧金山法学院毕业证(UCLawSF毕业证书)定制
最新版美国加利福尼亚大学旧金山法学院毕业证(UCLawSF毕业证书)定制
taqyea
 
Microsoft Power BI - Advanced Certificate for Business Intelligence using Pow...
Microsoft Power BI - Advanced Certificate for Business Intelligence using Pow...
Prasenjit Debnath
 
MRI Pulse Sequence in radiology physics.pptx
MRI Pulse Sequence in radiology physics.pptx
BelaynehBishaw
 
UPS and Big Data intro to Business Analytics.pptx
UPS and Big Data intro to Business Analytics.pptx
sanjum5582
 
Data Visualisation in data science for students
Data Visualisation in data science for students
confidenceascend
 
Communication_Skills_Class10_Visual.pptx
Communication_Skills_Class10_Visual.pptx
namanrastogi70555
 
The Influence off Flexible Work Policies
The Influence off Flexible Work Policies
sales480687
 
ppt somu_Jarvis_AI_Assistant_presen.pptx
ppt somu_Jarvis_AI_Assistant_presen.pptx
MohammedumarFarhan
 
最新版美国芝加哥大学毕业证(UChicago毕业证书)原版定制
最新版美国芝加哥大学毕业证(UChicago毕业证书)原版定制
taqyea
 
最新版美国威斯康星大学河城分校毕业证(UWRF毕业证书)原版定制
最新版美国威斯康星大学河城分校毕业证(UWRF毕业证书)原版定制
taqyea
 
Artigo - Playing to Win.planejamento docx
Artigo - Playing to Win.planejamento docx
KellyXavier15
 
Camuflaje Tipos Características Militar 2025.ppt
Camuflaje Tipos Características Militar 2025.ppt
e58650738
 
Model Evaluation & Visualisation part of a series of intro modules for data ...
Model Evaluation & Visualisation part of a series of intro modules for data ...
brandonlee626749
 
Residential Zone 4 for industrial village
Residential Zone 4 for industrial village
MdYasinArafat13
 
美国毕业证范本中华盛顿大学学位证书CWU学生卡购买
美国毕业证范本中华盛顿大学学位证书CWU学生卡购买
Taqyea
 
Starbucks in the Indian market through its joint venture.
Starbucks in the Indian market through its joint venture.
sales480687
 
PPT1_CB_VII_CS_Ch3_FunctionsandChartsinCalc.ppsx
PPT1_CB_VII_CS_Ch3_FunctionsandChartsinCalc.ppsx
animaroy81
 
Allotted-MBBS-Student-list-batch-2021.pdf
Allotted-MBBS-Student-list-batch-2021.pdf
subhansaifi0603
 
NVIDIA Triton Inference Server, a game-changing platform for deploying AI mod...
NVIDIA Triton Inference Server, a game-changing platform for deploying AI mod...
Tamanna36
 
Indigo dyeing Presentation (2).pptx as dye
Indigo dyeing Presentation (2).pptx as dye
shreeroop1335
 
最新版美国加利福尼亚大学旧金山法学院毕业证(UCLawSF毕业证书)定制
最新版美国加利福尼亚大学旧金山法学院毕业证(UCLawSF毕业证书)定制
taqyea
 
Microsoft Power BI - Advanced Certificate for Business Intelligence using Pow...
Microsoft Power BI - Advanced Certificate for Business Intelligence using Pow...
Prasenjit Debnath
 
MRI Pulse Sequence in radiology physics.pptx
MRI Pulse Sequence in radiology physics.pptx
BelaynehBishaw
 
UPS and Big Data intro to Business Analytics.pptx
UPS and Big Data intro to Business Analytics.pptx
sanjum5582
 
Data Visualisation in data science for students
Data Visualisation in data science for students
confidenceascend
 
Communication_Skills_Class10_Visual.pptx
Communication_Skills_Class10_Visual.pptx
namanrastogi70555
 
The Influence off Flexible Work Policies
The Influence off Flexible Work Policies
sales480687
 
ppt somu_Jarvis_AI_Assistant_presen.pptx
ppt somu_Jarvis_AI_Assistant_presen.pptx
MohammedumarFarhan
 
最新版美国芝加哥大学毕业证(UChicago毕业证书)原版定制
最新版美国芝加哥大学毕业证(UChicago毕业证书)原版定制
taqyea
 
最新版美国威斯康星大学河城分校毕业证(UWRF毕业证书)原版定制
最新版美国威斯康星大学河城分校毕业证(UWRF毕业证书)原版定制
taqyea
 
Artigo - Playing to Win.planejamento docx
Artigo - Playing to Win.planejamento docx
KellyXavier15
 
Camuflaje Tipos Características Militar 2025.ppt
Camuflaje Tipos Características Militar 2025.ppt
e58650738
 
Model Evaluation & Visualisation part of a series of intro modules for data ...
Model Evaluation & Visualisation part of a series of intro modules for data ...
brandonlee626749
 
Residential Zone 4 for industrial village
Residential Zone 4 for industrial village
MdYasinArafat13
 
美国毕业证范本中华盛顿大学学位证书CWU学生卡购买
美国毕业证范本中华盛顿大学学位证书CWU学生卡购买
Taqyea
 
Starbucks in the Indian market through its joint venture.
Starbucks in the Indian market through its joint venture.
sales480687
 
PPT1_CB_VII_CS_Ch3_FunctionsandChartsinCalc.ppsx
PPT1_CB_VII_CS_Ch3_FunctionsandChartsinCalc.ppsx
animaroy81
 
Allotted-MBBS-Student-list-batch-2021.pdf
Allotted-MBBS-Student-list-batch-2021.pdf
subhansaifi0603
 
NVIDIA Triton Inference Server, a game-changing platform for deploying AI mod...
NVIDIA Triton Inference Server, a game-changing platform for deploying AI mod...
Tamanna36
 

Case Study: Elasticsearch Ingest Using StreamSets @ Cisco Intercloud

  • 1. Case Study Elasticsearch Ingest @ Cisco Intercloud
  • 2. Agenda • Express Overview of StreamSets Data Collector Kirit Basu, Product Management, StreamSets • Introduction to Elastic CatherineJohnson, Solutions Architect, Elastic • Implementing Shipped Analytics Using StreamSets and Elasticsearch Dmitri Chtchourov, Innovation Architect, Cloud Solutions CTO Group Group
  • 4. © 2015 StreamSets, Inc. All rights reserved. May not be copied, modified, or distributed in whole or part without written consent of StreamSets, Inc. History Founded by Informatica and Cloudera veterans. Mission Bring operational excellence to managing data in motion. Challenge Move data efficiently and with quality in the face of change. Solution Open source software enabling performance management of data flows. Use cases Hadoop Ingest, Search Ingest, Message Broker Enablement, Log Shipping, Cloud Migration, IoT, ... Momentum Thousands of downloads, hundreds of companies using. StreamSets At a Glance
  • 5. © 2015 StreamSets, Inc. All rights reserved. May not be copied, modified, or distributed in whole or part without written consent of StreamSets, Inc. StreamSets Data Collector Adaptable Flows for Efficiency Design ingest pipelines with minimal coding and maximum flexibility. Data Flow KPIs for Control Monitor and act on data flow performance and data quality. Containerized Architecture for Agility Operate continuously in the face of constant change. Open source software for the rapid development and reliably operation of complex data flows.
  • 6. Get Started with StreamSets http://streamsets.com/opensource https://github.com/streamsets/datacollector/ #streamsets
  • 8. Software that makes massive amounts of structured and unstructured data usable for search, logging, analytics, and more in mission critical systems and applications
  • 9. Examples: Elastic Stack Use Cases Logging IT Operations Application Management Security Analytics Analytics Search Marketing Insights Business Development Customer Sentiment Website Search Internal/Intranet Search URL Search Internal Systems/Applications External Systems/Applications Developers IT/Ops Business Users
  • 10. Elastic Solves Many Developer Use Cases Social Location User- Activity Machine (Log files) Documents Handles Complex & Diverse Data Meets Today’s Core Developer Requirements Developer requirements Many users / use cases Fast data processing Large data volumes Data quality & integrity Cross-source insights Solves Critical Use Cases Application Search Embedded Search Logging Security Analytics Operational Analytics More …
  • 11. The Elastic Stack Ingest Store, Index, & Analyze User Interface Plugins Monitoring Security Alerting Elastic Cloud: Hosted Elasticsearch
  • 13. Implementing Shipped Analytics Using Streamsets and Elasticsearch Dmitri Chtchourov, Innovation Architect, Cloud Solutions CTO Group Tymofii Polekhin, Software Engineer
  • 14. Agenda MANTL & Shipped Shipped Analytics for Shipped Why we need Shipped Analytics? Archtecture and Data Flow Streamsets Pipelines End to end dataflow and performance with Elasticsearch Benefits of Streamsets Demo
  • 15. Microservices managed and scaled separately Microservices managed by Mesos in a single platform Microservices architecture for Mesos frameworks and other components CIS/AWS/Metastack/vSphere/UCS… Terraform Spark Executor N Spark Executor 1 Spark Scheduler Kafka Broker N Kafka Broker 1 Kafka Scheduler Docker Docker TraefikMicroservices … REST API REST API Scripted provisioning Direct provisioning Policy, Auto-scaling VM1 or BM1 VM2 or BM2 VM3 or BM3 VM4 or BM4 VM5 or BM5
  • 17. Shipped Analytics Cluster Probe Probe Probe • Both Shipped and Shipped Analytics running on MANTL • Shipped Analytics – infra and app logs and metrics analysis
  • 19. Infrastructure Layer Zookeeper Cluster Consul Cluster Mesos Cluster Marathon Framework Kafka Cluster topbeat filebeat journalbeat dockerbeat • Experimenting with Elastic Beats (unified arch., closer to micro-services model) • Elastic Beats to replace collectd plugins and cAdvisor for containers
  • 20. <file | top | *>beat collectd logstash DNS SRV beats.logstash.service.consul Data normalization Tagging Cluster name decoration Logstash is a single process per cluster, discoverable with standard inter-cluster discovery mechanism, which will get metrics from collectd on every slave and logs from filebeat on every slave, normalize data and send to desired output DNS SRV collectd.logstash.service.consul NOTE: currently Logstash is running in Docker container on every node, will be moving to Filebeat and Logstash mesos framework soon
  • 21. logstash Kafka 0.9.0.0 supports SSL authentication and data encryption for producers. This is must-have security when sending data to external destination through WAN. Sending data to central SA cluster for long-term analytics SSL encryption WAN kafka SSL authentication Shipped cluster Shipped Analytics
  • 22. StreamSets running in Mesos Spark Cluster mode processing data from multiple source Shipped clusters and storing it in Elasticsearch cluster. kafka elasticsearch Streamsets Spark Streaming Cluster Spark Job Master instance Spark Job Spark Job Spark Job
  • 23. Lambda Reference Architecture Monitoring / Analytics Cluster (local, Texas-3) Global Monitoring / Analytics Cluster (global, Texas-1) Monitoring / Analytics Cluster (local, Ams. -1 ) Monitoring / Analytics Cluster (local, Lon.-1) Local components and deployment is the same as global, just smaller Real-time and batch processing (Lambda), anomaly detection, visualization SSL Kafka SSL SSL MQTT
  • 24. Divide nodes by role for more stable cluster operation and ease of scalability 3 master/search nodes 5 live data nodes 3 archive data nodes master/ search master/ search master/ search live/ data live/ data live/ data live/ data live/ data archive /data archive /data archive /data Shards=5 Replicas=4 Shards=5 Replicas=1 archive /data archive /data CPU=4 RAM=30GB HDD=4TB CPU=4 RAM=30GB HDD=4TB CPU=4 RAM=30GB HDD=4TB
  • 25. Streamsets pipelines process incoming messages and transform them according to business logic requirements, normalizing metrics and parsing log lines; popping up important information using GROK filters or scripts. Cluster Name Decorator Fields Type Normalization Metrics/Logs Stream Splitter ES Logs Output General GROK Filters Float Value Truncate ES Metrics Output Shipped GROK Logic
  • 26. Marathon • Streamsets instances running in docker containers in Marathon o Easy deployment and scaling o Fast upgrade to newer version • Issues we faced with this approach: o Containers were killed by marathon o Needed to re-import pipeline every time we launch container
  • 27. Marathon • Working with Streamsets trying to resolve the OOM issue we increased container memory and SDC heap size • At first, all looked normal and we thought that it was just starving on resources, but several days later we had SDC killed again • We increased MEM and HEAP even more – to 16G, but we bought just another day or two before is was killed again • Looked like SDC heap were constantly filling with data that don’t go away and eventually it kills the container • Also GC was working hard and sometimes we got freezes up to 60 seconds • Decided to move out from Docker
  • 28. Marathon • Streamsets reading JSON messages from Kafka cluster and output to Elasticsearch cluster o De-serializing and serializing JSON was very slow with single threaded process o Consuming from Kafka performance test showed:  JSON format: 5k records/sec avg  Text format: 50k records/sec avg  Binary format: 250k records/sec avg • Streamsets team were very proactive with this issues and in 2 days we received a fix for multi-threaded JSON parsing o New testing showed:  JSON format: 66k records/sec avg
  • 29. Marathon • Streamsets has never failed because of any internal logic bugs but we kept seeing this oom-killer popping up and recovering was not automated • We decided to leave docker and run SDC natively on host, still using Marathon for scaling and failover • Without docker, we now can upload our pipeline on SDC startup, and it will start working as soon as instance has loaded We can freely scale up/down whenever we need Also, we got rid of oom-killer issue as well
  • 30. Each one of our 3 SDC instances already processes ~3B messages, with no issues!
  • 31. • Streamsets pipeline consume metrics gathered by collectd and logs gathered by logstash from 4 different clusters (including self), transform and decorate them and send to Elasticsearch for storage and analytics. • First of all we consume messages from Kafka topic at average of 5,000 messages per second. The consumer itself parses JSON-format and sends further. • Next stage is a JavaScript script that decorates messages with cluster name, based on a instance hostname in that message • Finally, we exclude Marathon events from stream sending them directly to ES
  • 32. • Next stage will splits stream into 2 parts: logs and metrics • Metrics are send straight to ES without any transformation • Logs are the most interesting part: o We pop docker container logs from stream and delete “time” field that’s duplicate timstamp and sending them to ES o We separate logs from specific clusters, because we need to apply special logic for them o Separation is done though mapping IP’s to clusters in the pipeline realtime
  • 33. • Collecting data from several Mesos clusters and need to correlate container metrics with it’s logs • Use appID taskID and runID to identify specific containers logs • Container logs itself have all three of this, while mesos- master and mesos-agent logs lacks runID • All unidentified data is discarded
  • 34. Current ShippedAnalytics prod cluster configuration: Kafka Cluster: 7 brokers with 4CPU and 16GB RAM each Logstash topic for all incoming messages with 7 partitions and 2 replicas Current data flow is avg 5000 messages/sec to Kafka Current data size is avg 1,2MB/sec to Kafka Streamsets: 3 instances with identical pipeline configuration reading from Kafka cluster 7 partitions are split between 3 instances like 3/2/2 All 3 instances running natively on host (non-docker) with Marathon Marathon restarts failed instance with automatic pipeline upload and start Elasticsearch: 7 nodes with 4CPU, 16GB RAM and 2TB storage each Each metrics is written to its own index, total of 15 indexes Each index has 5 primary shards and 5 replica shards Total Doc count: 17,5B Total Doc size: 3.84TB 1 Day rate count: ~500M 1 Day rate size: ~120GB
  • 35. Streamsets is a great product to work with, also team is super helpful and works fast • Lots of input and output connectors, huge processing capabilities • Very intuitive and rich User Interface • Easy to create pipelines visually, instead of writing code • Clear data flow paths • Small resource consumption compared to performance • Easily can handle up to 10k records/sec to Elasticsearch with 1CPU 2GB RAM • Simple configuration and deployment process • Opensource(!) • Fast logic changes with minimum downtime • Preview mode(!) – check every stage before throwing all your data it • Rich data transformation possibilities • GROK filters – easy to migrate from Logstash • Smart Errors handling • Reliable: not once did Streamets crashed by itself – only Docker, Marathon, Mesos issues