Christos Erotocritou, GridGain Systems
Fast Data with Apache
Ignite & Apache Spark
#EUstr10
© 2017 GridGain Systems, Inc.
#EUstr10 SPARK SUMMIT
Apache Ignite is a distributed, memory-centric data platform with powerful & flexible processing APIs
Apache Ignite Memory-Centric Data Platform
Applications: IoT, Financial Services, Pharma & Healthcare, E-Commerce, Travel & Logistics, Telco
SQL, Transactions, Compute, IgniteRDD, ML, Streaming, Key/Value
Ignite Memory-Centric Storage
Ignite Native Persistence (Flash, SSD, Intel 3D XPoint)
Third-Party Persistence (RDBMS, HDFS, NoSQL)
Memory-Centric Storage
Pure Ignite Deployment
Clients: Payments, Securities, Risk, Trading
Front-End APIs: SQL, TX, Compute, Ignite RDD, Key/Value
Ignite Cluster: Data Caches / Tables on Durable Memory
• Applications in Java, .NET & C++
• Wide Range of Data Access and Processing APIs
• Shared Storage across Apps & Support for Multi-Tenancy
• Disk & Memory Data Storage
Durable Memory
Ignite Server Cluster: three Server Nodes, each with Durable Memory, on Memory-Centric Storage
• Off-heap storage removes noticeable GC pauses
• Automatic Defragmentation
• Stores Superset of Data
• Predictable memory consumption
• Fully Transactional (Write-Ahead Log)
• Instantaneous Restarts
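The write-ahead log mentioned on this slide can be illustrated with a generic, heavily simplified sketch: every update is appended to a log before it is applied in memory, so state can be rebuilt from the log after a restart. This is plain Scala and not Ignite's implementation (which also relies on checkpointing, not shown here); the names WalSketch, Put and Node are hypothetical.

```scala
import scala.collection.mutable

// Plain-Scala illustration of write-ahead logging; no Ignite classes are used.
object WalSketch {
  // A logged update: the key and its new value.
  final case class Put(key: String, value: Int)

  // Hypothetical node: `log` stands in for durable storage, `memory` for volatile state.
  final class Node(val log: mutable.ArrayBuffer[Put] = mutable.ArrayBuffer.empty[Put]) {
    val memory: mutable.Map[String, Int] = mutable.Map.empty[String, Int]

    // Append to the log first, then apply the change in memory.
    def put(key: String, value: Int): Unit = {
      log += Put(key, value)
      memory(key) = value
    }

    // After a restart, rebuild in-memory state by replaying the log in order.
    def recover(): Unit = log.foreach(p => memory(p.key) = p.value)
  }

  def main(args: Array[String]): Unit = {
    val node = new Node()
    node.put("a", 1)
    node.put("a", 2)
    node.put("b", 7)
    val restarted = new Node(node.log) // same log, empty memory
    restarted.recover()
    println(restarted.memory.toMap)
  }
}
```

Replaying the whole log is the simplest possible recovery strategy; it is shown only to make the ordering (log first, memory second) concrete.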
Apache Ignite Features
JCache, Compute, Transactions, Scan & Text Queries, SQL, JDBC & ODBC, Streaming, Services
Java, .NET, C++, PHP, BI Tools, Memcached, REST
Distributed Memory-Centric Storage: Server Nodes, each with Durable Memory
Dynamic Scaling
Client-Server Processing
1. Initial request
2. Fetch data from remote nodes
3. Process the entire data set
(Diagram: a Processing Node and two Data Nodes holding Data 1 and Data 2.)
Co-located Processing
1. Initial request
2. Co-locate processing with data
3. Reduce multiple results into one
(Diagram: a Client Node and two Data & Processing Nodes.)
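The difference between the two processing styles can be sketched with a small simulation. This is a plain-Scala illustration under stated assumptions, not Ignite's compute API: the "cluster" is just a vector of per-node data sets, and CoLocationSketch, clientServerSum and coLocatedSum are hypothetical names. Each function returns the result together with the number of items that had to travel to the caller.

```scala
// Plain-Scala simulation; no Ignite classes are used.
object CoLocationSketch {
  // Hypothetical stand-in for a cluster: each inner Vector is the data held by one node.
  type Cluster = Vector[Vector[Int]]

  // Client-server style: pull every record to the caller, then process the whole data set there.
  // Returns (result, number of records moved to the caller).
  def clientServerSum(cluster: Cluster): (Long, Int) = {
    val fetched = cluster.flatten             // step 2: fetch data from remote nodes
    (fetched.map(_.toLong).sum, fetched.size) // step 3: process the entire data set
  }

  // Co-located style: each node reduces its own data, so only partial results travel.
  // Returns (result, number of partial results moved to the caller).
  def coLocatedSum(cluster: Cluster): (Long, Int) = {
    val partials = cluster.map(node => node.map(_.toLong).sum) // step 2: compute where the data lives
    (partials.sum, partials.size)                               // step 3: reduce multiple results into one
  }

  def main(args: Array[String]): Unit = {
    val cluster: Cluster = Vector(Vector(1, 2, 3), Vector(4, 5), Vector(6))
    println(clientServerSum(cluster)) // (21,6)
    println(coLocatedSum(cluster))    // (21,3)
  }
}
```

Both approaches produce the same sum; the co-located variant moves one partial result per node instead of every record, which is the point the slide makes.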
Hadoop, Spark & Ignite Deployment
Ignite Clients (SQL & Compute API), Spark Clients
DB, File Exports
Kafka Data Streamer, Ignite Data Streamer
Server Nodes: three Hadoop Data Nodes, each with a Spark App and an IgniteRDD
Apache Ignite Spark Integration
Spark Application: three Spark Workers, each running two Spark Jobs; three Ignite Nodes
Yarn, Mesos, Docker, HDFS
In-Memory Shared RDD or DataFrame
• Share RDD across jobs on the host
• Share RDD Globally
• In-Memory Indexes
• SQL on top of RDDs
Working with IgniteRDD
• IgniteContext is the main entry point to Spark-Ignite integration:
// Ignite 1.x signature as shown on the slide; in Ignite 2.x the type parameters move to fromCache[K, V]
val igniteContext = new IgniteContext[Integer, Integer](sparkContext, () => new IgniteConfiguration())
• Reading values from Ignite:
val cache = igniteContext.fromCache("myRdd")
// contains("Ignite") assumes String values; with the Integer-valued context above, use a numeric predicate instead
val result = cache.filter(_._2.contains("Ignite")).collect()
• Saving values to Ignite:
val cacheRdd = igniteContext.fromCache("myRdd")
cacheRdd.savePairs(sparkContext.parallelize(1 to 10000, 10).map(i => (i, i)))
• Running SQL queries against Ignite Cache:
val cacheRdd = igniteContext.fromCache("myRdd")
// _val is the predefined column holding the cache value, so the predicate must reference _val
val result = cacheRdd.sql("select _val from Integer where _val > ? and _val < ?", 10, 100)
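Working with DataFrame API
• Create an IgniteRDD
val companyCacheIgnite = new IgniteContext[Int, String](sc, () => new IgniteConfiguration()).fromCache("CompanyCache")
• Create a “Company” DataFrame
// Company is assumed to be a two-field case class defined elsewhere; it is not shown on the slide
val dfCompany = sqlContext.createDataFrame(companyCacheIgnite.map(p => Company(p._1, p._2)))
• Register DataFrame as a table
// registerTempTable is deprecated since Spark 2.0 in favor of createOrReplaceTempView
dfCompany.registerTempTable("company")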
Apache Spark
– Ingests data from HDFS or another distributed file system
– Inclined towards analytics (OLAP) and focused on MR-specific payloads
– Requires the creation of an RDD; data and processing operations are governed by it
– Basic disk-based SQL support
– Strong ML libraries
– Big community
Apache Ignite
– Data source agnostic
– Fully fledged compute engine and durable storage
– OLAP & OLTP
– Zero-deployment
– In-Memory SQL support
– Fully ACID transactions across memory and disk
– Less focused on Hadoop
– Early ML Support
– Growing Community
What is GridGain?
• Binary build of Apache Ignite™
• Added features for enterprise deployments
• Features and bug fixes available a few weeks earlier
• Fully certified & tested releases
“We develop and support the world's leading In-Memory Computing Platform”
Thank you for joining us. Follow the conversation.
http://ignite.apache.org
Any Questions?
Fast Data with Apache Ignite and Apache Spark with Christos Erotocritou

  • 1. Fast Data with Apache Ignite & Apache Spark. Christos Erotocritou, GridGain Systems. #EUstr10
  • 2. © 2017 GridGain Systems, Inc. #EUstr10 SPARK SUMMIT. Apache Ignite is a distributed, memory-centric data platform with powerful and flexible processing APIs.
  • 3. Apache Ignite Memory-Centric Data Platform
    – Applications: IoT, Financial Services, Pharma & Healthcare, E-Commerce, Travel & Logistics, Telco
    – SQL, Transactions, Compute, IgniteRDD, ML, Streaming, Key/Value
    – Ignite Memory-Centric Storage
    – Ignite Native Persistence (Flash, SSD, Intel 3D XPoint)
    – Third-Party Persistence (RDBMS, HDFS, NoSQL)
  • 4. Memory-Centric Storage
  • 5. Pure Ignite Deployment
    – Clients: Payments, Securities, Risk, Trading
    – Front-End APIs: SQL, TX, Compute, Ignite RDD, Key/Value
    – Ignite Cluster: Data Caches / Tables on Durable Memory
    – Applications in Java, .NET & C++
    – Wide Range of Data Access and Processing APIs
    – Shared Storage across Apps & Support for Multi-Tenancy
    – Disk & Memory Data Storage
  • 6. Durable Memory (Memory-Centric Storage in an Ignite Server Cluster; each Server Node holds Durable Memory)
    – Off-heap, removes noticeable GC pauses
    – Automatic Defragmentation
    – Stores Superset of Data
    – Predictable memory consumption
    – Fully Transactional (Write-Ahead Log)
    – Instantaneous Restarts
  • 7. Apache Ignite Features
    – Java, .NET, C++, PHP, BI Tools, Memcached, REST
    – JCache, Compute, Transactions, Scan & Text Queries, SQL, JDBC & ODBC, Streaming, Services
    – Distributed Memory-Centric Storage: Server Nodes with Durable Memory
    – Dynamic Scaling
  • 8. Client-Server Processing vs. Co-located Processing
    – Client-Server Processing (one Processing Node, separate Data Nodes):
      1. Initial request
      2. Fetch data from remote nodes
      3. Process the entire data set
    – Co-located Processing (one Client Node, combined Data & Processing Nodes):
      1. Initial request
      2. Co-locate processing with data
      3. Reduce multiple results into one
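The contrast on slide 8 can be sketched with plain Scala collections. This is only a toy model, not Ignite's API: each "node" is an in-memory `Vector`, and the names `ColocatedProcessingSketch`, `clientServerSum` and `colocatedSum` are made up for illustration. Each function returns the result together with the number of items that would have to travel to the caller.

```scala
// Toy model (assumption): a "node" is just an in-memory partition of the data set.
object ColocatedProcessingSketch {
  type Partition = Vector[Int]

  // Client-server style: pull every partition to the caller, then process everything there.
  // Returns (sum, number of elements moved to the caller).
  def clientServerSum(nodes: Seq[Partition]): (Long, Int) = {
    val fetched = nodes.flatten                 // step 2: fetch data from remote nodes
    (fetched.map(_.toLong).sum, fetched.size)   // step 3: process the entire data set
  }

  // Co-located style: each node sums its own partition; only partial results move.
  // Returns (sum, number of partial results moved to the caller).
  def colocatedSum(nodes: Seq[Partition]): (Long, Int) = {
    val partials = nodes.map(p => p.map(_.toLong).sum) // step 2: process where the data lives
    (partials.sum, partials.size)                      // step 3: reduce multiple results into one
  }

  def main(args: Array[String]): Unit = {
    val nodes = Seq(Vector(1, 2, 3), Vector(4, 5), Vector(6, 7, 8, 9))
    println(clientServerSum(nodes))
    println(colocatedSum(nodes))
  }
}
```

Both functions produce the same sum; the difference is that the co-located variant moves one partial result per node instead of every element.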
  • 9. Hadoop, Spark & Ignite Deployment
    – Ignite Clients: SQL & Compute API, DB File Exports, Kafka Data Streamer, Ignite Data Streamer
    – Spark Clients: three Hadoop Data Nodes, each running a Spark App
    – Server Nodes: IgniteRDD, IgniteRDD, IgniteRDD
  • 10. Apache Ignite Spark Integration
    – Spark Application: three Spark Workers, each running Spark Jobs
    – Yarn, Mesos, Docker, HDFS
    – Ignite Nodes: In-Memory Shared RDD or DataFrame
    – Share RDD across jobs on the host
    – Share RDD Globally
    – In-Memory Indexes
    – SQL on top of RDDs
  • 11. Working with IgniteRDD
    – IgniteContext is the main entry point to Spark-Ignite integration:
        import org.apache.ignite.configuration.IgniteConfiguration
        import org.apache.ignite.spark.IgniteContext
        val igniteContext = new IgniteContext[Integer, Integer](sparkContext, () => new IgniteConfiguration())
    – Reading values from Ignite:
        val cache = igniteContext.fromCache("myRdd")
        // contains(...) presumes String values, although the context above is typed [Integer, Integer]
        val result = cache.filter(_._2.contains("Ignite")).collect()
    – Saving values to Ignite:
        val cacheRdd = igniteContext.fromCache("myRdd")
        cacheRdd.savePairs(sparkContext.parallelize(1 to 10000, 10).map(i => (i, i)))
    – Running SQL queries against Ignite Cache:
        val cacheRdd = igniteContext.fromCache("myRdd")
        // _val is Ignite's predefined SQL column for the cache value
        val result = cacheRdd.sql("select _val from Integer where _val > ? and _val < ?", 10, 100)
  • 12. Working with DataFrame API
    – Create an IgniteRDD:
        val companyCacheIgnite = new IgniteContext[Int, String](sc, () => new IgniteConfiguration()).fromCache("CompanyCache")
    – Create a "Company" DataFrame:
        val dfCompany = sqlContext.createDataFrame(companyCacheIgnite.map(p => Company(p._1, p._2)))
    – Register DataFrame as a table:
        dfCompany.registerTempTable("company")
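Slide 12 uses `Company(...)` without showing its definition. A minimal sketch of what it might look like follows, assuming a two-field case class whose field names (`id`, `name`) are hypothetical. Spark's `createDataFrame` derives the column names from the case class fields, so the choice of names determines what the registered `company` table exposes. The mapping is applied to a plain `Seq` here so the block runs without Spark or Ignite.

```scala
object CompanyRowSketch {
  // Hypothetical row type: the slide does not show the real definition or its field names.
  final case class Company(id: Int, name: String)

  // Same mapping as on the slide, applied to a plain Seq of (key, value) pairs
  // instead of an IgniteRDD.
  def toCompanies(pairs: Seq[(Int, String)]): Seq[Company] =
    pairs.map(p => Company(p._1, p._2))

  def main(args: Array[String]): Unit =
    toCompanies(Seq(1 -> "Acme", 2 -> "Globex")).foreach(println)
}
```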
  • 13. Spark:
    – Ingests data from HDFS or another distributed file system
    – Inclined towards analytics (OLAP) and focused on MR-specific payloads
    – Requires the creation of an RDD, which governs data and processing operations
    – Basic disk-based SQL support
    – Strong ML libraries
    – Big community
    Ignite:
    – Data source agnostic
    – Fully fledged compute engine and durable storage
    – OLAP & OLTP
    – Zero-deployment
    – In-Memory SQL support
    – Fully ACID transactions across memory and disk
    – Less focused on Hadoop
    – Early ML support
    – Growing community
  • 14. What is GridGain?
    – Binary build of Apache Ignite™
    – Added enterprise features for enterprise deployments
    – Features and bug fixes available a few weeks earlier
    – Fully certified & tested releases
    "We develop and support the world's leading In-Memory Computing Platform"
  • 15. Thank you for joining us. Follow the conversation: http://ignite.apache.org
    Any questions?