Memory Speed Big Data Analytics: Alluxio vs Apache Ignite
Irfan Elahi - Deloitte

About Me
• Working as a Consultant at Deloitte (Analytics Service Line)
• 4+ years of experience in Big Data and Machine Learning across multiple verticals
• Recent Deloitte projects at Australia’s biggest telecom company:
  • Architecting one of the largest Hadoop deployments in the cloud, employing in-memory computation technologies
  • Developing an enterprise-grade stream processing system based on the Hadoop stack, employing NoSQL data stores
• Premium Udemy Instructor with 12,000+ students from 131 countries
• Technical Reviewer of an upcoming Hadoop book published by Apress
• Let’s connect: ielahi@deloitte.com.au | linkedin.com/in/irfanelahi | @elahi_irfan

Agenda
• Big Data – The Evolution and Beyond
• In-Memory Computation Trend – Overview
• Unaddressed Challenges in Big Data
• Introduction and Deep Dive Comparison of Alluxio and Apache Ignite

Big Data: The Evolution and Beyond
Scale-up
• Make individual expensive machines bigger
• Challenges:
  • No collocated processing
  • Limited scalability
Scale-out
• Add more machines
• Challenges:
  • Programming complexity
  • Partial failures
Hadoop @ MapReduce
• Scalable, economical and fault-tolerant processing and storage
• Fit for offline computation
• Disk I/O bound
Hadoop @ Spark
• Distributed in-memory computation framework
• Fit for offline + online computation
• Many challenges remain unaddressed
Memory Centric Platforms
• Enable advanced caching and memory management features to optimize performance and address many of these challenges
Name of the game:
Memory is the new disk!

In-Memory Trend: Overview
Driving Factor: Economics
• Memory is much faster than disk (approx. 3000x)
• Cost of memory is decreasing
• Memory per node is increasing
Challenges:
• Memory is still more expensive than disk (approx. 80x)
• Memory is still limited
• Not all data is memory-worthy
and that’s not all…
Driving Factor: Limitations of Traditional Paradigms
Intermittent disk I/O and serialization cost in traditional computing platforms (e.g. MapReduce) cause:
• High latency
• Inefficiency in executing iterative algorithms in analytics (e.g. machine learning, graph and network analysis)
• Inefficiency in interactive data mining
• Infeasibility for innovative use-cases like stream processing
Impact: Innovative technologies and processing patterns
• New processing patterns: Batch -> Event Driven
• New processing technologies: MapReduce -> Spark, Hive -> Impala
• New storage technologies: HDFS -> Alluxio | IGFS
• New use-cases: Real-time stream processing, IoT

Un-addressed Challenges: Overview
On-Heap Memory Constraints:
• On-heap memory in memory-centric platforms (e.g. Spark) is limited, causing resource pressure
• Data resilience is compromised in the event of application crashes, which causes expensive disk I/O
• Inter-process data/state sharing still relies on HDFS I/O, causing performance issues
Many Compute to Storage Integration Paradox:
• Managing an increasing number of compute and storage platforms increases complexity
• Adding or removing systems requires application changes, impacting the DevOps lifecycle
• Data locality gets compromised
Missing SQL and Transactional Support on Hadoop:
Many leading Big Data platforms still don’t support:
• ACID-compliant transactions
• ANSI SQL compliance
• Indexing
• In-place mutation

Potential Missing Pieces of the Puzzle:
• Alluxio
• Apache Ignite

Alluxio
Distributed and scalable storage, virtualized across multiple storage systems under a unified namespace, to facilitate data access at memory speed
• Launched in 2012 by UC Berkeley AMPLab
• Formerly known as Tachyon
• Licensed under Apache License 2.0
• Approximately 500 contributors
• Deployed at Yahoo, Baidu, Intel and Samsung, to name a few

Ignite
A distributed key-value store and scalable in-memory computing platform with powerful SQL, key-value and processing APIs
• First release in early 2015
• August 2015: second-fastest project to graduate after Spark
• Licensed under Apache License 2.0
• Approximately 120 contributors
• Deployed at IBM, Siemens, Citibank, Barclays and Nielsen, to name a few
Deep Dive Comparison
• Alluxio
• Apache Ignite

Architecture
[Figure: an Alluxio cluster with master node(s) and worker node(s); an Apache Ignite cluster of servers organized into logical groups A and B]
Alluxio
Master Nodes:
• Manage file system metadata
• Can be primary or secondary
• HA supported via ZooKeeper
Worker Nodes:
• Store data in the form of blocks
• No rebalancing of blocks upon addition of new nodes
• Send heartbeats to Master Nodes
Requires an Under File System (UFS) (e.g. HDFS, S3) for operation
Apache Ignite
Optional Node Roles:
• Servers (default | equal by design | multiple servers on one host)
• Clients (explicitly defined | connect to servers for computation)
Logical Grouping:
• User-configurable node roles via attributes registered by nodes at start-up
• Registered attributes can be leveraged for dynamic logical grouping based on predicates (e.g. CPU Utilization > 50%) for localizing processes and jobs
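The logical grouping idea can be illustrated with a small, self-contained Scala model. This is only a sketch under stated assumptions: `Node`, `ClusterGroups` and the sample attributes are hypothetical stand-ins, and the function names loosely mirror the notion of attribute and predicate filters rather than reproducing Ignite's API.

```scala
// Hypothetical model of a cluster node: start-up attributes plus one runtime metric.
final case class Node(id: String, attributes: Map[String, String], cpuUtilization: Double)

object ClusterGroups {
  // A logical group is simply the subset of nodes that satisfies a predicate.
  def forPredicate(nodes: Seq[Node])(p: Node => Boolean): Seq[Node] = nodes.filter(p)

  // Group by a user-defined attribute registered at start-up, e.g. "group" -> "A".
  def forAttribute(nodes: Seq[Node], name: String, value: String): Seq[Node] =
    forPredicate(nodes)(_.attributes.get(name).contains(value))
}

object ClusterGroupsDemo {
  // Sample topology (assumed values, for illustration only).
  val nodes: Seq[Node] = Seq(
    Node("n1", Map("group" -> "A"), 0.72),
    Node("n2", Map("group" -> "B"), 0.35),
    Node("n3", Map("group" -> "A"), 0.10)
  )

  // Static grouping by attribute and dynamic grouping by a metric predicate.
  val groupA: Seq[Node] = ClusterGroups.forAttribute(nodes, "group", "A")
  val busy: Seq[Node] = ClusterGroups.forPredicate(nodes)(_.cpuUtilization > 0.5)
}
```

Because the predicate is evaluated against the current node list, a metric-based group such as `busy` changes as node metrics change, which is what makes the grouping dynamic.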
No Name-Node Architecture:
• When used as IGFS, no centralized metadata management (such as the HDFS NameNode or Alluxio’s Master Nodes) is needed
• Hashing is used to determine data locality
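One way to locate data by hashing, without any central metadata node, is rendezvous (highest-random-weight) hashing. The sketch below assumes that scheme purely for illustration; the node names and the `ownerOf` helper are hypothetical, and this is not presented as Ignite's implementation.

```scala
import scala.util.hashing.MurmurHash3

object RendezvousHashing {
  // Deterministic weight for a (key, node) pair; every client computes the same value.
  private def weight(key: String, node: String): Int =
    MurmurHash3.stringHash(key + "|" + node)

  // The owner of a key is the node with the highest weight for that key.
  // No lookup table or master is needed: the node list alone is enough.
  def ownerOf(key: String, nodes: Seq[String]): String = {
    require(nodes.nonEmpty, "at least one node is required")
    nodes.maxBy(node => weight(key, node))
  }
}
```

A useful property of this scheme is that removing a node only moves the keys that node owned; every other key keeps its owner, so most data stays where it is when the topology changes.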

Architecture (Continued)
Alluxio
Configuration:
• Requires explicitly specifying the Master Node(s) and Worker Node(s) in the configuration files
• Addition of nodes requires restarting the cluster
Interfaces:
• Alluxio Shell
• Web interface (also allows browsing the Alluxio file system)
Apache Ignite
Configuration:
• Doesn’t require explicit specification of nodes in configuration files
• Nodes discover each other automatically when started
• Supported methods for node discovery:
  • Multicast
  • Static IP based
• Cluster can be scaled without restarting
• Supported deployment modes: Shared or Embedded
Interfaces:
• Visor Command Line: for viewing topology, node metrics and cache statistics, and for administering the cluster
• Web interface: needs to be installed separately

Architecture (Continued)
[Figure: screenshots of the Alluxio Shell (Alluxio) and the Visor CommandLine (Apache Ignite)]
[Figure: screenshot of the Alluxio Web Interface]

Integration with Data Stores
[Figure: Alluxio, exposed as a file system, sits between compute frameworks (Spark, MapReduce, Flink, …) and under-storage (S3, HDFS, Blob, …). Ignite (in-memory) sits between compute (Spark/MapReduce/Flink, SQL, Streaming, Machine Learning, …) and either native persistence (disk) or 3rd-party persistence (HDFS, Cassandra, …).]
Apache Ignite
• Provides two modes of persistence in addition to in-memory:
  • Native persistence (disk only)
  • 3rd-party persistence (pluggable)
Native persistence:
• Treats disk as the store for the superset of data
• Supports SSD, Flash and 3D XPoint storage
• Features like ACID compliance and SQL are supported only in this mode
3rd-party persistence:
• Data stores such as HDFS, Cassandra and JDBC-based databases are pluggable
• Involves implementing the CacheStore interface for read-through/write-through
• Supports write-behind caching for improved performance
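The read-through/write-through pattern behind Ignite's 3rd-party persistence can be sketched in a few lines of plain Scala. Everything here is an assumption made for illustration: `BackingStore`, `MapStore` and `ThroughCache` are hypothetical names, and the trait is a simplified stand-in, not Ignite's CacheStore interface.

```scala
import scala.collection.mutable

// Hypothetical stand-in for an external store (HDFS, Cassandra, a JDBC database, ...).
trait BackingStore[K, V] {
  def load(key: K): Option[V]
  def write(key: K, value: V): Unit
}

// In-memory fake store that counts how often it is read, for demonstration only.
final class MapStore[K, V] extends BackingStore[K, V] {
  private val data = mutable.Map.empty[K, V]
  var loads = 0
  def load(key: K): Option[V] = { loads += 1; data.get(key) }
  def write(key: K, value: V): Unit = data.update(key, value)
}

final class ThroughCache[K, V](store: BackingStore[K, V]) {
  private val memory = mutable.Map.empty[K, V]

  // Read-through: on a miss, load from the store and keep the value in memory.
  def get(key: K): Option[V] = memory.get(key).orElse {
    val loaded = store.load(key)
    loaded.foreach(value => memory.update(key, value))
    loaded
  }

  // Write-through: update memory and the backing store in the same call.
  def put(key: K, value: V): Unit = {
    memory.update(key, value)
    store.write(key, value)
  }
}
```

A write-behind variant would queue the `store.write` calls and flush them asynchronously in batches, trading a window of possible data loss for lower write latency.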
Alluxio
• Enables processing frameworks to interact with data from different data stores through a unified namespace and API
• Supports the following data stores as UFS: HDFS, Blob, S3, GCS, Minio, Ceph, Swift and MapR-FS, to name a few
• The process involves mounting different UFS instances at different mount points in the Alluxio namespace and then accessing them seamlessly from applications
• Addition of more UFS storage is configurable
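The mount-point mechanism described for Alluxio can be modeled as a lookup table that translates a path in the unified namespace into a URI in the underlying store. This is a minimal sketch, assuming longest-prefix matching; `MountTable`, the host names and the bucket name are hypothetical and do not come from Alluxio itself.

```scala
// Hypothetical mount table: namespace mount points mapped to under-storage URIs.
final class MountTable(mounts: Map[String, String]) {
  // Longest mount point first, so a nested mount wins over its parent.
  private val ordered: Seq[(String, String)] =
    mounts.toSeq.sortBy { case (mountPoint, _) => -mountPoint.length }

  // Translate a path in the unified namespace to the URI in the under file system.
  def resolve(path: String): Option[String] =
    ordered.collectFirst {
      case (mountPoint, ufsUri)
          if path == mountPoint || path.startsWith(mountPoint.stripSuffix("/") + "/") =>
        ufsUri.stripSuffix("/") + path.substring(mountPoint.stripSuffix("/").length)
    }
}

object MountTableDemo {
  // Assumed example mounts: the root on HDFS and one directory on S3.
  val table = new MountTable(Map(
    "/" -> "hdfs://namenode:8020/alluxio",
    "/s3data" -> "s3://example-bucket/data"
  ))
}
```

With this table an application only ever sees paths such as `/s3data/logs/a.txt`; which storage system actually holds the bytes is decided by the mount table, so adding a store means adding a mount rather than changing application code.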

Integration with Spark
Alluxio
• Alluxio provides Hadoop-compatible file system APIs, so data can be read and written via Spark RDD’s file system related APIs
• Enables reading and writing data from different data stores (configured as UFS) via Alluxio’s unified namespace and API
• Automatically manages movement of data persisted in Alluxio or UFS
Apache Ignite
Two ways to integrate with Spark:
• As stand-alone IGFS or as a caching layer on HDFS:
  • Ignite is exposed as HDFS, so data can be read and written via Spark RDD’s file system related APIs
• As a distributed cache via IgniteContext:
  • Provides an implementation of Spark RDDs (supporting all transformations and actions)
  • Mutable RDDs (a view over the distributed cache’s content)
  • Configurable lifespan depending upon Ignite’s deployment mode
[Figure: With Alluxio (exposed as a file system over S3, HDFS, Blob), Spark reads via the RDD read-from-FileSystem API, applies transformations, and writes via the RDD save-to-file-system API. With Ignite IGFS (backed by disk, …), RDDs read/write to/from the FileSystem API. With Ignite as a distributed cache, IgniteContext yields an IgniteRDD [Tuple2], transformations are applied, and results are written back with the savePairs action.]

Integration with Spark (Continued)
Alluxio
// reading data:
val textRdd = sc.textFile("alluxio://masternode:19998/path")
// transformations:
val textRdd2 = textRdd.filter(_.contains("deloitte"))
// writing data:
textRdd2.saveAsTextFile("alluxio://masternode:19998/destination_path")
Apache Ignite
import org.apache.ignite.configuration.IgniteConfiguration
import org.apache.ignite.spark.{IgniteContext, IgniteRDD}
// creating IgniteContext
val igniteContext = new IgniteContext(sparkContext, () => new IgniteConfiguration())
// creating IgniteRDD (a view over the cache named "deloitte_cache")
val cacheRdd: IgniteRDD[Integer, String] = igniteContext.fromCache("deloitte_cache")
// transformations (the result is a plain Spark RDD of key-value pairs):
val cacheRdd2 = cacheRdd.filter(_._2.contains("deloitte"))
// writing data: savePairs is called on the IgniteRDD and takes the pairs to store
cacheRdd.savePairs(cacheRdd2)

Memory Architecture
Alluxio
[Figure: Alluxio tiers ordered by performance: MEM, SSD, HDD]
Alluxio storage is divided into three ordered tiers: MEM (memory), SSD and HDD.
• Allows storing more data than the available memory in the cluster
• Automatically manages data between tiers
• Data is written to the top tier by default
Apache Ignite
[Figure: memory, layered over disk, is divided into Memory Regions; each region contains Memory Segments made up of data pages, B+ tree pages and index pages]
In Native Persistence, data and index storage is divided into:
• Memory (subset of data)
• Disk (superset of data)
• Data can be stored both off-heap and on-heap
• When stored off-heap, there are fewer constraints on the volume of data and fewer GC pauses
• Memory is further divided into Memory Regions
• Memory Regions consist of Memory Segments, which comprise Data Pages, B+ Tree Pages, Index Pages and FreeList structures

Advanced Memory Management
Alluxio
• Pinning/Unpinning: to enforce data locality in a specific tier
• Allocators: for choosing locations to write new data blocks
• Evictors: for choosing which data to move to a lower tier to free space. Supported algorithms: Greedy, LRU, LRFU, Partial LRU
• Evictors and Allocators are applied globally
• A write may fail if space can’t be freed or if the data exceeds the size of the top tier
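The interplay of tiers, pinning and an LRU evictor in Alluxio can be sketched with a toy two-tier block store. This is an illustrative model under stated assumptions, not Alluxio code: `TieredStore`, its capacity counted in blocks, and the single "LOWER" tier standing in for SSD/HDD are all hypothetical simplifications.

```scala
import scala.collection.mutable

// Hypothetical two-tier block store: a small top tier (MEM) over a larger lower tier.
final class TieredStore(memCapacity: Int) {
  // LinkedHashMap keeps insertion order; re-inserting on access puts the block last,
  // so the first entry is always the least recently used one.
  private val mem = mutable.LinkedHashMap.empty[String, Array[Byte]]
  private val lower = mutable.Map.empty[String, Array[Byte]]
  private val pinned = mutable.Set.empty[String]

  def pin(id: String): Unit = pinned += id
  def unpin(id: String): Unit = pinned -= id

  // New blocks go to the top tier; unpinned LRU blocks are demoted to make room.
  // Returns false when no space can be freed, in which case nothing is written.
  def write(id: String, data: Array[Byte]): Boolean = {
    mem.remove(id)
    var ok = true
    while (ok && mem.size >= memCapacity) {
      mem.keys.find(k => !pinned(k)) match {
        case Some(victim) => mem.remove(victim).foreach(block => lower.update(victim, block))
        case None => ok = false // every block in the top tier is pinned
      }
    }
    if (ok) {
      lower.remove(id)
      mem.update(id, data)
    }
    ok
  }

  // Reading a block from MEM marks it as most recently used.
  def read(id: String): Option[Array[Byte]] = mem.remove(id) match {
    case Some(block) => mem.update(id, block); Some(block)
    case None => lower.get(id)
  }

  def tierOf(id: String): Option[String] =
    if (mem.contains(id)) Some("MEM") else if (lower.contains(id)) Some("LOWER") else None
}
```

The failing write in the fully pinned case corresponds to the last bullet above: when the evictor cannot free space in the top tier, the write is rejected rather than silently placed elsewhere.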
Apache Ignite
Supports memory policies (e.g. eviction) applicable at:
• Memory Region level (for off-heap caching)
• Entry level (for on-heap caching)
thus providing more granular control
Eviction:
• Supported algorithms for page-based eviction: Random-LRU, Random-2-LRU
• Supported algorithms for entry-level eviction: FIFO, LRU, Random
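The idea behind a Random-LRU style page eviction can be sketched as follows: instead of maintaining a global LRU ordering, sample a handful of pages at random and evict the least recently used page in the sample. The `Page` record, the `sampleSize` parameter and the helper name are assumptions for illustration, not Ignite's internals.

```scala
import scala.util.Random

// Hypothetical page record: an id plus the logical time of its last access.
final case class Page(id: Int, lastAccess: Long)

object RandomLru {
  // Sample a few pages at random (with replacement) and pick the one
  // with the oldest access time among the sample as the eviction victim.
  def chooseVictim(pages: IndexedSeq[Page], sampleSize: Int, rng: Random): Page = {
    require(pages.nonEmpty && sampleSize > 0, "need at least one page and a positive sample size")
    val sample = Seq.fill(sampleSize)(pages(rng.nextInt(pages.length)))
    sample.minBy(_.lastAccess)
  }
}
```

Sampling keeps the cost of each eviction constant regardless of how many pages a region holds, at the price of only approximating true LRU order.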

Additional Capabilities: SQL Support
Alluxio
• Not supported
Apache Ignite
• Supports distributed and horizontally scalable SQL database capabilities
• Supports indexing
• ANSI-99 SQL compliant
• Supports all SQL DDL and DML commands, including UPDATE, DELETE and MERGE queries
• Resembles Kudu’s capabilities
• Counters limitations of HDFS
• Supports running queries on data spanning memory and disk; unlike in Impala or Spark, not all of the data needs to be in memory for processing
[Figure: clients (Java, .NET, C++, PHP, BI tools) connect through JDBC, ODBC or the Ignite SQL API to a cluster of nodes, each holding data and indexes in RAM and on disk]

Additional Capabilities (Continued)
Alluxio
• Key-value APIs (not transactional/ACID compliant)
Apache Ignite
• Key-value APIs (transactional, ACID compliant)
• Compute Grid: distributed and parallel computation
• MLGrid: machine learning library on top of Apache Ignite. Currently supports limited vector and matrix algebra operations; other algorithms are on the roadmap
• Streaming ingestion: ingesting real-time streams of data into Ignite in a distributed, scalable and fault-tolerant manner

Key Takeaways:
• For inter-process state sharing (e.g. across Spark jobs), both provide adequate functional capabilities.
• Both platforms provide automatic hot data management, but Apache Ignite provides more granular control courtesy of its per-memory-region policies.
• For convenience in use-cases that involve interacting with data in multiple storage systems at memory speed, Alluxio makes more sense.
• For building real-time data and analytics pipelines, Apache Ignite makes more sense as a sink.
• For analytical use-cases involving relational processing and in-place mutation at high speed, Apache Ignite makes more sense.
Questions

Apache Ignite vs Alluxio: Memory Speed Big Data Analytics

  • 1. Memory Speed Big Data Analytics: Alluxio vs Apache IgniteIrfan Elahi - Deloitte 1 Memory Speed Big Data AnalyticsAlluxio vs Apache Ignite Irfan Elahi
  • 2. Memory Speed Big Data Analytics: Alluxio vs Apache IgniteIrfan Elahi - Deloitte 2 • Working as a Consultant in Deloitte (Analytics Service Line) • 4+ years of experience in Big Data and Machine Learning in multiple verticals • Recent Deloitte projects in Australia’s biggest Telecom company: • Architecting one of the largest Hadoop deployments in cloud employing in- memory computation technologies • Developing enterprise-grade stream processing system based on Hadoop stack employing NoSQL data-stores • Premium Udemy Instructor with 12,000+ students from 131 countries • Technical Reviewer of an upcoming Hadoop book published by APress • Lets connect: [email protected] | linkedin.com/in/irfanelahi | @elahi_irfan About Me
  • 3. Memory Speed Big Data Analytics: Alluxio vs Apache IgniteIrfan Elahi - Deloitte 3 • Big Data – The Evolution and Beyond • In-Memory Computation Trend – Overview • Unaddressed Challenges in Big Data • Introduction and Deep Dive Comparison of Alluxio and Apache Ignite Agenda
  • 4. Memory Speed Big Data Analytics: Alluxio vs Apache IgniteIrfan Elahi - Deloitte 4 The Evolution and Beyond Big Data Scale-up • Make individual expensive machines bigger • Challenges: • No collocated processing • Limited Scalability Scale-out • Add more machines • Challenges: • Programming complexity • Partial Failures Hadoop @ MapReduce • Scalable , economical and fault tolerant processing and storage • Fit for Offline computation • Disk I/O bound Hadoop @ Spark • Distributed in-memory computation framework • Fit for Offline + Online computation • Many challenges remain unaddressed Memory Centric Platforms Enabling advanced caching and memory management features to optimize performance and address many challenges
  • 5. Memory Speed Big Data Analytics: Alluxio vs Apache IgniteIrfan Elahi - Deloitte 5 Name of the game: Memory is the new disk!
  • 6. Memory Speed Big Data Analytics: Alluxio vs Apache IgniteIrfan Elahi - Deloitte 6 Driving Factor: Economics • Memory is much faster than disk (approx. 3000x) • Cost of memory decreasing • Memory per node increasing Challenges: • Memory is still expensive than disk (approx. 80x) • Memory is still limited • Not all data is memory-worthy and that’s not all… Driving Factor: Traditional paradigms’ Limitations Intermittent disk I/O and serialization cost in traditional computing platforms (e.g. MapReduce) causes: • High Latency • In-efficiency in iterative algorithms execution in analytics (e.g. machine learning, graph and network analysis) • In-efficiency in interactive data mining • Infeasibility for innovative use- cases like stream processing Impact: Innovative technologies and processing patterns • New processing patterns: Batch -> Event Driven • New processing technologies: Map Reduce -> Spark Hive -> Impala • New storage technologies: HDFS -> Alluxio | IGFS • New Use-cases: Real-time stream processing IoT Overview In-Memory Trend
  • 7. Memory Speed Big Data Analytics: Alluxio vs Apache IgniteIrfan Elahi - Deloitte 7 Overview Un-addressed Challenges: • On-Heap memory in memory- centric platforms (e.g. Spark) is limited thus causing resource pressure • Data resilience is compromised in the event of application crashes and causes expensive disk I/O • Inter-process data/state sharing still relies on HDFS I/O thereby causing performance issues On-Heap Memory Constraints: • Managing increasing number of compute and storage platforms increases complexity • Adding/Removing respective systems require application changes thus impacting DevOps lifecycle • Data locality gets compromised Many Compute to Storage Integration Paradox: Many leading Big data platforms still don’t support: • ACID compliant Transactions • ANSI SQL compliance • Indexing • In-place mutation Missing SQL and Transactional Support on Hadoop
  • 8. Memory Speed Big Data Analytics: Alluxio vs Apache IgniteIrfan Elahi - Deloitte 8 Potential Missing Pieces of Puzzle: • Alluxio • Apache Ignite
  • 9. Memory Speed Big Data Analytics: Alluxio vs Apache IgniteIrfan Elahi - Deloitte 9 Alluxio • Launched in 2012 by UC Berkeley AMPLab • Formerly known as Tachyon • Licensed under Apache License 2.0 • Approximately 500 contributors • Deployed in Yahoo, Baidu, Intel, Samsung to name a few A distributed and scalable storage virtualized across multiple storage systems under unified namespace to facilitate data access at memory speed
  • 10. Memory Speed Big Data Analytics: Alluxio vs Apache IgniteIrfan Elahi - Deloitte 10 Ignite • First release in early 2015 • August 2015: second fastest project to graduate after Spark • Licensed under Apache License 2.0 • Approximately 120 contributors • Deployed in IBM, Siemens, Citibank, Barclays, Nielsen to name a few An distributed key-value store and scalable in-memory computing platform with powerful SQL, key-value and processing APIs
  • 11. Deep Dive Comparison • Alluxio • Apache Ignite
  • 12. Architecture. [Alluxio] Master Nodes: • Manage file system metadata • Can be primary or secondary • HA supported via ZooKeeper. Worker Nodes: • Store data in the form of blocks • No rebalancing of blocks when new nodes are added • Send heartbeats to master nodes. Requires an Under File System (UFS) (e.g. HDFS, S3) for operation. [Apache Ignite] Node Roles: • Servers (default | equal by design | multiple servers can run on one host) • Clients (explicitly defined | connect to servers for computation). Logical Grouping: • User-configurable node roles via attributes registered by nodes at start-up • Registered attributes can be leveraged for dynamic logical grouping based on predicates (e.g. CPU utilization > 50%) to localize processes and jobs. No Name-Node Architecture: • When used as IGFS, no centralized metadata management (e.g. like the HDFS NameNode or Alluxio's master nodes) is needed • Hashing is used to determine data locality. (Diagram labels: master node(s), worker node(s); servers, group A, group B, optional.)
  • 13. Architecture (Continued). [Alluxio] Configuration: • Requires explicitly specifying master node(s) and worker node(s) in the configuration files • Adding nodes requires restarting the cluster. Interfaces: • Alluxio Shell • Web interface (also allows browsing the Alluxio FS). [Apache Ignite] Configuration: • Doesn't require explicit specification of nodes in configuration files • Nodes discover each other automatically when started • Supported methods for node discovery: multicast, static IP based • Cluster can be scaled without restarting • Supported deployment modes: Shared or Embedded. Interfaces: • Visor CommandLine: for viewing topology, node metrics and cache statistics, and for administering the cluster • Web interface: needs to be installed separately. (Diagram labels: master node(s), worker node(s); servers, group A, group B.)
  • 14. Architecture (Continued) • Alluxio: Alluxio Shell • Apache Ignite: Visor CommandLine
  • 15. Architecture (Continued) • Alluxio: Alluxio Web Interface
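The "No Name-Node Architecture" point on slide 12 says that Ignite locates data by hashing instead of asking a central metadata service. A minimal sketch of that idea follows, assuming rendezvous (highest-random-weight) hashing. The `RendezvousSketch` object, the node names and the way the hash is mixed are hypothetical illustrations, not Ignite's actual implementation.

```scala
import scala.util.hashing.MurmurHash3

// Hypothetical sketch: every client can compute the owner of a key locally,
// so no central metadata service is consulted for lookups.
object RendezvousSketch {
  // Score a (key, node) pair; the node with the highest score owns the key.
  private def score(key: String, node: String): Int =
    MurmurHash3.stringHash(key + "@" + node)

  // Owner of a key among the given nodes (nodes must be non-empty).
  // The node name is used as a tie-breaker so the result is deterministic.
  def ownerOf(key: String, nodes: Seq[String]): String =
    nodes.maxBy(node => (score(key, node), node))
}
```

Calling `RendezvousSketch.ownerOf("key-1", Seq("node-a", "node-b", "node-c"))` returns the same node on every client regardless of the order in which nodes are listed. When a node leaves, only the keys it owned are reassigned, which is one reason this family of hash functions suits clusters that scale without a restart.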
  • 16. Integration with Data Stores. [Alluxio] • Enables processing frameworks to interact with data from different data stores through a unified namespace and API • Supports the following data stores as UFS: HDFS, Blob, S3, GCS, Minio, Ceph, Swift and MapR-FS, to name a few • Different UFS are mounted at different mount points in the Alluxio namespace and then accessed seamlessly from applications • Additional UFS storage can be added through configuration. [Apache Ignite] • Provides two modes of persistence in addition to in-memory: Native Persistence (disk only) and 3rd-party Persistence (pluggable). Native Persistence: • Treats disk as the store for the superset of data • Supports SSD, Flash and 3D XPoint storage • Features like ACID compliance and SQL are supported only in this mode. 3rd-party Persistence: • Data stores like HDFS, Cassandra and JDBC-based stores are pluggable • Involves implementing the CacheStore interface for read-through/write-through • Supports write-behind caching for improved performance. (Diagram: Spark, MapReduce, Flink, … on top of Alluxio, exposed as a file system, over S3, HDFS, Blob, …; Spark/MapReduce/Flink, SQL, Streaming, Machine Learning and Compute on top of Ignite (in-memory), with native persistence to disk and 3rd-party persistence to HDFS, Cassandra, ….)
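The mounting process on slide 16 can be pictured as a mount table that maps a path in the unified namespace to a location in an under file system. The sketch below is a simplified illustration that assumes longest-prefix matching. `MountTableSketch`, the mount points and the UFS URIs are all hypothetical, and this is not Alluxio's actual implementation or API.

```scala
// Hypothetical sketch of a unified namespace backed by several under file systems.
object MountTableSketch {
  // Mount point in the unified namespace -> root URI in the under file system.
  val mounts: Map[String, String] = Map(
    "/"        -> "hdfs://namenode:8020/alluxio",
    "/s3/logs" -> "s3a://example-bucket/logs",
    "/archive" -> "hdfs://archive-nn:8020/data"
  )

  // True when `path` equals the mount point or lies underneath it.
  private def covers(mountPoint: String, path: String): Boolean =
    mountPoint == "/" || path == mountPoint || path.startsWith(mountPoint + "/")

  // Resolve a namespace path to a UFS URI using the longest matching mount point.
  def resolve(path: String): Option[String] =
    mounts.keys.filter(covers(_, path)).toSeq.sortBy(-_.length).headOption.map { mp =>
      val rest =
        if (mp == "/") path.stripPrefix("/")
        else path.stripPrefix(mp).stripPrefix("/")
      val root = mounts(mp).stripSuffix("/")
      if (rest.isEmpty) root else root + "/" + rest
    }
}
```

An application only ever sees paths such as `/s3/logs/2017/app.log` or `/archive`; which store actually holds the bytes is decided by the table, so adding another UFS is a matter of adding a mount entry rather than changing application code.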
  • 17. Integration with Spark. [Alluxio] • Alluxio provides Hadoop-compatible file system APIs, so data can be read and written via the file-system-related APIs of Spark RDDs • Enables reading and writing data from different data stores (configured as UFS) via Alluxio's unified namespace and API • Automatically manages movement of data persisted in Alluxio or UFS. [Apache Ignite] Two ways to integrate with Spark: • As stand-alone IGFS or as a caching layer on HDFS: Ignite is exposed as HDFS, so data can be read and written via the file-system-related APIs of Spark RDDs • As a distributed cache via IgniteContext: provides an implementation of Spark RDDs (supporting all transformations and actions); mutable RDDs (a view over the distributed cache's content); configurable lifespan depending on Ignite's deployment mode. (Diagram: RDDs read from and save to Alluxio, exposed as a file system over S3, HDFS, Blob, …, through the file system API; RDDs read from and write to Ignite IGFS through the file system API; IgniteContext yields an IgniteRDD [Tuple2] over Ignite (distributed cache), with transformations and the savePairs action.)
  • 18. Integration with Spark (Continued).
    [Alluxio]
    // reading data:
    val textRdd = sc.textFile("alluxio://masternode:19998/path")
    // transformations:
    val textRdd2 = textRdd.filter(_.contains("deloitte"))
    // writing data:
    textRdd2.saveAsTextFile("alluxio://masternode:19998/destination_path")
    [Apache Ignite]
    import org.apache.ignite.configuration.IgniteConfiguration
    import org.apache.ignite.spark.{IgniteContext, IgniteRDD}
    // creating IgniteContext:
    val igniteContext = new IgniteContext(sparkContext, () => new IgniteConfiguration())
    // creating IgniteRDD:
    val cacheRdd: IgniteRDD[Integer, String] = igniteContext.fromCache("deloitte_cache")
    // transformations (filter returns a plain Spark RDD of key-value pairs):
    val cacheRdd2 = cacheRdd.filter(_._2.contains("deloitte"))
    // writing data (savePairs is defined on IgniteRDD and takes the pairs to store):
    cacheRdd.savePairs(cacheRdd2)
    (Diagram: same Spark integration diagram as on slide 17.)
  • 19. Memory Architecture. [Alluxio] Alluxio storage is divided into three ordered tiers: • MEM (memory) • SSD • HDD. • Allows storing more data than the memory available in the cluster • Automatically manages data between tiers • Data is written to the top tier by default. [Apache Ignite] In Native Persistence, data and index storage is divided into: • Memory (subset of data) • Disk (superset of data). • Data can be stored both off-heap and on-heap • When data is stored off-heap, there are fewer constraints on the volume of data and fewer GC pauses • Memory is further divided into Memory Regions • Memory Regions consist of Memory Segments, which comprise Data Pages, B+ Tree Pages, Index Pages and FreeList structures. (Diagram: Alluxio tiers MEM, SSD, HDD ordered by performance; Ignite memory and disk, with Memory Regions containing Memory Segments made up of data pages, index pages and a B+ tree.)
  • 20. Advanced Memory Management. [Alluxio] • Pinning/Unpinning: to enforce data locality in a specific tier • Allocators: for choosing locations to write new data blocks • Evictors: for choosing which data to move to a lower tier to free space. Supported algorithms: Greedy, LRU, LRFU, Partial LRU • Evictors and allocators are applied globally • A write may fail if space can't be freed or if the data exceeds the size of the top tier. [Apache Ignite] Supports memory policies (e.g. eviction) applicable at: • Memory Region level (for off-heap caching) • Entry level (for on-heap caching), thus providing more granular control. Eviction: • Supported algorithms for page-based eviction: Random-LRU, Random2-LRU • Supported algorithms for entry-level eviction: FIFO, LRU, Random. (Diagram: Alluxio tiers MEM, SSD, HDD ordered by performance; Ignite memory and disk, with Memory Regions containing Memory Segments made up of data pages, index pages and a B+ tree.)
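The Alluxio tiering behaviour described on slides 19 and 20 (writes land in the top tier, an evictor moves blocks down to free space, and a write fails when a block exceeds the top tier) can be sketched as follows. This is a toy model that assumes a plain LRU evictor. `TieredStoreSketch`, its capacities and its block IDs are hypothetical and do not reflect Alluxio's real classes.

```scala
import scala.collection.mutable

// Toy model: ordered tiers (e.g. MEM, SSD, HDD), each with a capacity in bytes.
class TieredStoreSketch(capacities: Seq[(String, Long)]) {
  // Per tier: blockId -> size, kept in least-recently-used-first order.
  private val tiers = capacities.map { case (name, cap) =>
    (name, cap, mutable.LinkedHashMap.empty[String, Long])
  }

  private def used(i: Int): Long = tiers(i)._3.values.sum

  // Place a block in tier i, demoting LRU blocks to tier i + 1 until it fits.
  private def place(i: Int, id: String, size: Long): Boolean =
    if (i >= tiers.length || size > tiers(i)._2) false
    else {
      val blocks = tiers(i)._3
      while (used(i) + size > tiers(i)._2) {
        val (victim, victimSize) = blocks.head // least recently used block
        blocks.remove(victim)
        place(i + 1, victim, victimSize)       // dropped in this toy model if no lower tier fits it
      }
      blocks.put(id, size)
      true
    }

  // New data is written to the top tier; the write fails if it cannot fit there.
  def write(id: String, size: Long): Boolean = {
    tiers.foreach(_._3.remove(id)) // discard any older copy of the block
    place(0, id, size)
  }

  // Reading marks the block as most recently used within its tier
  // and reports which tier currently holds it.
  def read(id: String): Option[String] =
    tiers.collectFirst { case (name, _, blocks) if blocks.contains(id) =>
      val size = blocks.remove(id).get
      blocks.put(id, size)
      name
    }
}
```

With `new TieredStoreSketch(Seq("MEM" -> 100L, "SSD" -> 200L))`, writing blocks until MEM is full and then writing one more pushes the least recently read block down to SSD. Ignite's per-region and per-entry policies, by contrast, would correspond to giving each region or cache its own eviction rule rather than one global evictor.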
  • 21. Additional Capabilities: SQL Support. [Alluxio] Not supported. [Apache Ignite] • Supports distributed and horizontally scalable SQL database capabilities • Supports indexing • ANSI SQL-99 compliant • Supports all SQL DDL and DML commands, including UPDATE, DELETE and MERGE queries • Resembles Kudu's capabilities • Counters the limitations of HDFS • Supports running queries on data spanning memory and disk. Not all of the data needs to be in memory for processing, unlike in Impala or Spark. (Diagram: Java, .NET, C++, PHP and BI tools connect through JDBC/ODBC to the Ignite SQL API, which runs over data and indexes held in RAM and on disk across nodes.)
  • 22. Additional Capabilities (Continued). [Alluxio] • Key-value APIs (not transactional/ACID compliant). [Apache Ignite] • Key-value APIs (transactional, ACID compliant) • Compute Grid: distributed and parallel computation • MLGrid: machine learning library on top of Apache Ignite. Currently supports limited vector and matrix algebra operations; other algorithms are on the roadmap • Streaming ingestion: ingesting real-time streams of data into Ignite in a distributed, scalable and fault-tolerant manner
  • 23. Key Takeaways: • For inter-process state sharing (e.g. across Spark jobs), both provide adequate functional capabilities. • Both platforms provide automatic hot data management, but Apache Ignite provides more granular control through its per-memory-region policies. • For convenience in use cases that involve interacting with data in multiple storage systems at memory speed, Alluxio makes more sense. • For building real-time data and analytics pipelines, Apache Ignite makes more sense as a sink. • For analytical use cases involving relational processing and in-place mutation at high speed, Apache Ignite makes more sense.
  • 24. Questions

Editor's Notes

  • #5: Traditional computation (e.g. MapReduce) requires disk I/O, which results in slow performance. Training machine learning models involves iterative steps, which are severely impacted by intermittent disk I/O.
  • #7: Traditional computation (e.g. MapReduce) requires disk I/O, which results in slow performance. Training machine learning models involves iterative steps, which are severely impacted by intermittent disk I/O.
  • #8: In-place mutation; Tiered st. Traditional computation (e.g. MapReduce) requires disk I/O, which results in slow performance. Training machine learning models involves iterative steps, which are severely impacted by intermittent disk I/O. Reference: https://www.slideshare.net/imcsummit/imcs2015-1-devopensource-inmemory-platforms?qid=f751b742-9974-420c-b0da-e4ee57eeeadd&v=&b=&from_search=34 States are not shareable across processes, which causes resource pressure. Abstractions are limited to the lifecycle of a job. Spark: RDDs don't exist beyond a job, and RDD state can't be shared across jobs.