Hadoop Application Architectures:
Architecting a Next Generation Data Platform
Strata Data Conference, New York 2017
tiny.cloudera.com/app-arch-newyork
tiny.cloudera.com/nyquestions
Mark Grover | @mark_grover
Jonathan Seidman | @jseidman
Gwen Shapira | @gwenshap
Questions?
tiny.cloudera.com/nyquestions
Logistics
▪ Break at 3:00 – 3:30 PM
▪ Questions at the end of each section
▪ Slides at tiny.cloudera.com/app-arch-newyork
▪ Code at https://github.com/hadooparchitecturebook/Taxi360
About the book
▪ @hadooparchbook
▪ hadooparchitecturebook.com
▪ github.com/hadooparchitecturebook
▪ slideshare.com/hadooparchbook
About the presenters
Mark Grover
▪ Product Manager at Lyft
▪ Formerly Software Engineer on Spark at Cloudera
▪ Committer on Apache Bigtop, PMC member on Apache Sentry, Apache Spot (incubating)
▪ Contributor to Apache Spark, Hadoop, Hive, Sqoop, Pig, Flume
About the presenters
Gwen Shapira
▪ Product Manager at Confluent
▪ PMC member for Apache Kafka
▪ Previously software engineer at Cloudera
▪ @gwenshap on Twitter
About the presenters
Jonathan Seidman
▪ Software Engineer at Cloudera
▪ Contributor to Apache Sqoop
▪ Previously Technical Lead on the big data team at Orbitz, co-founder of the Chicago Hadoop User Group and Chicago Big Data
Case Study Overview
Internet of Things and Entity 360
Customer 360
Connected Cars
Entity (Taxi) 360 View
Geo-location/
Traffic Data
Customer Data
Maintenance
Data
Other Data
Sources
Streaming
Vehicle Data
What Makes Hadoop a Fit?
Data Sources Extract Transform Load
The early days…
What Makes Hadoop a Fit?
SERVERS MARTS EDWS DOCUMENTS STORAGE SEARCH ARCHIVE
ERP,	CRM,	RDBMS,	MACHINES FILES,	IMAGES,	VIDEOS,	LOGS,	CLICKSTREAMS EXTERNAL	DATA	SOURCES
Today…
Enabling a Range of New Use Cases…
Fraud Detection Market
Transactions
Internet of Things Network Security
Hadoop Challenges
Kafka Streams
Kafka Connect
Kafka
Challenges – Architectural Considerations
▪ Reliable and scalable ingress of multiple data types and sources:
- High volume event data? Batch data?
▪ Reliable and scalable storage to support multiple workloads and access patterns
- Historical data? Real-time search? Analytics?
▪ Processing engines (for background processing):
- Stream processing? Batch processing?
▪ Data Modeling
- Modeling data for real-time random access? Analytic access? Batch access?
Case Study
Requirements
Overview
Requirements
▪ Allow users (technical and non-technical) to analyze and visualize data…
Requirements
▪ Provide analysts with query capabilities via a standard interface…
Requirements
▪ Provide developers the ability to perform batch processing on historical data…
Requirements
▪ To support all this, we need:
- Reliable ingestion of streaming and batch data.
- Ability to perform transformations on streaming data in flight.
- Ability to perform sophisticated processing of historical data.
High level architecture
Walkthrough
High level architecture
Source: Data Producers
Transport: Pub-Sub
Stream Processing: Processing & Ingestion Engine
Storage: Nested Tables, Indexed Cube, Relational Tables, Entity Time Series Lookup
Access: Batch Processing, SQL, NRT REST, NRT Dashboard
Data Sources
Considerations
High level architecture
Source Transport Stream
Processing
Storage Access
Data Producers
Processing &
Ingestion Engine
Nested
Tables
Indexed
Cube
Relational
Tables
Entity Time
Series Lookup
Batch
Processing
SQL
NRT REST
NRT Dashboard
Pub Sub
Key to Customer 360 Success
Your project is only as good as the quality and variety of data sources
Geo-location/
Traffic Data
Customer Data
Maintenance
Data
Other Data
Sources
Streaming
Vehicle Data
Files
CSV? XML?
JSON?
Twitter?
Mainframe?
Database Salesforce?
MQTT
Data Producers: Flume vs. Kafka
▪ Flume – well integrated with Hadoop
- Part of the Hadoop ecosystem
- Great choice when ingesting data into HDFS
- Can support simple transformations
- Minimal coding – built-in support for common data sources
▪ Kafka – flexible, get-everything pipe
- Producers in ~20 languages
- REST API
- Huge connector ecosystem
Kafka Clients
Apache Kafka Clients Ecosystem Clients
REST Proxy
Talking to Non-native Kafka Apps and Outside the Firewall
REST Proxy
Non-Java Applications
Native Kafka Java Applications
REST / HTTP
Simplifies administrative
actions
Simplifies message creation
and consumption
Provides a RESTful
interface to a Kafka
cluster
Kafka Connect
Streaming Data Capture
JDBC
Logs
MQTT
RDBMS
Key/Value
HDFS
Kafka Connect API
Kafka
Connector
Connector
Connector
Connector
Connector
Connector
Sources Sinks
Fault tolerant
Manage hundreds of data
sources and sinks
Preserves data schema
Part of Apache Kafka
project
Includes simple
transformations
Ecosystem of Connectors
Databases Datastore/File Store
Analytics Applications / Other
How Does Connect Work?
Log
Connector
MQTT
Connector
REST API
Logs MQTT
Log Task Log Task
MQTT
Task
MQTT
Task
Schema Registry
Elastic
Cassandra
HDFS
Example Consumers
Serializer
Source 1
Serializer
Source 2
Kafka Topic
Schema Registry
Define the expected fields for each Kafka topic
Automatically handle schema changes (e.g. new fields)
Prevent backwards incompatible changes
Get different data sources to talk the same language
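A minimal sketch of what "prevent backwards incompatible changes" can mean, in plain Python rather than against a real Schema Registry. It assumes a simplified Avro-style rule: a reader using the new schema can still read old data only if every added field has a default. The dictionary schema layout and the `is_backward_compatible` helper are hypothetical, and a real registry applies more rules (for example, type promotion).

```python
def is_backward_compatible(old_fields: dict, new_fields: dict) -> bool:
    """Return True if a reader using new_fields can still read data written with old_fields.

    Each schema maps a field name to {"type": str, "default": value (optional)}.
    """
    for name, spec in new_fields.items():
        if name not in old_fields:
            # Old records lack this field, so the reader needs a default to fill in.
            if "default" not in spec:
                return False
        elif old_fields[name]["type"] != spec["type"]:
            # Simplification: treat any type change as incompatible.
            return False
    return True


v1 = {"vendor_name": {"type": "string"}, "fare_amt": {"type": "double"}}
v2_ok = dict(v1, tip_amt={"type": "double", "default": 0.0})   # new field with default
v2_bad = dict(v1, tip_amt={"type": "double"})                  # new field, no default

print(is_backward_compatible(v1, v2_ok))
print(is_backward_compatible(v1, v2_bad))
```

Removing a field passes this check, since a reader simply ignores data it no longer asks for.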
High level architecture
Source Transport Stream
Processing
Storage Access
Processing &
Ingestion Engine
Nested
Tables
Indexed
Cube
Relational
Tables
Entity Time
Series Lookup
Batch
Processing
SQL
NRT REST
NRT Dashboard
Pub Sub
But wait!
What about batch data?
Buffering
High level architecture
Source Buffer Stream
Processing
Storage Access
Pub-Sub
Processing &
Ingestion Engine
Nested
Tables
Indexed
Cube
Relational
Tables
Entity Time
Series Lookup
Batch
Processing
SQL
NRT REST
NRT Dashboard
Data Producers
Buffering Data
▪ What do we mean by “buffering” and why do we need it?
event,event,event,event,event,event…
This is bad!
▪ Network partitions happen
▪ Producers and consumers work at different rates
▪ Reliable storage is hard
Stream processing is hard
Let's do one at a time
Buffering Data – Message Brokers
Publisher
Publisher
Publisher
Message
Queue
Subscriber
Subscriber
Subscriber
High level architecture
Source Buffer Stream
Processing
Storage Access
Processing &
Ingestion Engine
Nested
Tables
Indexed
Cube
Relational
Tables
Entity Time
Series Lookup
Batch
Processing
SQL
NRT REST
NRT Dashboard
What is Kafka?
▪ It’s like a message queue, right?
- Actually, it’s a “distributed commit log”
- Or “streaming data platform”
0 1 2 3 4 5 6 7 8
Data
Source
Data
Consumer
A
Data
Consumer
B
Topics and Partitions
▪ Messages are organized into topics, and each topic is split into partitions.
- Each partition is an immutable, time-sequenced log of messages on disk.
- Note that time ordering is guaranteed within, but not across, partitions.
0 1 2 3 4 5 6 7 8
0 1 2 3 4 5 6 7 8
0 1 2 3 4 5 6 7 8
Partition 0
Partition 1
Partition 2
Data
Source
Topic
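A minimal sketch of the ordering guarantee on this slide, in plain Python rather than a real Kafka client. The hash-based `partition_for` helper and the partition count of three are assumptions for illustration; they stand in for, but are not, Kafka's own partitioner.

```python
from collections import defaultdict
from zlib import crc32

NUM_PARTITIONS = 3  # assumed partition count for the illustration


def partition_for(key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Pick a partition from a stable hash of the key."""
    return crc32(key.encode("utf-8")) % num_partitions


def append_all(events):
    """Append (key, value) events to per-partition logs; the list index acts as the offset."""
    logs = defaultdict(list)
    for key, value in events:
        logs[partition_for(key)].append((key, value))
    return logs


events = [
    ("taxi-1", "pickup"), ("taxi-2", "pickup"),
    ("taxi-1", "dropoff"), ("taxi-2", "dropoff"),
]
logs = append_all(events)

# Events for one key always land in the same partition, so their relative
# order is preserved there. No order is defined across partitions.
for partition, log in sorted(logs.items()):
    print(partition, log)
```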
Consumers
In Our Architecture
Taxi Trip Data
Producer
Kafka
taxi-trip-input
Topic
Stream
Processing
(Analytic)
Stream
Processing
(Lookup)
Stream
Processing
(Search)
Stream
Processing
(Long Term)
Input Events
vendor_name,Trip_Pickup_DateTime,Trip_Dropoff_DateTime,Passenger_Count,Trip_Distance,Start_Lon,Start_Lat,Rate_Code,store_and_forward,End_Lon,End_Lat,Payment_Type,Fare_Amt,surcharge,mta_tax,Tip_Amt,Tolls_Amt,Total_Amt
CMT,2009-01-05 08:31:55,2009-01-05 8:37:50,1,0.90000000000000002,-73.977936999999997,40.745919000000001,,,-73.983609000000001,40.755051000000002,Credit,5.2999999999999998,0,,0.79000000000000004,0,6.0899999999999999
Kafka Considerations – Reliability
▪ But remember there are tradeoffs…
Kafka Considerations – Reliability
▪ Different reliability levels for topics:
Taxi Trip Data
Kafka
taxi-trip-input
Twitter customer-sentiment
100% – dups
are ok
(“At least
once”)
<=100%
(“At most
once”)
News Flash:
Kafka’s Exactly Once
Producer is on the way
Kafka Reliability – Replication
Producer
Broker
Partition1
Partition2
Partition3
Leader
Kafka Reliability – Replication
Producer
Broker
Partition1
Partition2
Partition3
Kafka Reliability – Replication
Producer
Broker
Partition1
Partition2
Partition3
Broker
Partition1
Partition2
Partition3
Leader
Kafka Reliability – Replication
Producer
Broker
Partition1
Partition2
Partition3
Broker
Partition1
Partition2
Partition3
Leader
Leader
Kafka Reliability – Replication
Producer
Broker
Partition1
Partition2
Partition3
Broker
Partition1
Partition2
Partition3
Broker
Partition1
Partition2
Partition3
Leader
Kafka Reliability – Replication
▪ So how does this relate to our application?
kafka-topics --zookeeper ZKHOST:ZKPORT --partitions 2 --replication-factor 3 \
  --create --topic taxi-trip-input
kafka-topics --zookeeper ZKHOST:ZKPORT --partitions 2 --replication-factor 1 \
  --create --topic customer-sentiment
Kafka Reliability – Producers
Taxi Trip Data
Kafka
taxi-trip-input
Partition 1
Partition 2
Partition 3
Topic B
Partition 1
Partition 2
Partition 3
Message
failure?
Producer
Resend
message
acks=all
Kafka Reliability – Producers
▪ What about duplicates?
Taxi Trip Data
Kafka
taxi-trip-input
Partition 1
Partition 2
Partition 3
Topic B
Partition 1
Partition 2
Partition 3
Producer
ID Message
1000 2009-01-04 03:02:00,1,2.629,...
1001 2009-01-04 03:38:00,3,4.549…
1001 2009-01-04 03:38:00,3,4.549…
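One way to live with at-least-once duplicates like message 1001 above is to drop repeats by ID on the consuming side. The sketch below is plain Python and assumes each message carries a unique ID, as the table suggests. The `deduplicate` helper is hypothetical, and a production version would need a bounded or persistent store for the seen IDs.

```python
def deduplicate(messages, seen_ids=None):
    """Yield each (msg_id, payload) once, skipping IDs that were already seen."""
    seen = set() if seen_ids is None else seen_ids
    for msg_id, payload in messages:
        if msg_id in seen:
            continue  # a resend of a message we already handled
        seen.add(msg_id)
        yield msg_id, payload


received = [
    (1000, "2009-01-04 03:02:00,1,2.629,..."),
    (1001, "2009-01-04 03:38:00,3,4.549..."),
    (1001, "2009-01-04 03:38:00,3,4.549..."),  # resent after a missed acknowledgement
]
unique = list(deduplicate(received))
print([msg_id for msg_id, _ in unique])
```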
Kafka Scaling – Partitions
Producer
Kafka
taxi-trip-input
Partition 1
Partition 2
Partition 3
Consumer Group
Consumer
Consumer
Consumer
Kafka Scaling – Partitions
Producer
Kafka
taxi-trip-input
Partition 1
Partition 2
Partition 3
Consumer Group
Consumer
Consumer
Consumer
Partition 4
Partition 5
Consumer
Consumer
Higher
throughput
Higher
throughput
More
resources
(memory)
More
resources
(file handles)
Producer
How many partitions?
§ Adding partitions late in the game is painful
§ Basic formula:
total desired throughput / throughput of slowest consumer or producer
§ Or ~25GB disk space
§ Not too many because:
- Each partition takes broker heap memory and file handles
- Each partition slows down node shutdown / recovery
- 1000 – 4000 partitions per broker max
- Producers will produce smaller batches – lower throughput
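The basic formula above can be worked through as follows. The throughput figures are made-up numbers for illustration, and `partitions_needed` is a hypothetical helper, not part of any Kafka tool.

```python
import math


def partitions_needed(target_mb_s: float, producer_mb_s: float, consumer_mb_s: float) -> int:
    """Total desired throughput divided by the throughput of the slowest producer or consumer."""
    slowest = min(producer_mb_s, consumer_mb_s)
    if slowest <= 0:
        raise ValueError("per-partition throughput must be positive")
    return max(1, math.ceil(target_mb_s / slowest))


# Assumed figures: 100 MB/s overall, a producer that manages 20 MB/s per
# partition and a consumer that manages 10 MB/s per partition.
example = partitions_needed(target_mb_s=100, producer_mb_s=20, consumer_mb_s=10)
print(example)
```

The result is a lower bound from throughput alone; the per-broker limits listed above still apply.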
Kafka Scaling – Producers
Producer
Kafka
taxi-trip-input
Partition 1
Partition 2
Partition 3
Consumer Group
Consumer
Consumer
Consumer
Partition 4
Partition 5
Consumer
Consumer
Producer
Guarding Against Message Loss
§ Producer – What happens if the producer loses connection to Kafka and the buffer overflows?
- You get an exception. You can choose to… block? Write to file?
§ Source – What happens if events are lost before getting sent to producer?
- Once again use some kind of buffer to provide sufficient retention of data.
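A minimal sketch of the "write to file" choice when the in-memory buffer is full. It is plain Python with a bounded queue standing in for the producer buffer; the `SpillingBuffer` class is hypothetical and does not reflect how any Kafka client is implemented.

```python
import queue
import tempfile


class SpillingBuffer:
    """Hold events in a bounded in-memory queue and spill overflow to a file."""

    def __init__(self, max_in_memory: int, spill_file):
        self._queue = queue.Queue(maxsize=max_in_memory)
        self._spill_file = spill_file
        self.spilled = 0

    def offer(self, event: str) -> None:
        try:
            self._queue.put_nowait(event)
        except queue.Full:
            # Buffer overflow: keep the event on disk instead of dropping it.
            self._spill_file.write(event + "\n")
            self._spill_file.flush()
            self.spilled += 1

    def in_memory(self) -> int:
        return self._queue.qsize()


with tempfile.TemporaryFile(mode="w+") as spill:
    buf = SpillingBuffer(max_in_memory=2, spill_file=spill)
    for i in range(5):
        buf.offer(f"event-{i}")
    spill.seek(0)
    spilled_events = spill.read().splitlines()
    in_memory_count = buf.in_memory()

print(in_memory_count, spilled_events)
```

Blocking the caller instead would trade data safety for back-pressure on the source; which is right depends on whether the source can wait.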
Stream Processing
Considerations
High level architecture
Source Transport Stream
Processing
Storage Access
Custom
Producer
or
Processing &
Ingestion Engine
Nested
Tables
Indexed
Cube
Relational
Tables
Entity Time
Series Lookup
Batch
Processing
SQL
NRT REST
NRT Dashboard
Streaming agenda
▪ What do we mean by streaming?
▪ Streaming use-cases
▪ Streaming semantics
▪ Which streaming engine to choose?
▪ Streaming in our use-case
What do we mean by streaming?
What do we mean by streaming?
Real-time: constant low milliseconds & under
Near real-time: low milliseconds to seconds, delay in case of failures
Batch: 10s of seconds or more, re-run in case of failures
But, there’s no free lunch
Real-time: constant low milliseconds & under
Near real-time: low milliseconds to seconds, delay in case of failures
Batch: 10s of seconds or more, re-run in case of failures
Toward real-time: “difficult” architectures, lower latency
Toward batch: “easier” architectures, higher latency
Streaming use-cases
Streaming Use-cases
▪ Ingestion (most relevant in our use-case)
▪ Simple transformations
- Decision (e.g. anomaly detection)
- Enrichment (e.g. add a state based on zipcode)
▪ Advanced usage
- Machine Learning
- Windowing
#1 - Simple ingestion
Buffer
Event e Stream
Processing Long term
storage
Event e
#2 - Enrichment
Buffer
Event e Stream
Processing Storage
Event e’
e’ = enriched event e
Context store
#2 - Decision
Buffer
Event e Stream
Processing Storage
Event e’
e’ = e + decision
Rules
#3 – Advanced usage
Buffer
Event e Stream
Processing Storage
Event e’
e’ = aggregation or
windowed aggregation
Model
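As an illustration of a windowed aggregation, here is a tumbling-window count in plain Python. Event times are assumed to be integer seconds, and `tumbling_window_counts` is a hypothetical helper; real engines also have to deal with late and out-of-order events.

```python
from collections import Counter


def tumbling_window_counts(events, window_seconds: int) -> dict:
    """Count events per fixed, non-overlapping window keyed by window start time."""
    counts = Counter()
    for event_time, _payload in events:
        window_start = (event_time // window_seconds) * window_seconds
        counts[window_start] += 1
    return dict(counts)


events = [(0, "a"), (12, "b"), (59, "c"), (60, "d"), (125, "e")]
counts = tumbling_window_counts(events, window_seconds=60)
print(counts)
```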
#1 – Simple Ingestion
1. Zero transformation
- No transformation, plain ingest
- Keep the original format – SequenceFile, Text, etc.
- Allows storing data that may have errors in the schema
2. Format transformation
- Simply change the format of the field
- To a structured format, say, Avro, for example
- Can do schema validation
3. Atomic transformation
- Mask a credit card number
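A minimal sketch of an atomic transformation such as card masking, in plain Python. The regular expression is an assumption that matches 13 to 16 digits with optional space or hyphen separators; it is not a complete card-number detector.

```python
import re

# Assumed pattern: 13 to 16 digits, optionally separated by spaces or hyphens.
CARD_PATTERN = re.compile(r"\b(?:\d[ -]?){12,15}\d\b")


def mask_card_numbers(text: str) -> str:
    """Replace all but the last four digits of anything that looks like a card number."""
    def _mask(match):
        digits = re.sub(r"\D", "", match.group())
        return "*" * (len(digits) - 4) + digits[-4:]

    return CARD_PATTERN.sub(_mask, text)


masked = mask_card_numbers("paid with 4111 1111 1111 1111 at 08:31")
print(masked)
```

The transformation looks at one event at a time and needs no external context, which is what makes it "atomic".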
#2 - Enrichment
Buffer
Event e Stream
Processing Storage
Event e’
e’ = enriched event e
Context store
Need to store the
context
somewhere
Where to store the context?
1. Locally Broadcast Cached Dim Data
- Local to Process (On Heap, Off Heap)
- Local to Node (Off Process)
2. Partitioned Cache
- Shuffle to move new data to partitioned cache
3. External Fetch Data (e.g. HBase, Memcached)
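A minimal sketch combining options 1 and 3: a process-local cache in front of an external fetch. It is plain Python; the `EXTERNAL_STORE` dictionary is a hypothetical stand-in for a system such as HBase or Memcached, and the zipcode-to-state data is invented for the example.

```python
from functools import lru_cache

# Hypothetical stand-in for an external context store.
EXTERNAL_STORE = {"10001": "NY", "94105": "CA"}


@lru_cache(maxsize=1024)
def lookup_state(zipcode: str) -> str:
    """Fetch context from the external store; repeated lookups are served from the local cache."""
    return EXTERNAL_STORE.get(zipcode, "UNKNOWN")


def enrich(event: dict) -> dict:
    """Return e' = e plus a state derived from the zipcode."""
    return {**event, "state": lookup_state(event["zipcode"])}


events = [
    {"id": 1, "zipcode": "10001"},
    {"id": 2, "zipcode": "94105"},
    {"id": 3, "zipcode": "10001"},
]
enriched = [enrich(e) for e in events]
external_fetches = lookup_state.cache_info().misses
print(enriched, external_fetches)
```

Only cache misses reach the external store, which is the point of caching dimension data locally.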
#1a - Locally broadcast cached data
Could be
On heap or Off heap
#1b - Off process cached data
Data is cached on the node, outside of the process, potentially in an external system like RocksDB
#2 - Partitioned cache data
Data is partitioned
based on field(s) and
then cached
#3 - External fetch
Data fetched from
external system
Partitioned cache + external
Streaming semantics
Delivery Types
▪ At most once
- Not good for many cases
- Only where performance/SLA is more important than accuracy
▪ Exactly once
- Expensive to achieve but desirable
▪ At least once
- Easiest to achieve
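The difference between at-most-once and at-least-once often comes down to whether the consumer records its position before or after processing. The simulation below is plain Python with one injected crash; `consume` and its parameters are hypothetical and model no specific engine.

```python
def consume(log, commit_before_processing: bool, crash_at: int):
    """Replay a log with one simulated crash and return the processed messages.

    commit_before_processing=True  gives at most once (the in-flight message is lost).
    commit_before_processing=False gives at least once (the in-flight message is replayed).
    """
    processed = []
    committed = 0      # offset to resume from after a restart
    crashed = False
    while committed < len(log):
        offset = committed
        if commit_before_processing:
            committed = offset + 1
        if offset == crash_at and not crashed:
            crashed = True
            if not commit_before_processing:
                processed.append(log[offset])  # work was done, but the commit never happened
            continue   # restart from the last committed offset
        processed.append(log[offset])
        if not commit_before_processing:
            committed = offset + 1
    return processed


log = ["a", "b", "c"]
at_most_once = consume(log, commit_before_processing=True, crash_at=1)
at_least_once = consume(log, commit_before_processing=False, crash_at=1)
print(at_most_once, at_least_once)
```

Exactly once then amounts to at-least-once delivery plus a destination that can absorb the duplicate.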
Semantics of our architecture
Source System 1
Source System 2
Source System 3
Ingest
Message broker
Extract
Streaming engine
Push
Destination system
Classification of storage systems
▪ File based
- S3
- HDFS
▪ NoSQL
- HBase
- Cassandra
▪ Document based
- Search
▪ NoSQL-SQL
- Kudu
De-duplication at file level
Semantics at key/record level
Which streaming engine to choose?
High level architecture
Source Transport Stream
Processing
Storage Access
Processing &
Ingestion Engine
Nested
Tables
Indexed
Cube
Relational
Tables
Entity Time
Series Lookup
Batch
Processing
SQL
NRT REST
NRT Dashboard
Apache
Beam
Kafka
Streams
Requirements
▪ Fault-tolerant and distributed
▪ Effectively once semantics
▪ Handle processing time vs. event time
▪ Allow stateful transformations
Spark Streaming
▪ Micro batch based architecture
▪ Allows stateful transformations
▪ Feature rich
- Windowing
- Sessionization
- ML
- SQL (Structured Streaming)
[Diagram: DStreams. In each batch (first, second), a source feeds a receiver, which produces an RDD; the RDDs pass through Filter, Count, and Print in a single pass.]
[Diagram: DStreams with state. A pre-first batch produces Stateful RDD 1; in the first and second batches, each batch's RDD partitions pass through Filter and Count in a single pass, are combined with the previous stateful RDD to produce the next one (Stateful RDD 2), and are printed.]
Spark Streaming - Gaps
▪ Latency is not as low
- Efforts toward reducing latency, e.g. RISELab’s Drizzle
▪ Global consistent execution state
- Stops overall execution of the distributed computation
- Eagerly persists records in transit, meaning larger snapshots
Flink
▪ True “streaming” system, but not as feature rich as Spark
▪ Much better event time handling
▪ Good built-in backpressure support
▪ Allows stateful transformations
▪ Lower Latency
- No Micro Batching
- Asynchronous Barrier Snapshotting (ABS)
Flink - ABS
Operator
Buffer
Operator
Buffer
Operator
Buffer
Flink - ABS
Barrier 1A Hit
Barrier 1B
Still Behind
Operator
Buffer
Flink - ABS
Both Barriers
Hit
Operator
Buffer
Barrier 1A Hit
Barrier 1B
Still Behind
Operator
Buffer
Flink - ABS Both Barriers
Hit
Operator
Buffer Barrier is
combined and
can move on
Buffer can be
flushed out
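The barrier steps in these slides can be sketched for a single operator with two input channels. This is a plain-Python illustration, not Flink code: `run_operator`, the channel names A and B, and the `SNAPSHOT` marker are all assumptions made for the example.

```python
BARRIER = "BARRIER"


def run_operator(arrivals, channels):
    """Process (channel, item) arrivals for one operator and return what it emits.

    Records from a channel whose barrier has already arrived are buffered until
    the barrier has arrived on every channel.
    """
    output = []       # records emitted downstream, in order
    buffered = []     # records held back behind a barrier
    blocked = set()   # channels whose barrier has arrived
    for channel, item in arrivals:
        if item == BARRIER:
            blocked.add(channel)
            if blocked == set(channels):
                output.append("SNAPSHOT")   # both barriers hit: state is consistent here
                output.extend(buffered)     # the buffer can be flushed out
                buffered.clear()
                blocked.clear()
        elif channel in blocked:
            buffered.append(item)
        else:
            output.append(item)
    return output


arrivals = [
    ("A", "a1"), ("B", "b1"),
    ("A", BARRIER),            # barrier 1A hit, barrier 1B still behind
    ("A", "a2"), ("B", "b2"),
    ("B", BARRIER),            # both barriers hit
    ("B", "b3"),
]
result = run_operator(arrivals, ["A", "B"])
print(result)
```

Record a2 arrives after barrier A, so it is held until barrier B catches up, while b2 is still processed before the snapshot.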
Storm
▪ Old school
▪ Didn’t manage state – had to use Trident
▪ No good support for batch processing
Samza
▪ Good integration with Kafka
▪ Doesn’t support batch
▪ Forked by Kafka Streams
Flume
▪ Well integrated with the Hadoop ecosystem
▪ Allowed interceptors (for simple transformations)
▪ Supports buffering
- Memory
- File
- Kafka
▪ But no real fault-tolerance
▪ No state management
Kafka Streams
▪ Good integration with Kafka
▪ Light-weight library (not a framework)
▪ No micro-batching, uses Kafka as internal messaging layer
▪ Maintains local state per node (in RocksDB, or in an in-memory hash map)
▪ Handles late events
▪ Stream-to-stream joins
Topic
Partition 1
Partition 2
Task 1 Re-partition topic
Partition 1
Partition 2
Task 3
Task 2
Task 4
Kafka Streams architecture
Apache Beam
§ Abstraction on top of Streaming Engines
§ Best support for Google Dataflow
Others
§ Apache Apex
§ Heron
Streaming in our use-case
Spark Streaming
▪ We chose Spark Streaming because:
- Same execution engine for batch and streaming
- Similar code for batch and streaming
- Support for security, Kafka integration
- Thriving community
High level architecture
Source Transport Stream
Processing
Storage Access
Nested
Tables
Indexed
Cube
Relational
Tables
Entity Time
Series Lookup
Batch
Processing
SQL
NRT REST
NRT Dashboard
Storage Layer
Considerations
High level architecture
Source Transport Stream
Processing
Storage Access
Nested
Tables
Indexed
Cube
Relational
Tables
Entity Time
Series Lookup
Batch
Processing
SQL
NRT REST
NRT Dashboard
Data Modeling
Structured Landing Zones
Relational: Traditional SQL
Nested: Optimized for nested structures like JSON
Time Series: Optimized for Entity 360 and time-based access
Reverse Indexed: Optimized for faceted charts and reverse index lookups
Graph: Optimized for nodes and edges
Special: Optimized for special use cases
Background Information
Compression Styles and Entropy
Columns
Rows
Compression Styles and Entropy
Block
Block
Block
Column
Column
Column
Column
Column
Column
Row
Group
Row
Group
Compression Codecs
- Snappy : 2x-3x : Fast Read, Fast Write
- LZO : 2x-3x : Fast Read, Fast Write
- Gzip : ~8x : ~Fast Read, Normal Write
- Default : ~8x : ~Fast Read, Normal Write
- BZip2 : ~10x : ~Fast Read, Slow Write
- Others ..
- Always be skeptical
- All data compresses differently
- Use your own data
Introducing the Hive Metastore
- Hive Metastore
- Adds a table-like metadata layer over a file system, block store, NoSQL store, or other storage
- Allows for SQL access
- Allows for greater security options
- Allows for external metadata
- Allows for partitioning
Typical Hive Table
- ParentFolder
- TableFolder
- Date=20171212
- DataFiles
- DataFiles
- Date=20171211
- DataFiles
- DataFiles
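A minimal sketch of why this layout helps: a filter on the partition column can be answered by selecting folders before any data file is opened. It is plain Python over path names only; `partition_path` and `prune` are hypothetical helpers, not Hive APIs.

```python
from pathlib import PurePosixPath


def partition_path(table_root: str, date: str) -> PurePosixPath:
    """Build the folder for one partition, following the Date=<value> convention above."""
    return PurePosixPath(table_root) / f"Date={date}"


def prune(partitions, wanted_date: str):
    """Keep only the partition folders a query on a single date needs to scan."""
    return [p for p in partitions if p.name == f"Date={wanted_date}"]


root = "/ParentFolder/TableFolder"
partitions = [partition_path(root, d) for d in ("20171211", "20171212")]
selected = prune(partitions, "20171212")
print([str(p) for p in selected])
```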
Access Patterns
- Partitioning
- Filter push down
- Indexing should be considered poor
- Ideal for large scans
Relational Storage
Thinking about Object/Tables
1. Let's start off easy
- Use case: we are a Netflix-type company and we have a log of users and movies watched that looks something like this:
User ID | Age | Account Start Date | Category Of User | Movie Watched | Movie Category | Start Time | Events List
Bob | 42 | 12/12/2012 | Basic | Die Hard | Action | 5/4/2016 12:00 | Play 0, pause at 15, FF at 40 to 55, E at 90
Kat | 31 | 12/12/2012 | Platinum | Beauty and the Beast | Family | 5/4/2016 12:00 | Play 0, pause at 15, FF at 40 to 55, E at 90
Thinking about Object/Tables
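1. To make this into objects we need to do some separation
User: User_id, Age, St_dt, Category
Movie: Movie_id, Title, Category
Watch_session: Watch_id, St_dt, En_dt, User_id, Movie_id
Watch_Events: Watch_id, St_dt, Type, Duration
Category_Typ: Category_id, Stream_rt, Is_feature_enabled
The diagram marks four one-to-many (1 to *) relationships; the keys imply User to Watch_session, Movie to Watch_session, and Watch_session to Watch_Events.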
Query Considerations
- Data is normally big, so:
- Partition according to access patterns
- Join with care
- Consider sampling or local testing before experimenting
- Data is stored as files
- Latency until data is accessible is high: seconds, minutes, or more
Look for big tables
[Diagram: the User, Movie, Watch_session, Watch_Events, and Category_Typ model, repeated.]
Mutation Patterns
- A file is written once and cannot be mutated
- Fine for append or snapshot use cases
- Mutation will require a compaction
Compaction Recap
Existing data:
Key Time Value
A 1 101
B 1 101
C 1 101
D 1 101
E 1 101
F 1 101
G 1 101
New data:
Key Time Value
A 2 102
D 2 102
F 2 102
F 3 103
H 3 103
After compaction (latest value per key):
Key Time Value
A 2 102
B 1 101
C 1 101
D 2 102
E 1 101
F 3 103
G 1 101
H 3 103
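The compaction in these tables keeps, for every key, the row with the highest time. A minimal plain-Python sketch follows, with files represented as lists of (key, time, value) tuples. The `compact` helper is hypothetical; in practice this would be a batch job over files.

```python
def compact(*files):
    """Merge files of (key, time, value) rows, keeping the latest row per key."""
    latest = {}
    for rows in files:
        for key, time, value in rows:
            if key not in latest or time > latest[key][0]:
                latest[key] = (time, value)
    return [(key, time, value) for key, (time, value) in sorted(latest.items())]


base = [(key, 1, 101) for key in "ABCDEFG"]
delta = [("A", 2, 102), ("D", 2, 102), ("F", 2, 102), ("F", 3, 103), ("H", 3, 103)]
compacted = compact(base, delta)
for row in compacted:
    print(*row)
```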
View Strategies
Hive Relational Model
Hive Nested Model
Models
Hive Normal Views
Hive Materialized Table
Views
Use in the cases where the view requires
a join that is done through a shuffle
Use only for tables that filter
records/columns or use for marking fields
Relational Storage Options
Kudu
Kudu Use Cases
§ Among other things, NRT availability of streaming data.
- A good fit for our application.
§ Also things like machine learning, time series, etc.
Kudu In Our Architecture
def sendEntityToKudu(taxiEntityTableName: String, it: Iterator[(String, NyTaxiYellowEntityStateWrapper)],
                     kuduClient: KuduClient): Unit = {
  val table = kuduClient.openTable(taxiEntityTableName)
  val session = kuduClient.newSession()
  session.setFlushMode(FlushMode.AUTO_FLUSH_BACKGROUND)
  it.foreach(r => {
    val state = r._2.state
    val entity = r._2.entity
    // New entities become inserts, modified entities become updates
    val operation: Operation = if (state.equals("New")) {
      table.newInsert()
    } else if (state.equals("Modified")) {
      table.newUpdate()
    } else {
      null
    }
    ...
https://github.com/hadooparchitecturebook/Taxi360/blob/master/src/main/scala/com/hadooparchitecturebook/taxi360/streaming/ingestion/kudu/SparkStreamingTaxiTripToKudu.scala
Kudu In Our Architecture
…
sqlContext.read.options(kuduOptions).format("org.apache.kudu.spark.kudu").load.
  registerTempTable("ny_taxi_trip_tmp")

// Vector
val vectorRDD: RDD[Vector] = sqlContext.sql("select * from ny_taxi_trip_tmp").map(r => {
  val taxiTrip = NyTaxiYellowTripBuilder.build(r)
  generateVectorOnly(taxiTrip)
})

println("--Running KMeans")
val clusters = KMeans.train(vectorRDD, numOfCenters, numOfIterations)
println(" > vector centers:")
clusters.clusterCenters.foreach(v => println(" >> " + v))
...
https://github.com/hadooparchitecturebook/Taxi360/blob/master/src/main/scala/com/hadooparchitecturebook/taxi360/etl/machinelearning/kudu/MlLibOnKudu.scala
Nested Structures
Nested
▪ Less Space than Denormalization
▪ Still have tables but the cost of joins is all but gone
▪ Also great for Cartesian joins
- N x M vs N + M
▪ Not really supported yet with Kudu or HBase with SQL
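The "N x M vs N + M" point can be illustrated with one parent row and two child collections. The sketch is plain Python with invented data: a contact with three phones and two addresses.

```python
from itertools import product

phones = ["home", "work", "mobile"]     # N = 3 child rows
addresses = ["billing", "shipping"]     # M = 2 child rows

# Joining both child tables to the parent yields every phone/address pair.
joined_rows = [("contact-1", p, a) for p, a in product(phones, addresses)]

# A nested row stores each child collection once, alongside the parent.
nested_row = {"id": "contact-1", "phones": phones, "addresses": addresses}
nested_items = len(nested_row["phones"]) + len(nested_row["addresses"])

print(len(joined_rows), nested_items)
```

The gap widens quickly: 100 phones and 100 addresses would be 10,000 joined rows against 200 nested items.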
Nested Example
CREATE TABLE fact_contacts (id BIGINT, name STRING, address STRING)
  STORED AS PARQUET;
CREATE TABLE dim_phones
(
    contact_id BIGINT
  , category STRING
  , international_code STRING
  , area_code STRING
  , exchange STRING
  , extension STRING
  , mobile BOOLEAN
  , carrier STRING
  , current BOOLEAN
  , service_start_date TIMESTAMP
  , service_end_date TIMESTAMP
) STORED AS PARQUET;
Nested Example
CREATE TABLE contacts_detailed_phones
(
id BIGINT, name STRING, address STRING
, phone ARRAY < STRUCT <
category: STRING
, international_code: STRING
, area_code: STRING
, exchange: STRING
, extension: STRING
, mobile: BOOLEAN
, carrier: STRING
, current: BOOLEAN
, service_start_date: TIMESTAMP
, service_end_date: TIMESTAMP
>>
) STORED AS PARQUET;
https://www.cloudera.com/documentation/enterprise/latest/topics/impala_complex_types.html
De-normalized vs Nested
- Nested Pros
- Co-location
- Faster to group by
- Faster to window
- Joins are free
- Less data
- Better compression
- Nested tables and columns that a query does not read add no read penalty
- Great for limiting the effort of Cartesian joins
- Nested Cons
- Size limitation of parent row
- Adding a child requires rewriting the whole parent record
Options for appending Nested
- It is all about the parent record
- We can add more than one partition key for the parent
- In our use case
- User & watch month or day
Storage and In Memory
- Also, don't limit the idea of nested to just tables
- In Spark, nested structures can be used as in-memory constructs to conserve:
- Networking
- In-memory cost
Nested Writing Example in Spark
{
"id": "0001",
"type": "donut",
"name": "Cake",
"ppu": 0.55,
"batters":
{
"batter":
[
{ "id": "1001", "type": "Regular" },
{ "id": "1002", "type": "Chocolate" },
{ "id": "1003", "type": "Blueberry" },
{ "id": "1004", "type": "Devil's Food" }
]
},
"topping":
[
{ "id": "5001", "type": "None" },
{ "id": "5002", "type": "Glazed" },
{ "id": "5005", "type": "Sugar" },
{ "id": "5007", "type": "Powdered Sugar" },
{ "id": "5006", "type": "Chocolate with Sprinkles" }
]
}
Nested Writing Example in Spark
val jsonDF = hiveContext.read.json(jsonRDD)
jsonDF.write.parquet("./parquet")
hiveContext.createExternalTable("jsonNestedTable", "./parquet")
Nested In Our Architecture
…
hiveContext.sql("create table " + hdfsTaxiNestedTableName + "( " +
" vender_id string," +
" trip array<struct< " +
" passenger_count: INT," +
" payment_type: STRING, " +
" total_amount: DOUBLE, " +
" fare_amount: DOUBLE " +
" >>" +
" ) stored as parquet")
val emptyDf = hiveContext.sql("select * from " + hdfsTaxiNestedTableName + " limit 0")
hiveContext.createDataFrame(newNestedDf, emptyDf.schema).registerTempTable("tmpNested")
hiveContext.sql("insert into " + hdfsTaxiNestedTableName + " select * from tmpNested")
…
https://github.com/hadooparchitecturebook/Taxi360/blob/master/src/main/scala/com/hadooparchitecturebook/taxi360/sql/kudu/KuduToNestedHDFS.scala
Time Series
Time Series Options
§ HBase and Cassandra
Entity Centric Time Series
▪ Partition by Entity ID
▪ Order by Time
▪ Allows for free windowing
▪ Allows for fetching of single time window of single entity at web scale
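A minimal sketch of the entity-plus-time key design, using a sorted list and binary search in place of a real store such as HBase. The `row_key` format (entity ID, a separator, zero-padded time) is an assumption for the example, not the key layout used in Taxi360.

```python
from bisect import bisect_left


def row_key(entity_id: str, time: int) -> str:
    """Compose a key that sorts by entity first, then by time (zero-padded)."""
    return f"{entity_id}|{time:010d}"


def scan(sorted_keys, entity_id: str, start_time: int, end_time: int):
    """Short scan: rows of one entity with start_time <= time < end_time."""
    lo = bisect_left(sorted_keys, row_key(entity_id, start_time))
    hi = bisect_left(sorted_keys, row_key(entity_id, end_time))
    return sorted_keys[lo:hi]


keys = sorted(
    row_key(entity, time)
    for entity, time in [
        ("Cust-A", 10), ("Cust-A", 20), ("Cust-A", 40),
        ("Cust-B", 10), ("Cust-B", 20), ("Cust-B", 30), ("Cust-B", 40),
        ("Cust-C", 10),
    ]
)
window = scan(keys, "Cust-B", 20, 40)
print(window)
```

Because one entity's rows are contiguous and time-ordered, a time window is a single short range read.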
HBase Entity Time Series
Cust-A, 10
Cust-A, 20
Cust-A, 40
Cust-C, 10
Cust-C, 20
Cust-C, 30
Cust-C, 40
Cust-B, 10
Cust-B, 20
Cust-B, 30
Cust-B, 40
Cust-F, 20
Cust-F, 30
Cust-F, 40
Cust-D, 10
Cust-D, 20
Cust-D, 40
Cust-G, 10
Cust-G, 20
Cust-G, 30
Cust-G, 40
HBase Entity Time Series
[Diagram: the same row keys, with a REST call served by a short scan.]
HBase Entity Time Series
[Diagram: the same row keys, read by three mappers, each scanning its own range of rows.]
What is meant by Bucketing and Sorting
- Partitioning on a Key
- Then sorting on that key plus one or more other fields
- Example
- User_id + Watch Event Time
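The two steps above can be sketched in plain Scala collections. This is an illustration only: `WatchEvent`, `bucketOf`, and `bucketAndSort` are hypothetical names, and hash-modulo bucketing is one assumed way to partition on the key.

```scala
object BucketSort {
  final case class WatchEvent(userId: String, eventTime: Long)

  // Partition on the key: every event for a user lands in the same bucket.
  def bucketOf(userId: String, numBuckets: Int): Int =
    java.lang.Math.floorMod(userId.hashCode, numBuckets)

  // Then sort inside each bucket on the key plus another field (event time).
  def bucketAndSort(events: Seq[WatchEvent], numBuckets: Int): Map[Int, Seq[WatchEvent]] =
    events
      .groupBy(e => bucketOf(e.userId, numBuckets))
      .map { case (bucket, es) => bucket -> es.sortBy(e => (e.userId, e.eventTime)) }
}
```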
Example of Bucketed Sorted
Cust-A, 10
Cust-A, 20
Cust-A, 40
Cust-B, 10
Cust-B, 20
Cust-B, 30
Cust-B, 40
Cust-C, 10
Cust-C, 20
Cust-C, 30
Cust-C, 40
Cust-F, 10
Cust-F, 20
Cust-F, 40
Cust-D, 10
Cust-D, 20
Cust-D, 40
Cust-G, 10
Cust-G, 20
Cust-G, 30
Cust-G, 40
Good for Appending Nested
Existing Data
Cust-A, 10
Cust-A, 20
Cust-A, 40
Cust-B, 10
Cust-B, 20
Cust-B, 30
Cust-B, 40
Cust-C, 10
Cust-C, 20
Cust-C, 30
Cust-C, 40
Cust-F, 10
Cust-F, 20
Cust-F, 40
Cust-D, 10
Cust-D, 20
Cust-D, 40
Cust-G, 10
Cust-G, 20
Cust-G, 30
Cust-G, 40
New Data
Cust-A, 50
Cust-A, 60
Cust-B, 50
Cust-B, 60
Cust-C, 50
Cust-D, 50
Cust-G, 50
Good for Appending Nested
[Figure: the same Existing Data and New Data, combined with a Shuffle Join]
Good for Appending Nested
[Figure: the same Existing Data, with the New Data rows regrouped beside it and combined with a Merge Join]
Good for Appending Nested
Cust-A, 10
Cust-A, 20
Cust-A, 40
Cust-C, 10
Cust-C, 20
Cust-C, 30
Cust-C, 40
Cust-A, 50
Cust-A, 60
Cust-C, 50
Merge Join
Cust-A, 10
Cust-A, 20
Cust-A, 40
Cust-C, 10
Cust-C, 20
Cust-C, 30
Cust-C, 40
Cust-A, 50
Cust-A, 60
Cust-C, 50
Order Retained
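A small sketch of why a merge join retains order: when both inputs are already sorted on (customer, time), one pass that always takes the smaller head yields sorted output without a shuffle or re-sort. `MergeAppend` and `merge` are illustrative names, and the in-memory lists stand in for one bucket of existing and new data.

```scala
object MergeAppend {
  type Row = (String, Int) // (customer ID, time), as in the slide rows

  // Merge two inputs that are already sorted by (customer, time).
  // Each step takes the smaller head, so the output stays sorted.
  def merge(existing: List[Row], incoming: List[Row]): List[Row] = {
    val ord = implicitly[Ordering[Row]]
    @annotation.tailrec
    def loop(a: List[Row], b: List[Row], acc: List[Row]): List[Row] = (a, b) match {
      case (Nil, rest) => acc.reverse ++ rest
      case (rest, Nil) => acc.reverse ++ rest
      case (x :: xs, y :: ys) =>
        if (ord.lteq(x, y)) loop(xs, b, x :: acc) else loop(a, ys, y :: acc)
    }
    loop(existing, incoming, Nil)
  }
}
```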
What else could we use Bucketing and Sorting for?
- Windowing
- Point retrieval
Bucketed & Sorted for Windowing
Cust-A, 10
Cust-A, 20
Cust-A, 40
Cust-B, 10
Cust-B, 20
Cust-B, 30
Cust-B, 40
Cust-C, 10
Cust-C, 20
Cust-C, 30
Cust-C, 40
Cust-F, 10
Cust-F, 20
Cust-F, 40
Cust-D, 10
Cust-D, 20
Cust-D, 40
Cust-G, 10
Cust-G, 20
Cust-G, 30
Cust-G, 40
Spark Mapper Spark Mapper Spark Mapper
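As a rough illustration of windowing over bucketed, sorted data: each mapper can walk its own partition and compare adjacent rows, because all rows of a customer sit together in time order. The sketch below uses a plain `Seq` as a stand-in for one partition; in Spark the same function could be applied per partition (for example through `mapPartitions`), which is an assumption about usage and not code from the talk. `PartitionWindowing` and `gaps` are hypothetical names.

```scala
object PartitionWindowing {
  type Row = (String, Int) // (customer ID, time)

  // Walk one bucketed, sorted partition and emit the gap to the previous row
  // of the same customer. Adjacent rows are enough, so no shuffle is needed.
  def gaps(partition: Seq[Row]): Seq[(String, Int, Int)] =
    partition.sliding(2).collect {
      case Seq((c1, t1), (c2, t2)) if c1 == c2 => (c2, t2, t2 - t1)
    }.toSeq
}
```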
Bucketed Sorted in a NoSQL
Cust-A, 10
Cust-A, 20
Cust-A, 40
Cust-B, 10
Cust-B, 20
Cust-B, 30
Cust-B, 40
Cust-C, 10
Cust-C, 20
Cust-C, 30
Cust-C, 40
Cust-F, 10
Cust-F, 20
Cust-F, 40
Cust-D, 10
Cust-D, 20
Cust-D, 40
Cust-G, 10
Cust-G, 20
Cust-G, 30
Cust-G, 40
Rest Call Short Scan
NoSQL
- Columnar
What is a NoSQL
- It's not NO SQL
- It's not a Database
- Think of it more like a
- HashMap
- Log
- Bucketed and Ordered
Hash Map
- There is a Key and a Value
- It is really fast to grab a key/value
- It is really fast to add a key/value
- Iteration is also possible
Key Value
A 1
B 1
C 1
D 1
E 1
F 1
G 1
Client
Log with Compactions
- When new records come in they don't rewrite the old ones
- They compact in
Key Time Value
A 1 101
B 1 101
C 1 101
D 1 101
E 1 101
F 1 101
G 1 101
Key Time Value
A 2 102
D 2 102
F 2 102
F 3 103
H 3 103
Key Time Value
A 2 102
B 1 101
C 1 101
D 2 102
E 1 101
F 3 103
G 1 101
H 3 103
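The three tables above (existing records, newer records, compacted result) can be reproduced with a short sketch. It assumes compaction keeps only the newest version of each key; `LogCompaction`, `Record`, and `compact` are illustrative names, not an API of any store.

```scala
object LogCompaction {
  final case class Record(key: String, time: Int, value: Int)

  // Keep only the newest version of each key, returned in key order.
  def compact(files: Seq[Seq[Record]]): Seq[Record] =
    files.flatten
      .groupBy(_.key)
      .values
      .map(_.maxBy(_.time))
      .toSeq
      .sortBy(_.key)
}
```

Feeding it the first two tables as two files gives the third table: A and D at time 2, F and H at time 3, and the untouched keys at time 1.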
HDFS
Log with Compactions
- Write Path
- Get Location for Record (Cached)
- First to WAL
- Then to Memstore
- Sorting & batching
- Flush to New HFile
- Later HFiles will be compacted
Client
Master
RegionServer
Memstore
HFiles New HFiles
HFiles
WAL
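A toy model of the write path listed above, to make the ordering of the steps concrete. It is a sketch under stated assumptions: `WritePathSketch`, `Store`, and `flushSize` are hypothetical, everything lives in memory, and clearing the WAL at flush time is a simplification of how a real region server manages its log.

```scala
import scala.collection.immutable.TreeMap

object WritePathSketch {
  final case class Store(
      wal: Vector[(String, String)] = Vector.empty,
      memstore: TreeMap[String, String] = TreeMap.empty[String, String],
      hfiles: Vector[Vector[(String, String)]] = Vector.empty,
      flushSize: Int = 3) {

    def put(key: String, value: String): Store = {
      // 1. Append to the write-ahead log first.
      val logged = copy(wal = wal :+ (key -> value))
      // 2. Then add to the memstore, which keeps keys sorted.
      val buffered = logged.copy(memstore = memstore + (key -> value))
      // 3. Flush once the batch is large enough.
      if (buffered.memstore.size >= flushSize) buffered.flush else buffered
    }

    // Write the memstore, already in key order, as one new immutable file.
    def flush: Store =
      copy(memstore = TreeMap.empty[String, String], hfiles = hfiles :+ memstore.toVector, wal = Vector.empty)
  }
}
```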
HDFS
Ordered
- All records and columns are ordered
- Ordering allows for simpler indexing
- Ordering allows for simpler compactions
- We will also use this ordering
- Windowing
- Time series
- Local scanning
Client
Master
RegionServer
Memstore
HFiles New HFiles
HFiles
Bucketing or Partitions
- HBase
- Out of the Box:
- Range
- Desired:
- Salt
- Cassandra
- Out of the Box:
- HashMod
- Bucketed HashMod
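One way to picture the "Salt" option for range-partitioned stores such as HBase is a stable prefix derived from the entity ID. The sketch below is an assumption-laden illustration: `SaltedKey`, the three-digit prefix, the `|` separator, and the hash-modulo salt are all hypothetical choices.

```scala
object SaltedKey {
  // A stable salt per entity: the same entity always gets the same prefix.
  def salt(entityId: String, buckets: Int): Int =
    java.lang.Math.floorMod(entityId.hashCode, buckets)

  // Prefixing the row key spreads entities over `buckets` key ranges,
  // while one entity's rows stay together and sorted by time.
  def rowKey(entityId: String, time: Long, buckets: Int): String =
    f"${salt(entityId, buckets)}%03d|$entityId%s|$time%019d"
}
```

A reader recomputes the salt from the entity ID, so a point lookup or short scan still goes to a single key range.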
So what about SQL
- Well SQL could totally work
- CQL for Cassandra
- Hive and Spark SQL on HBase
- Why is it not the best idea?
- Built more for point lookups
- Scans are not as fast as Parquet
- However the mutability may be more important than speed
- Partitioning is not simple
- It must be put into the key
Let's talk about CAP for a Minute
- Strong Consistency
- HBase & Kudu
- Variable Consistency
- Cassandra
HBase Model
Client
Master
Region Server 1
Region Server 2
- Region Server owns range splits
- Region Server 1 fails
- Master needs to figure that out
- Master needs to assign new Region Server to own splits
- Region Server 2 has to get organized
- Region Server 2 is ready to serve reads and writes
Cassandra Model
[Diagram, shown twice: a Client, three Replica Nodes (Has Replica), and one further node (Random Node)]
Cassandra Model
[Diagram: two Clients and three Replica Nodes (Has Replica)]
Cassandra Model (Common Models)
[Diagram, shown three times: two Clients and three Replica Nodes (Has Replica), one copy per model below]
3 Write - 1 Read
1 Write - 3 Read
1 Write - 1 Read
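The three models can be compared with the usual quorum-overlap rule: with N replicas, a read is guaranteed to touch at least one replica holding the latest acknowledged write when write acks plus read acks exceed N. The sketch assumes three replicas, as drawn; `TunableConsistency` and `readsSeeLatestWrite` are illustrative names.

```scala
object TunableConsistency {
  // True when every read set must overlap every write set.
  def readsSeeLatestWrite(replicas: Int, writeAcks: Int, readAcks: Int): Boolean =
    writeAcks + readAcks > replicas
}
```

Under this rule the first two models read the latest write, while 1 Write - 1 Read trades that guarantee for lower latency.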
NoSQL - Others
- Document
- MongoDB
- Couchbase
- Spanner-Inspired
- Kudu
- CockroachDB
- Druid.IO
NoSQL - Transactions
- Some have them
- Think about Kafka
Indexed Search
Lucene Indexing (Features)
- We don't have enough time in this whole class
- Ordering logic
- NGrams
- Weights
- Text Indexing
- Translations
- Facets *
Lucene Indexing (Facets)
- Facets are a side effect of our wonderful indexes
- They let us count all the documents that belong to given index entries to produce
- Grouped Counts
- Charts and Graphs (Kibana or Banana)
- People will also call this access pattern cubing a dataset
Lucene Indexing (Kibana & Banana)
Lucene Indexing (Facets Example)
- Time Series Example
Document ID  Hour of Day  User  State  Event
1            12           4201  MD     click
2            12           4202  VA     click
3            12           4203  VA     click
4            1            4201  MD     click
5            1            4202  VA     view
6            2            4204  CA     click
7            2            4205  VA     view
8            2            4201  MD     click
Lucene Indexing (Facets Example)
Document ID  Hour of Day  User  State  Event
1            12           4201  MD     click
2            12           4202  VA     click
3            12           4203  VA     click
4            1            4201  MD     click
5            1            4202  VA     view
6            2            4204  CA     click
7            2            4205  VA     view
8            2            4201  MD     click
9            2            4204  CA     click
Hour of Day (value: document IDs)
12: 1 2 3
1: 4 5
2: 6 7 8 9
User (value: document IDs)
4201: 1 4 8
4202: 2 5
4203: 3
4204: 6 9
4205: 7
State (value: document IDs)
MD: 1 4 8
VA: 2 3 5 7
CA: 6 9
Event (value: document IDs)
click: 1 2 3 4 6 8 9
view: 5 7
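The per-field lists above can be rebuilt from the nine documents with a few lines of Scala. This is a simplified sketch of the idea and not how Lucene stores its index internally; `FacetIndex`, `Doc`, `postings`, and `facetCounts` are hypothetical names.

```scala
object FacetIndex {
  final case class Doc(id: Int, hour: Int, user: Int, state: String, event: String)

  // The nine documents from the slide.
  val docs: Seq[Doc] = Seq(
    Doc(1, 12, 4201, "MD", "click"), Doc(2, 12, 4202, "VA", "click"),
    Doc(3, 12, 4203, "VA", "click"), Doc(4, 1, 4201, "MD", "click"),
    Doc(5, 1, 4202, "VA", "view"), Doc(6, 2, 4204, "CA", "click"),
    Doc(7, 2, 4205, "VA", "view"), Doc(8, 2, 4201, "MD", "click"),
    Doc(9, 2, 4204, "CA", "click"))

  // For each value of a field, the sorted IDs of the documents carrying it.
  def postings[K](field: Doc => K): Map[K, Seq[Int]] =
    docs.groupBy(field).map { case (k, ds) => k -> ds.map(_.id).sorted }

  // A facet count is just the length of each list.
  def facetCounts[K](field: Doc => K): Map[K, Int] =
    postings(field).map { case (k, ids) => k -> ids.size }
}
```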
Lucene Indexing (Facets Example)
- Events per hour
- Simple array count
Hour of Day (value: document IDs)
12: 1 2 3
1: 4 5
2: 6 7 8 9
Lucene Indexing (Facets Example)
- Events per hour by State
- Simple array count
State (value: document IDs)
MD: 1 4 8
VA: 2 3 5 7
CA: 6 9
Hour of Day (value: document IDs)
12: 1 2 3
1: 4 5
2: 6 7 8 9
Lucene Indexing (Facets Example)
- Note the bucketing and ordered pattern
Hour of Day 2: 6 7 8 9
State MD: 1 4 8
State VA: 2 3 5 7
State CA: 6 9
[Build slides step through the Hour of Day 2 list against the State lists, adding one count per match:]
6: +1 CA
7: +1 VA
8: +1 MD
9: +1 CA
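Because every list is sorted by document ID, counting "events in hour 2 by state" is a merge-style walk over two lists at a time. The sketch below assumes plain sorted lists of IDs; `FacetIntersect` and `intersectCount` are illustrative names.

```scala
object FacetIntersect {
  // Count the IDs present in both sorted lists in a single pass.
  def intersectCount(a: List[Int], b: List[Int]): Int = {
    @annotation.tailrec
    def loop(x: List[Int], y: List[Int], n: Int): Int = (x, y) match {
      case (h1 :: t1, h2 :: t2) =>
        if (h1 == h2) loop(t1, t2, n + 1)   // match: count it, advance both
        else if (h1 < h2) loop(t1, y, n)    // advance the list that is behind
        else loop(x, t2, n)
      case _ => n                           // one list is exhausted
    }
    loop(a, b, 0)
  }
}
```

With the hour 2 list (6 7 8 9), this gives 1 for MD, 1 for VA, and 2 for CA, matching the increments on the slides.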
Partitioning
- Solr and Elasticsearch partition the documents to land on all nodes
- This means
- You have the power of the cluster when querying
- This means you are accessing the whole cluster when querying
Writing Latency
- Lucene indexing is more expensive than NoSQL work
- Think of it as micro batching
- Larger batches ~= better throughput
- Compaction is also involved
- Deletes impact storage and performance until they are compacted
Storage Cost
- TTL is your friend
- Think of Lucene based systems as great if
- Your dataset is manageable in size
- You have a good TTL strategy
- You have a boat load of money
Graphs
Thinking in terms of Graphs
- Nodes and Edges
[Diagram: nodes Node:0 through Node:3 joined by edges labeled Friend, Child, Father, Coach, and Wife]
Thinking in terms of Graphs
- Use cases
- Querying
- Cassandra with Sparkle
- Neo4j
- Batch operations
- Giraph
- GraphX
- GraphLab
BSP Bulk Synchronous Parallel
- Process every Node Atomically
- Node gets all messages sent to it
- Nodes can mutate themselves and their edges
- Nodes can send messages to other nodes
- But nothing is received yet
- BSP waits until all the Node processing is done
- Then send messages to the right partition
- Repeat
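The loop above can be sketched on a single machine. The example assumes a connected-components job in which each node keeps the smallest node ID it has heard of; that algorithm choice, the names `BspSketch` and `components`, and the undirected edge map (each edge listed both ways) are assumptions for illustration.

```scala
object BspSketch {
  def components(edges: Map[Int, Seq[Int]]): Map[Int, Int] = {
    @annotation.tailrec
    def superstep(labels: Map[Int, Int], inbox: Map[Int, Seq[Int]]): Map[Int, Int] = {
      // Each node processes all messages sent to it and may mutate itself.
      val updated = labels.map { case (node, label) =>
        node -> (label +: inbox.getOrElse(node, Seq.empty)).min
      }
      // Changed nodes send their new label to neighbours; nothing is received yet.
      val outgoing = for {
        (node, label) <- updated.toSeq
        if label != labels(node)
        neighbour <- edges.getOrElse(node, Seq.empty)
      } yield neighbour -> label
      // Barrier: only now are messages routed to their target nodes.
      val nextInbox = outgoing.groupBy(_._1).map { case (n, ms) => n -> ms.map(_._2) }
      if (nextInbox.isEmpty) updated else superstep(updated, nextInbox)
    }

    val nodes = edges.keySet ++ edges.values.flatten
    val initial = nodes.map(n => n -> n).toMap
    // Superstep zero: every node announces its own ID to its neighbours.
    val firstInbox = edges.toSeq
      .flatMap { case (n, ns) => ns.map(_ -> n) }
      .groupBy(_._1)
      .map { case (n, ms) => n -> ms.map(_._2) }
    superstep(initial, firstInbox)
  }
}
```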
High level architecture
[Diagram: Source, Transport, Stream Processing, Storage, Access]
Batch Processing Considerations
High level architecture
[Diagram boxes: Source; Transport; Stream Processing; Storage; Access; Nested Tables; Indexed Cube; Relational Tables; Entity Time Series Lookup; Batch Processing; SQL; NRT REST; NRT Dashboard]
Why have batch processing?
▪ When you need a larger context
- Say, to train a model
▪ Complex periodic job that does something
- Convert data to a nested structure for reduced number of shuffles
▪ In our use-case,
- Kudu -> HDFS Nested is batch processing
- KMeans calculation is also in batch
Batch processing options
▪ Spark (+ MLlib)
▪ MapReduce (+ Mahout)
▪ Flink (+ Flink ML)
Spark
▪ Pretty popular
▪ Much faster than MapReduce
▪ Thriving community
MapReduce
▪ Sloooooow
Flink
▪ Pretty popular
▪ Batch is a special case of Streaming
▪ Developing community
In our use-case
▪ We chose Spark
- We were using Spark Streaming anyways
- Similar code between Spark and Spark Streaming
- Thriving community
Interactive Data Access Considerations
Types of data access
▪ REST server/APIs for querying entities and aggregates
▪ UI for displaying search facets
▪ SQL engine
REST servers Considerations
Why have a REST server?
▪ Tired of business people telling us how to access data
▪ Serves as an interface between the data engineers and business folks
▪ Lets business folks decide access patterns
▪ Lets engineers optimize those patterns
▪ Brownie points from your boss
▪ And, it’s not that difficult to write!
Don’t believe me?
import com.sun.jersey.spi.container.servlet.ServletContainer
import org.mortbay.jetty.Server
import org.mortbay.jetty.servlet.{Context, ServletHolder}
…
val server = new Server(port)
// Jersey servlet that scans the package below for REST resources
val sh = new ServletHolder(classOf[ServletContainer])
sh.setInitParameter("com.sun.jersey.config.property.resourceConfigClass",
"com.sun.jersey.api.core.PackagesResourceConfig")
sh.setInitParameter("com.sun.jersey.config.property.packages",
"com.hadooparchitecturebook.taxi360.server.hbase")
// Serialize returned objects to JSON
sh.setInitParameter("com.sun.jersey.api.json.POJOMappingFeature", "true")
val context = new Context(server, "/", Context.SESSIONS)
context.addServlet(sh, "/*")
server.start()
server.join()
Then, write a ServiceLayer
@GET
@Path("vender/{venderId}/timeline")
@Produces(Array(MediaType.APPLICATION_JSON))
def getTripTimeLine (@PathParam("venderId") venderId:String,
@QueryParam("startTime") startTime:String = Long.MinValue.toString,
@QueryParam("endTime") endTime:String = Long.MaxValue.toString):
Array[NyTaxiYellowTrip] = {
Use REST! Say no to business people!
▪ Access data like so:
http://<serverURL>:8080/vender/{venderId}/timeline
UI Considerations
UI requirements
Something that can
▪ Represent search results really well
▪ Integrate with Apache Solr on Hadoop
UI options
▪ Hue
▪ Banana
▪ Kibana
We chose Hue
▪ Because it’s included
▪ Please look at the others
SQL engines Considerations
SQL engine criteria
▪ Low latency SQL access
▪ Allows for high concurrency
▪ JDBC/ODBC integration
▪ Capable of large scale aggregation
▪ Optionally integrates with Kudu for real-time updates to SQL tables
Apache Hive
▪ Good JDBC integration
▪ Not really low latency, even when using Tez
▪ Doesn’t integrate with Kudu
▪ Can run with MapReduce, Spark, or Tez
Presto
▪ Low latency SQL engine from Facebook
▪ Provides JDBC/ODBC access
▪ Is in-memory only; large aggregations can lead to OOM errors
▪ Doesn’t integrate with Kudu
Apache Impala
▪ Low latency SQL access
▪ Provides JDBC/ODBC access
▪ Excellent concurrency support
▪ Integrates with Kudu for real-time SQL
Apache Drill
▪ Similar in architecture to Impala
▪ Provides JDBC/ODBC access
▪ Doesn’t integrate with Kudu
Spark SQL
▪ Builds on top of Spark
▪ JDBC/ODBC access only via Spark Thrift Server
- Doesn’t scale well with larger number of concurrent users
- Doesn’t fully provide secure access.
We chose
▪ Spark SQL
▪ Impala
Overall Architecture Review
High level architecture
[Diagram builds repeating the architecture boxes: Source; Transport; Stream Processing; Storage; Access; Processing & Ingestion Engine; Nested Tables; Indexed Cube; Relational Tables; Entity Time Series Lookup; Batch Processing; SQL; NRT REST; NRT Dashboard]
Where else to find us?
Other Sessions
▪ Ask Us Anything session – Thursday, 1:15 PM
▪ The Three Realities of Modern Programming: the Cloud, Microservices, and the Explosion of Data (Gwen) – Thursday 11:20 AM
▪ One Cluster Does Not Fit All: Architecture Patterns for Multicluster Apache Kafka Deployments (Gwen) – Thursday 2:05 PM
▪ Managing Successful Big Data Projects (Ted Malaska and Jonathan) – Thursday 4:35 PM
Thank you!
@hadooparchbook
tiny.cloudera.com/app-arch-newyork
Mark Grover | @mark_grover
Gwen Shapira | @gwenshap
Jonathan Seidman | @jseidman
Genji: Framework for building resilient near-realtime data pipelinesGenji: Framework for building resilient near-realtime data pipelines
Genji: Framework for building resilient near-realtime data pipelines
Swami Sundaramurthy
 
ShareChat’s Path to High-Performance NoSQL with ScyllaDB
ShareChat’s Path to High-Performance NoSQL with ScyllaDBShareChat’s Path to High-Performance NoSQL with ScyllaDB
ShareChat’s Path to High-Performance NoSQL with ScyllaDB
ScyllaDB
 
Set Your Data In Motion - CTO Roundtable
Set Your Data In Motion - CTO RoundtableSet Your Data In Motion - CTO Roundtable
Set Your Data In Motion - CTO Roundtable
confluent
 
Architecting a next generation data platform