Dynamic DDL
Adding Structure to Streaming Data on the Fly
OUR SPEAKERS
Hao Zou
Software Engineer
Data Science & Engineering
GoPro
David Winters
Big Data Architect
Data Science & Engineering
GoPro
TOPICS TO COVER
• Background and Business
• GoPro Data Platform Architecture
• Old File-based Pipeline Architecture
• New Dynamic DDL Architecture
• Dynamic DDL Deep Dive
• Using Cloud-Based Services (Optional)
• Questions
Background and Business
WHEN WE GOT HERE…
DATA ANALYTICS WAS BASED ON WORD OF MOUTH (& THIS GUY)
TODAY’S BUSINESS
• GoPro Data Analytics Platform
• Sources: Consumer Devices, GoPro Apps & Cloud, E-Commerce, Social Media & OTT, CRM, ERP, Web, Mobile
• Use cases: Product Insight, User Segmentation, CRM/Marketing & Personalization
DATA CHALLENGES AT GOPRO
• Variety of data - Hardware and Software products
• Software - Mobile and Desktop Apps
• Hardware - Cameras, Drones, Controllers, Accessories, etc.
• External - CRM, ERP, OTT, E-Commerce, Web, Social, etc.
• Variety of data ingestion mechanisms - Lambda Architecture
• Real-time streaming pipeline - GoPro products
• Batch pipeline - External 3rd party systems
• Complex Transformations
• Data often stored in binary to conserve space in cameras
• Heterogeneous data formats (JSON, XML, and packed binary)
• Seamless Data Aggregations
• Blend data between different sources, hardware, and software
• Build structures which reflect state vs. event-based
• Handle Privacy & Anonymization
Data Platform Architecture
OLD FILE-BASED PIPELINE ARCHITECTURE
• Real Time Cluster - streaming ingest (Rest API)
  • Log file streaming
  • RESTful service
  • Kafka
  • Spark Streaming
  • HBase
• Batch Induction Framework - batch ingest (FTP downloads, S3 sync)
  • Batch files
  • Scheduled downloads
  • Pre-processing
  • Java App
  • Airflow
• ETL Cluster - receives JSON from both ingest paths
  • Aggregations and Joins
  • Hive and Spark jobs
  • Map/Reduce
  • Airflow
• Secure Data Mart Cluster - receives Parquet + DDL from the ETL Cluster
  • End User Query
  • Impala / Sentry
  • Parquet
  • Kerberos & LDAP
• Analytics Apps
  • Hue
  • Tableau
  • Plotly
  • Python
  • R
STREAMING ENDPOINT
[Diagram: HTTP traffic passes through an ELB into the pipeline for processing of streaming logs; event and state streams flow to the ETL Cluster]
SPARK STREAMING PIPELINE
[Diagram: event and state streams are routed to separate HDFS paths (/path1/…, /path2/…, /path3/…, /path4/…) on their way to the ETL Cluster]
ETL PIPELINE
[Diagram: the ETL cluster combines HDFS with the Hive Metastore; state arrives from the Realtime Cluster and the Batch Induction Framework, state snapshots are built, and results are sent to the SDM Cluster]
DATA DELIVERY!
[Diagram: the SDM Cluster combines HDFS with the Hive Metastore; a Thrift ODBC Server exposes the User, Studio, Studio-Staging, GDA, and Report databases to applications; data arrives from the ETL Cluster]
PROS AND CONS OF OLD SYSTEM
• Pros:
  • Isolation of workloads
  • Fast ingest
  • Secure
  • Fast delivery/queries
  • Loosely coupled clusters
• Cons:
  • Multiple copies of data
  • Tightly coupled storage and compute
  • Lack of elasticity
  • Operational overhead of multiple clusters
NEW DYNAMIC DDL ARCHITECTURE
• All data lands in a single Amazon S3 Bucket, described by a shared Hive Metastore
• Ingest:
  • Real Time Cluster - Streaming (Rest API)
  • Batch Induction Framework - Batch Download (FTP downloads, S3 sync)
• Ephemeral ETL Cluster writes Events + State and Aggregates back to S3 as Parquet + DDL: Dynamic DDL!
• Ephemeral Data Mart Clusters #1, #2, … #N serve Notebooks, Tableau, Plotly, Python, and R
• Improvements:
  • Single copy of data
  • Separate storage from compute
  • Elastic clusters
  • Single long-running cluster to maintain
Dynamic DDL Deep Dive
NEW DYNAMIC DDL ARCHITECTURE
• Streaming Cluster: HTTP traffic passes through an ELB into the pipeline for processing of streaming logs, which lands data in S3
• Transition to a centralized Hive Metastore
• For each topic, dynamically add the table structure and create the table, or insert the data into the table if it already exists (see the sketch below)
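As a rough illustration (not the production code), the per-topic create-or-insert decision can be sketched as below. The helper names (ensureTable, addMissingColumns, appendToTable) and the table naming convention are hypothetical stand-ins for the steps sketched on the following slides.

```scala
// Minimal sketch of the per-topic dispatch (illustrative names; assumes an
// active SparkSession). Each helper is sketched on a later slide.
import org.apache.spark.sql.{DataFrame, SparkSession}

def upsertTopic(spark: SparkSession, topic: String, batch: DataFrame): Unit = {
  val table = s"raw.$topic"                  // hypothetical naming convention
  if (!spark.catalog.tableExists(table)) {
    ensureTable(spark, batch, table)         // create the table (sketched below)
  } else {
    addMissingColumns(spark, batch, table)   // evolve the schema (sketched below)
  }
  appendToTable(spark, batch, table)         // project and append (sketched below)
}
```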
DYNAMIC DDL
• What is Dynamic DDL?
• Dynamic DDL is adding structure (schema) to the data on the fly whenever the providers of the data change their structure.
• Why is Dynamic DDL needed?
• Providers of data change their structure constantly. Without Dynamic DDL, the table schema is hard-coded and has to be manually updated based on the changes in the incoming data.
• All of the aggregation SQL would have to be manually updated due to the schema change.
• Faster turnaround for data ingestion: data can be ingested and made available within minutes (sometimes seconds).
• How did we do this?
• Using Spark SQL/DataFrame
• See the example on the next slide
DYNAMIC DDL
• Example input record:
{"_data":{"record":{"id": "1", "first_name": "John", "last_name": "Fork", "state": "California", "city": "san Mateo"}, "log_ts":"2016-07-20T00:06:01Z"}}
• Flatten the data first into a fixed schema of key/value rows:
{"record_key":"state","record_value":"California","id":"1","log_ts":"2016-07-20T00:06:01Z"}
{"record_key":"last_name","record_value":"Fork","id":"1","log_ts":"2016-07-20T00:06:01Z"}
{"record_key":"city","record_value":"san Mateo","id":"1","log_ts":"2016-07-20T00:06:01Z"}
{"record_key":"first_name","record_value":"John","id":"1","log_ts":"2016-07-20T00:06:01Z"}
• Then pivot the keys back into a dynamically generated schema:
SELECT MAX(CASE WHEN record_key = 'state' THEN record_value ELSE null END) AS data_record_state,
       MAX(CASE WHEN record_key = 'last_name' THEN record_value ELSE null END) AS data_record_last_name,
       MAX(CASE WHEN record_key = 'first_name' THEN record_value ELSE null END) AS data_record_first_name,
       MAX(CASE WHEN record_key = 'city' THEN record_value ELSE null END) AS data_record_city,
       id AS data_record_id, log_ts AS data_log_ts
FROM test GROUP BY id, log_ts
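The same flatten-then-pivot flow can be reproduced with the DataFrame API. This is a minimal sketch (not the exact GoPro code), assuming Spark 2.2+ (for spark.read.json on a Dataset[String]) and an active SparkSession named spark; in real use the list of CASE WHEN branches would itself be generated from the observed record keys.

```scala
import org.apache.spark.sql.functions._
import spark.implicits._

val json = Seq("""{"_data":{"record":{"id":"1","first_name":"John","last_name":"Fork","state":"California","city":"san Mateo"},"log_ts":"2016-07-20T00:06:01Z"}}""")
val raw = spark.read.json(json.toDS)

// Flatten: every record attribute except the id becomes a (record_key, record_value) row.
val recordKeys = raw.select(col("_data.record.*")).columns.filterNot(_ == "id")
val kvCols = recordKeys.map { k =>
  struct(lit(k).as("record_key"), col(s"_data.record.$k").cast("string").as("record_value"))
}
val flattened = raw
  .select(col("_data.record.id").as("id"), col("_data.log_ts").as("log_ts"),
          explode(array(kvCols: _*)).as("kv"))
  .select($"kv.record_key", $"kv.record_value", $"id", $"log_ts")

// Pivot the key/value rows back into one wide row per (id, log_ts):
// the dynamically generated schema from the slide.
flattened.createOrReplaceTempView("test")
spark.sql("""
  SELECT MAX(CASE WHEN record_key = 'state'      THEN record_value END) AS data_record_state,
         MAX(CASE WHEN record_key = 'last_name'  THEN record_value END) AS data_record_last_name,
         MAX(CASE WHEN record_key = 'first_name' THEN record_value END) AS data_record_first_name,
         MAX(CASE WHEN record_key = 'city'       THEN record_value END) AS data_record_city,
         id AS data_record_id, log_ts AS data_log_ts
  FROM test GROUP BY id, log_ts
""").show(truncate = false)
```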
DYNAMIC DDL USING SPARK SQL/DATAFRAME
• Code snippet (shown as a screenshot in the deck) of Dynamic DDL transforming new JSON attributes into relational columns:
  • Add the partition columns
  • Manually create the table due to a bug in Spark (a sketch of this step follows)
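Since the actual snippet is a screenshot, here is a hedged sketch of the step with illustrative column and partition names; the point is that the DDL is built by hand because writing a partitioned Parquet table via saveAsTable was unreliable in the Spark versions used.

```scala
// Sketch of "add the partition columns and create the table manually".
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.lit

def ensureTable(spark: SparkSession, df: DataFrame, table: String): Unit = {
  // Add the partition column (e.g. the ingestion date) to the incoming frame.
  val withParts = df.withColumn("part_date", lit(java.time.LocalDate.now.toString))
  if (!spark.catalog.tableExists(table)) {
    // Build the column list from the DataFrame's inferred schema.
    val dataCols = withParts.schema.fields
      .filterNot(_.name == "part_date")
      .map(f => s"`${f.name}` ${f.dataType.sql}")
      .mkString(", ")
    spark.sql(
      s"""CREATE TABLE $table ($dataCols)
         |PARTITIONED BY (part_date STRING)
         |STORED AS PARQUET""".stripMargin)
  }
}
```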
DYNAMIC DDL USING SPARK SQL/DATAFRAME
• Add the new columns that exist in the incoming data frame but do not exist yet in the destination table (sketched below)
• This syntax no longer works after upgrading to Spark 2.x
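The pre-2.x logic amounted to diffing the incoming schema against the table and issuing ALTER TABLE for the difference; a sketch with illustrative names follows. The spark.sql call is the part that early Spark 2.x releases rejected (ALTER TABLE ... ADD COLUMNS was dropped from the parser before being restored in later versions), which motivates the workarounds on the next slide.

```scala
// Sketch of the schema diff: add attributes that appear in the incoming
// DataFrame but not yet in the destination table.
import org.apache.spark.sql.{DataFrame, SparkSession}

def addMissingColumns(spark: SparkSession, df: DataFrame, table: String): Unit = {
  val existing = spark.table(table).schema.fieldNames.map(_.toLowerCase).toSet
  val missing  = df.schema.fields.filterNot(f => existing.contains(f.name.toLowerCase))
  missing.foreach { f =>
    // This statement fails on early Spark 2.x; see the workarounds below.
    spark.sql(s"ALTER TABLE $table ADD COLUMNS (`${f.name}` ${f.dataType.sql})")
  }
}
```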
DYNAMIC DDL USING SPARK SQL/DATAFRAME
Three temporary ways to solve the problem in Spark 2.x (the first is sketched below):
• Launch a HiveServer2 service, then use JDBC to call Hive and alter the table
• Use Spark to connect directly to the Hive metastore, then update the metadata
• Update the Spark source code to support the ALTER TABLE syntax and repackage it
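A sketch of the first workaround, assuming a HiveServer2 endpoint at hive-server:10000 and the standard Hive JDBC driver; host, port, and credentials are illustrative.

```scala
// Workaround #1: route the ALTER TABLE through HiveServer2 over JDBC,
// bypassing Spark's SQL parser entirely.
import java.sql.DriverManager

def alterViaHiveServer2(ddl: String): Unit = {
  Class.forName("org.apache.hive.jdbc.HiveDriver")
  val conn = DriverManager.getConnection("jdbc:hive2://hive-server:10000/default", "etl_user", "")
  try {
    val stmt = conn.createStatement()
    stmt.execute(ddl) // e.g. "ALTER TABLE events ADD COLUMNS (`new_col` STRING)"
    stmt.close()
  } finally conn.close()
}
```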
DYNAMIC DDL USING SPARK SQL/DATAFRAME
• Project all columns from the table
• Append the data into the destination table (see the sketch below)
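A sketch of the projection-and-append step: insertInto resolves columns by position, so the incoming frame is first projected to the destination table's column order, with columns absent from this batch filled with nulls. Names are illustrative.

```scala
// Sketch: align the incoming frame with the destination schema, then append.
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, lit}

def appendToTable(spark: SparkSession, df: DataFrame, table: String): Unit = {
  val dest = spark.table(table).schema
  val projected = dest.fields.map { f =>
    if (df.columns.contains(f.name)) col(f.name).cast(f.dataType)
    else lit(null).cast(f.dataType).as(f.name)   // column absent in this batch
  }
  // insertInto matches by position, hence the projection above.
  df.select(projected: _*).write.insertInto(table)
}
```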
DYNAMIC DDL USING SPARK SQL/DATAFRAME
Add the new partition key
• Reprocessing the DDL table with a new partition key (tuning tips; see the sketch below):
  • Choose the partition key wisely
  • Use coalesce if there are too many partitions
  • Use coalesce to control the number of job tasks
  • Use a filter if the data is still too large
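A sketch of those tips together, with illustrative table, column, and path names:

```scala
// Filter the slice being reprocessed if the input is still too large, derive
// the new partition key, and coalesce to control tasks and output file counts.
import org.apache.spark.sql.functions.substring
import spark.implicits._

val reprocessed = spark.table("events")                        // illustrative
  .filter($"data_log_ts" >= "2016-07-01")                      // trim the input
  .withColumn("part_date", substring($"data_log_ts", 1, 10))   // new partition key
  .coalesce(64)                                                // cap tasks/output files

reprocessed.write.mode("overwrite")
  .partitionBy("part_date")
  .parquet("s3a://bucket/events_by_date/")                     // illustrative path
```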
Using Cloud-based Services
USING S3: WHAT IS S3?
• S3 is not a file system.
• S3 is an object store. Similar to a key-value store.
• S3 objects are presented in a hierarchical view but are not stored in that manner.
• S3 objects are stored with a key derived from a “path”.
• The key is used to fan out the objects across shards.
• The path is for display purposes only. Only the first 3 to 4 characters are used for sharding.
• S3 does not have strong transactional semantics but instead has eventual consistency.
• S3 is not appropriate for realtime updates.
• S3 is suited for longer term storage.
USING S3: BEHAVIORS
• S3 has similar behaviors to HDFS but even more extreme.
• Larger latencies
• Larger files/writes – Think GBs
• Write and read latencies are larger but the bandwidth is much larger with S3.
• Thus throughput can be increased with parallel writers (same latency but more throughput through parallel operations)
• Partition your RDDs/DataFrames and increase your workers/executors to
optimize the parallelism.
• Each write/read has more overhead due to the web service calls.
• So use larger buffers.
• Match the size of your HDFS buffers/blocks if reading/writing from/to HDFS.
• Collect data for longer durations before writing large buffers in parallel to S3.
• Retry logic – Writes to S3 can and will fail.
• Cannot stream to S3 – Complete files must be uploaded.
• Technically, you can simulate streaming with multipart upload.
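For example, a Spark write can be shaped to S3's latency/bandwidth profile by batching longer and then repartitioning so each task uploads one large object in parallel; a sketch with illustrative names:

```scala
// Favor fewer, larger objects uploaded in parallel: repartitioning to the
// cluster's parallelism gives each task one large buffered file to upload.
val events = spark.table("events")   // illustrative source
events.repartition(spark.sparkContext.defaultParallelism)
  .write.mode("append")
  .parquet("s3a://bucket/landing/events/")
```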
USING S3: TIPS
• Tips for using S3 with HDFS
• Use the s3a scheme.
• Many optimizations including buffering options (disk-based, on-heap, or off-heap) and incremental parallel uploads (S3A Fast Upload).
• More here: http://hadoop.apache.org/docs/current/hadoop-aws/tools/hadoop-aws/index.html#S3A
• Don’t use rename/move.
• Moves are great for HDFS to support better transactional semantics when
streaming files.
• For S3, moves/renames are copy and delete operations which can be very slow
especially due to the eventual consistency.
• Other advanced S3 techniques:
• Hash object names to better shard the objects in a bucket.
• Use multiple buckets to increase bandwidth.
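As a starting point, the s3a options above can be set on the Hadoop configuration. These are Hadoop 2.8-era property names; verify them against the hadoop-aws documentation linked above for your version.

```scala
// Illustrative s3a tuning; property names per the hadoop-aws docs.
val hconf = spark.sparkContext.hadoopConfiguration
hconf.set("fs.s3a.fast.upload", "true")           // S3A Fast Upload: incremental parallel uploads
hconf.set("fs.s3a.fast.upload.buffer", "disk")    // or "array" (on-heap) / "bytebuffer" (off-heap)
hconf.set("fs.s3a.multipart.size", "104857600")   // 100 MB multipart chunks
hconf.set("fs.s3a.connection.maximum", "100")     // more parallel S3 connections
```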
QUESTIONS?
Q & A