FROM FLAT FILES TO DECONSTRUCTED
DATABASE
The evolution and future of the Big Data ecosystem.
Julien Le Dem
@J_
julien.ledem@wework.com
April 2018
Julien Le Dem
@J_
Principal Data Engineer
• Author of Parquet
• Apache member
• Apache PMCs: Arrow, Kudu, Heron, Incubator, Pig, Parquet, Tez
• Used Hadoop first at Yahoo in 2007
• Formerly Twitter Data platform and Dremio
Agenda
❖ At the beginning there was Hadoop (2005)
❖ Actually, SQL was invented in the 70s
“MapReduce: A major step backwards”
❖ The deconstructed database
❖ What next?
At the beginning there was Hadoop
Hadoop
Storage: a distributed file system
Execution: MapReduce
Based on Google's GFS and MapReduce papers.
Great at looking for a needle in a haystack … with snowplows
Original Hadoop abstractions
Execution: Map/Reduce
Storage: file system
Just flat files:
● Any binary
● No schema
● No standard
● The job must know how to split the data
[Diagram: map tasks (M) read locally, a shuffle redistributes data by key, reduce tasks (R) write locally]
Simple:
● Flexible/composable
● But: logic and optimizations tightly coupled
● Ties execution to persistence
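The Map/Reduce abstraction can be sketched in a few lines of plain Python. This is a toy, single-process sketch (not Hadoop's actual API): map emits key/value pairs, a shuffle groups them by key, and reduce folds each group.

```python
from collections import defaultdict

# Toy single-process MapReduce: word count.
def map_phase(records):
    for line in records:
        for word in line.split():
            yield (word, 1)          # emit (key, value)

def shuffle(pairs):
    groups = defaultdict(list)       # group values by key (the "shuffle")
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    return {key: sum(values) for key, values in groups.items()}

lines = ["a needle in a haystack", "a haystack"]
counts = reduce_phase(shuffle(map_phase(lines)))
print(counts["a"], counts["haystack"])  # 3 2
```

In real Hadoop the map and reduce tasks run on many machines and the shuffle moves data over the network; the job still has to know how to split and interpret the flat files itself.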
“MapReduce: A major step backwards”
(2008)
SQL
Databases have been around for a long time
• Originally SEQUEL, developed in the early 70s at IBM
• 1986: first SQL standard
• Updated in 1989, 1992, 1996, 1999, 2003, 2006, 2008, 2011, 2016
Relational model
• First described in 1969 by Edgar F. Codd
A global standard
• Universal data-access language
• Widely understood
Underlying principles of relational databases
Standard: SQL is understood by many
Separation of logic and optimization
Separation of schema and application
A high-level language (SQL) focusing on logic
Key mechanisms: indexing, optimizer, schema evolution, views, schemas, transactions, integrity constraints, referential integrity
Relational database abstractions

Storage: tables
An abstracted notion of data:
● Defines a schema
● Format/layout decoupled from queries
● Has statistics/indexing
● Can evolve over time

Execution: SQL
● Decouples the logic of the query from:
○ Optimizations
○ Data representation
○ Indexing
Example: SELECT a, AVG(b) FROM FOO GROUP BY a
Query evaluation
A well-integrated system: Syntax → Semantic → Optimization → Execution

SELECT f.c, AVG(b.d)
FROM FOO f
JOIN BAR b ON f.a = b.b
WHERE f.d = x
GROUP BY f.c

The user's query is parsed (syntax), analyzed (semantics), and optimized into a plan: Scan FOO and Scan BAR feed a JOIN, followed by a FILTER, a GROUP BY, and the final Select. During execution, storage provides table metadata (schema, stats, layout, …), serves columnar data, and accepts push downs.
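This separation of query logic from execution is easy to observe in any relational database. A sketch with Python's built-in sqlite3, using hypothetical tables FOO and BAR: the SQL states only the logic, and EXPLAIN QUERY PLAN shows what the optimizer decided.

```python
import sqlite3

# The query states only the logic; the engine chooses the physical plan.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE FOO (a INTEGER, c TEXT, d INTEGER);
    CREATE TABLE BAR (b INTEGER, d REAL);
    CREATE INDEX foo_d ON FOO (d);
""")
sql = """
    SELECT f.c, AVG(b.d)
    FROM FOO f JOIN BAR b ON f.a = b.b
    WHERE f.d = 5
    GROUP BY f.c
"""
# EXPLAIN QUERY PLAN exposes the optimizer's choices (index use, join order)
plan = list(conn.execute("EXPLAIN QUERY PLAN " + sql))
for row in plan:
    print(row)
```

The same query can run unchanged after an index is added or the layout changes; only the plan output differs.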
So why? Why Hadoop? Why snowplows?
The relational model was constrained. We need the right constraints, and the right abstractions.
Traditional SQL implementations:
• Flat schemas
• Inflexible schema evolution: history rewrite required
• No lower-level abstractions
• Not scalable
Constraints are good
They allow optimizations
• Statistics
• Pick the best join algorithm
• Change the data layout
• Reusable optimization logic
It's just code
Hadoop is flexible and scalable:
• Room to scale algorithms that are not part of the standard
• Machine learning
• Your imagination is the limit

Open source:
• You can improve it
• You can expand it
• You can reinvent it

No data shape constraint:
• Nested data structures
• Unstructured text with semantic annotations
• Graphs
• Non-uniform schemas
You can actually implement SQL with this

SELECT f.c, AVG(b.d)
FROM FOO f
JOIN BAR b ON f.a = b.b
WHERE f.d = x
GROUP BY f.c

A parser turns the query into a plan (Scan FOO and Scan BAR → JOIN → FILTER → GROUP BY → Select), and each stage maps onto Map/Reduce jobs for execution.
And they did…
(open-sourced 2009)
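As a toy sketch (plain Python over hypothetical data, not any real engine's implementation), the WHERE + JOIN + GROUP BY/AVG above can be expressed in the Map/Reduce style: filter at the scan, join on the key, and accumulate (sum, count) pairs per group so the average can be computed at the end.

```python
from collections import defaultdict

# Hypothetical toy data. FOO rows: (a, c, d); BAR rows: (b, d).
FOO = [(1, "x", 5), (2, "x", 5), (3, "y", 7)]
BAR = [(1, 10.0), (2, 20.0), (1, 30.0)]

# Index BAR by its join key (what a shuffle on b.b would achieve).
bar_by_key = defaultdict(list)
for b, d in BAR:
    bar_by_key[b].append(d)

# Join + group by as a fold, accumulating (sum, count) per group key f.c.
acc = defaultdict(lambda: [0.0, 0])          # c -> [sum(b.d), count]
for a, c, d in FOO:
    if d != 5:                               # WHERE f.d = 5
        continue
    for bd in bar_by_key.get(a, []):         # JOIN ON f.a = b.b
        acc[c][0] += bd
        acc[c][1] += 1

result = {c: s / n for c, (s, n) in acc.items()}   # AVG(b.d) per f.c
print(result)  # {'x': 20.0}
```

AVG is the classic reason the partial state must be (sum, count) rather than a running average: averages of averages are wrong when group sizes differ.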
10 years later
The deconstructed database
Photo: gamene, https://www.flickr.com/photos/gamene/4007091102
The components: query model, machine learning, storage, batch execution, data exchange, stream processing.

We can mix and match individual components.
Specialized components (*not exhaustive!): stream processing, storage, SQL execution, stream persistence, streams, resource management, table formats (Iceberg), machine learning.
We can mix and match individual components

Storage: row-oriented or columnar; immutable or mutable; stream storage vs analytics-optimized
Query model: SQL, functional, …
Machine learning: training models
Data exchange: row-oriented or columnar
Batch execution: optimized for high throughput and historical analysis
Streaming execution: optimized for high-throughput, low-latency processing
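The row-oriented vs columnar trade-off can be illustrated in plain Python (a sketch only; real exchange layers use typed buffers like Arrow's, not Python lists):

```python
from array import array

# Row-oriented: one tuple per record; reading one column touches every row.
rows = [(i, i * 2.0, "user%d" % i) for i in range(1000)]
sum_rows = sum(r[1] for r in rows)

# Columnar: one contiguous, typed buffer per column; scanning a single
# column is cache-friendly and skips the other columns entirely.
col_a = array("q", range(1000))
col_b = array("d", (i * 2.0 for i in range(1000)))
sum_cols = sum(col_b)

print(sum_rows == sum_cols)  # True: same data, different layout
```

Row orientation favors point lookups and serving; columnar layout favors analytical scans and vectorized execution, which is why both appear in the list above.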
Emergence of standard components
Columnar storage: Apache Parquet as the columnar representation at rest.
SQL parsing and optimization: Apache Calcite as a versatile query optimizer framework.
Schema model: Apache Avro as a pipeline-friendly schema for the analytics world.
Columnar exchange: Apache Arrow as the next-generation in-memory representation and no-overhead data exchange.
Table abstraction: Netflix's Iceberg has great potential to provide snapshot isolation and layout abstraction on top of distributed file systems.
The deconstructed database's optimizer: Calcite

Calcite covers the Syntax → Semantic → Optimization phases of query evaluation: schema plugins describe the storage, optimizer rules customize planning, and the optimized plan (Scan FOO and Scan BAR → JOIN → FILTER → GROUP BY → Select) is handed to an execution engine. Example:

SELECT f.c, AVG(b.d)
FROM FOO f
JOIN BAR b ON f.a = b.b
WHERE f.d = x
GROUP BY f.c
Apache Calcite is used in:
Streaming SQL: Apache Apex, Apache Flink, Apache SamzaSQL, Apache StormSQL, …
Batch SQL: Apache Hive, Apache Drill, Apache Phoenix, Apache Kylin, …
The deconstructed database's storage
Columnar or row-oriented; immutable or mutable; optimized for analytics or optimized for serving.

Query integration
To be performant, a query engine requires deep integration with the storage layer: implementing push downs, and a vectorized reader that produces data in an efficient representation (for example, Apache Arrow).
Storage: push downs

PROJECTION: read only the columns that are needed; columnar storage makes this efficient.
PREDICATE: evaluate filters during the scan to leverage storage properties (min/max stats, partitioning, sort, etc.), avoid decoding skipped data, and reduce IO.
AGGREGATION: implement aggregation during the scan to minimize materialization of intermediary data and reduce IO.
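These three push downs can be sketched in plain Python over a toy columnar file with per-chunk min/max stats (a hypothetical layout for illustration, not the Parquet format):

```python
# Toy columnar chunks, each carrying min/max stats for column "c".
chunks = [
    {"c": [1, 2, 3], "a": [10, 20, 30], "min_c": 1, "max_c": 3},
    {"c": [8, 9, 9], "a": [40, 50, 60], "min_c": 8, "max_c": 9},
    {"c": [4, 5, 6], "a": [70, 80, 90], "min_c": 4, "max_c": 6},
]

def scan(chunks, want_c):
    total, skipped = 0, 0
    for chunk in chunks:
        # Predicate push down: use stats to skip whole chunks without decoding.
        if want_c < chunk["min_c"] or want_c > chunk["max_c"]:
            skipped += 1
            continue
        # Projection push down: touch only columns "c" and "a".
        # Aggregation push down: sum during the scan, no intermediate rows.
        total += sum(a for c, a in zip(chunk["c"], chunk["a"]) if c == want_c)
    return total, skipped

total, skipped = scan(chunks, 5)
print(total, skipped)  # 80 2
```

Two of the three chunks are never decoded at all, which is exactly the IO saving the stats exist to enable.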
The deconstructed database's interchange: Apache Arrow

SELECT SUM(a)
FROM t
WHERE c = 5
GROUP BY b

Scanners read Parquet files with projection and predicate push downs (read only a and b); partial aggregations run over Arrow batches; a shuffle redistributes by key; final aggregations produce the result.
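The partial-aggregation pattern in that plan (aggregate within each partition before the shuffle, then merge) can be sketched in plain Python over hypothetical data:

```python
from collections import Counter

# SELECT SUM(a) FROM t WHERE c = 5 GROUP BY b, over two "scanner" partitions.
part1 = [(1, "x", 5), (2, "x", 5), (3, "y", 5), (4, "y", 1)]  # (a, b, c) rows
part2 = [(5, "x", 5), (6, "y", 5)]

def partial_agg(rows):
    agg = Counter()                  # per-partition partial sums
    for a, b, c in rows:
        if c == 5:                   # predicate applied at the scan
            agg[b] += a
    return agg

# Shuffle by key, then merge the partials into the final aggregate.
final = partial_agg(part1) + partial_agg(part2)
print(dict(final))  # {'x': 8, 'y': 9}
```

Aggregating before the shuffle means only one small record per group key crosses the network per partition, instead of every matching row.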
Storage: stream persistence
Open source projects exist, from incumbents to newer contenders. Features:
• Consumer state: how do we recover from failure?
• Snapshots
• Decoupling reads from writes
• Parallel reads
• Replication
• Data isolation
Big Data infrastructure blueprint
Data infra
• Persistence: stream persistence; data storage (S3, GCS, HDFS) in open formats (Parquet, Avro)
• Processing: stream processing feeds real-time publishing and real-time dashboards; batch processing feeds batch publishing, periodic dashboards, and interactive analysis
• Metadata: schema registry, dataset metadata, scheduling/job dependencies
• Monitoring/observability
Data APIs and UIs expose all of this to analysts and engineers.
The Future
Still improving

Better interoperability
• More efficient interoperability: continued Apache Arrow adoption

A better data abstraction
• A better metadata repository
• A better table abstraction: Netflix/Iceberg
• Common push down implementations (filter, projection, aggregation)

Better data governance
• Global data lineage
• Access control
• Protect privacy
• Record user consent
Some predictions

A common access layer
A distributed access service centralizes:
• Schema evolution
• Access control/anonymization
• Efficient push downs
• Efficient interoperability

A table abstraction layer (schema, push downs, access control, anonymization) sits between the engines (SQL: Impala, Drill, Presto, Hive, …; batch: Spark, MR, Beam; ML: TensorFlow, …) and the storage (file systems: HDFS, S3, GCS; other storage: HBase, Cassandra, Kudu, …).
A multi-tiered streaming/batch storage system

Batch-stream storage integration: a convergence of Kafka and Kudu.
• Schema evolution
• Access control/anonymization
• Efficient push downs
• Efficient interoperability

A time-based ingestion API feeds three tiers: 1) an in-memory row-oriented store, 2) an in-memory column-oriented store, 3) an on-disk column-oriented store. The stream consumer API (with projection, predicate, and aggregation push downs) mainly reads from the in-memory tiers; the batch consumer API (same push downs) mainly reads from the on-disk column-oriented store.
THANK YOU!
julien.ledem@wework.com
Julien Le Dem @J_
April 2018
We're hiring!
We're growing:
• WeWork doubled in size last year and the year before
• We expect to double in size this year as well
• Broadened scope: WeWork Labs, Powered by We, Flatiron School, Meetup, WeLive, …
WeWork in numbers
• WeWork has 242 locations in 72 cities internationally
• Over 210,000 members worldwide
• More than 20,000 member companies
Technology jobs, locations:
• San Francisco: platform teams
• New York
• Tel Aviv
• Montevideo
• Singapore
Contact: julien.ledem@wework.com
Questions?
Julien Le Dem @J_ julien.ledem@wework.com
April 2018
Parquet Hadoop Summit 2013
Julien Le Dem
 
Parquet Twitter Seattle open house
Parquet Twitter Seattle open houseParquet Twitter Seattle open house
Parquet Twitter Seattle open house
Julien Le Dem
 
From flat files to deconstructed database

  • 1. FROM FLAT FILES TO DECONSTRUCTED DATABASE The evolution and future of the Big Data ecosystem. Julien Le Dem @J_ [email protected] April 2018
  • 2. Julien Le Dem @J_ Principal Data Engineer • Author of Parquet • Apache member • Apache PMCs: Arrow, Kudu, Heron, Incubator, Pig, Parquet, Tez • Used Hadoop first at Yahoo in 2007 • Formerly Twitter Data platform and Dremio Julien
  • 3. Agenda ❖ At the beginning there was Hadoop (2005) ❖ Actually, SQL was invented in the 70s “MapReduce: A major step backwards” ❖ The deconstructed database ❖ What next?
  • 4. At the beginning there was Hadoop
  • 5. Hadoop Storage: A distributed file system Execution: Map Reduce Based on Google’s GFS and MapReduce papers
  • 6. Great at looking for a needle in a haystack
  • 7. Great at looking for a needle in a haystack … with snowplows
  • 8. Original Hadoop abstractions Execution Map/Reduce Storage File System Just flat files ● Any binary ● No schema ● No standard ● The job must know how to split the data M M M R R R Shuffle Read locally write locally Simple ● Flexible/Composable ● Logic and optimizations tightly coupled ● Ties execution with persistence
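The map/shuffle/reduce contract on this slide can be sketched in a few lines of plain Python — a toy illustration of the programming model, not Hadoop's actual API:

```python
from collections import defaultdict

def map_phase(records, mapper):
    """Apply the mapper to each input record, emitting (key, value) pairs."""
    for record in records:
        yield from mapper(record)

def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups, reducer):
    """Apply the reducer to each key and its grouped values."""
    return {key: reducer(key, values) for key, values in groups.items()}

# Word count: the canonical MapReduce job over "just flat files".
lines = ["the quick fox", "the lazy dog"]
mapper = lambda line: ((word, 1) for word in line.split())
reducer = lambda word, counts: sum(counts)
counts = reduce_phase(shuffle(map_phase(lines, mapper)), reducer)
# counts == {"the": 2, "quick": 1, "fox": 1, "lazy": 1, "dog": 1}
```

Note how the job itself supplies both the logic (mapper, reducer) and, implicitly, the knowledge of how to split and parse the data — the tight coupling the slide points out.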
  • 9. “MapReduce: A major step backwards” (2008)
  • 10. SQL Databases have been around for a long time • Originally SEQUEL developed in the early 70s at IBM • 1986: first SQL standard • Updated in 1989, 1992, 1996, 1999, 2003, 2006, 2008, 2011, 2016 Relational model Global Standard • Universal Data access language • Widely understood • First described in 1969 by Edgar F. Codd
  • 11. Underlying principles of relational databases Standard SQL is understood by many Separation of logic and optimization Separation of Schema and Application High level language focusing on logic (SQL) Indexing Optimizer Evolution Views Schemas Integrity Transactions Integrity constraints Referential integrity
  • 12. Storage Tables Abstracted notion of data ● Defines a Schema ● Format/layout decoupled from queries ● Has statistics/indexing ● Can evolve over time Execution SQL SQL ● Decouples logic of query from: ○ Optimizations ○ Data representation ○ Indexing Relational Database abstractions SELECT a, AVG(b) FROM FOO GROUP BY a FOO
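The decoupling the slide describes is easy to see with the standard library's sqlite3: the query states only the logic, while the engine chooses the aggregation algorithm, layout, and any indexing.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE FOO (a TEXT, b REAL)")
conn.executemany("INSERT INTO FOO VALUES (?, ?)",
                 [("x", 1.0), ("x", 3.0), ("y", 5.0)])
# The SQL describes *what* to compute; *how* it is computed
# (join/aggregation strategy, data representation) is the engine's job.
rows = conn.execute("SELECT a, AVG(b) FROM FOO GROUP BY a ORDER BY a").fetchall()
# rows == [("x", 2.0), ("y", 5.0)]
```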
  • 13. Query evaluation Syntax Semantic Optimization Execution
  • 14. A well integrated system Storage SELECT f.c, AVG(b.d) FROM FOO f JOIN BAR b ON f.a = b.b GROUP BY f.c WHERE f.d = x Select Scan FOO Scan BAR JOIN GROUP BY FILTER Select Scan FOO Scan BAR JOIN GROUP BY FILTER Syntax Semantic Optimization Execution Table Metadata (Schema, stats, layout,…) Columnar data Push downs Columnar data Columnar data Push downs Push downs user
  • 15. So why? Why Hadoop? Why Snowplows?
  • 16. The relational model was constrained. We need the right constraints and the right abstractions. Traditional SQL implementations: • Flat schema • Inflexible schema evolution • History rewrite required • No lower level abstractions • Not scalable. Constraints are good: they allow optimizations • Statistics • Pick the best join algorithm • Change the data layout • Reusable optimization logic
  • 17. It’s just code Hadoop is flexible and scalable • Room to scale algorithms that are not part of the standard • Machine learning • Your imagination is the limit No Data shape constraint Open source • You can improve it • You can expand it • You can reinvent it • Nested data structures • Unstructured text with semantic annotations • Graphs • Non-uniform schemas
  • 18. You can actually implement SQL with this
  • 19. SELECT f.c, AVG(b.d) FROM FOO f JOIN BAR b ON f.a = b.b GROUP BY f.c WHERE f.d = x Select Scan FOO Scan BAR JOIN GROUP BY FILTER Parser FOO Execution GROUP BY JOIN FILTER BAR And they did… (open-sourced 2009)
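The core trick behind SQL-on-MapReduce, which this slide alludes to, is that the GROUP BY column becomes the shuffle key; a toy sketch of the AVG above:

```python
from collections import defaultdict

# FOO rows as (a, b); the "map" step keys each row by the GROUP BY column.
foo = [("x", 1.0), ("x", 3.0), ("y", 5.0)]
shuffled = defaultdict(list)
for a, b in foo:                 # map: emit (group key, value)
    shuffled[a].append(b)        # shuffle: group by key
result = {a: sum(bs) / len(bs) for a, bs in shuffled.items()}  # reduce: AVG
# result == {"x": 2.0, "y": 5.0}
```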
  • 21. The deconstructed database Author: gamene https://www.flickr.com/photos/gamene/4007091102
  • 24. We can mix and match individual components *not exhaustive! Specialized Components Stream processing Storage Execution SQL Stream persistence Streams Resource management Iceberg Machine learning
  • 25. We can mix and match individual components Storage Row oriented or columnar Immutable or mutable Stream storage vs analytics optimized Query model SQL Functional … Machine Learning Training models Data Exchange Row oriented Columnar Batch Execution Optimized for high throughput and historical analysis Streaming Execution Optimized for High Throughput and Low latency processing
  • 26. Emergence of standard components
  • 27. Emergence of standard components Columnar Storage Apache Parquet as columnar representation at rest. SQL parsing and optimization Apache Calcite as a versatile query optimizer framework Schema model Apache Avro as pipeline friendly schema for the analytics world. Columnar Exchange Apache Arrow as the next generation in- memory representation and no-overhead data exchange Table abstraction Netflix’s Iceberg has a great potential to provide Snapshot isolation and layout abstraction on top of distributed file systems.
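To make the schema-model component concrete, here is what a small Avro record schema looks like (the record and field names are invented for the example); the optional field with a default is what lets Avro schemas evolve without rewriting history:

```python
import json

# Illustrative Avro schema: a record with an optional, evolvable field.
schema = {
    "type": "record",
    "name": "PageView",
    "fields": [
        {"name": "user_id", "type": "long"},
        {"name": "url", "type": "string"},
        # Schema evolution: a new optional field with a default is
        # backward compatible with readers of the older schema.
        {"name": "referrer", "type": ["null", "string"], "default": None},
    ],
}
print(json.dumps(schema, indent=2))
```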
  • 28. The deconstructed database’s optimizer: Calcite Storage Execution engine Schema plugins Optimizer rules SELECT f.c, AVG(b.d) FROM FOO f JOIN BAR b ON f.a = b.b GROUP BY f.c WHERE f.d = x Select Scan FOO Scan BAR JOIN GROUP BY FILTER Syntax Semantic Optimization Execution … Select Scan FOO Scan BAR JOIN GROUP BY FILTER
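An optimizer framework like Calcite works by applying rewrite rules to a plan tree. A single hypothetical rule — pushing a filter into the scan — can be sketched like this (the tuple encoding of plans is invented for the example):

```python
# Plans as nested tuples: ("filter", pred, child), ("scan", table, pushed_filters)
def push_filter_into_scan(plan):
    """One rewrite rule: merge a Filter sitting directly on a Scan
    into the scan node itself, so the storage layer can apply it."""
    if plan[0] == "filter" and plan[2][0] == "scan":
        _, pred, (_, table, pushed) = plan
        return ("scan", table, pushed + [pred])
    if len(plan) == 3 and isinstance(plan[2], tuple):
        return (plan[0], plan[1], push_filter_into_scan(plan[2]))
    return plan

plan = ("group_by", "f.c", ("filter", "f.d = x", ("scan", "FOO", [])))
optimized = push_filter_into_scan(plan)
# optimized == ("group_by", "f.c", ("scan", "FOO", ["f.d = x"]))
```

A real optimizer keeps a library of such rules and searches over their applications; the point here is only that rules operate on the plan, independently of any storage or execution engine.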
  • 29. Apache Calcite is used in: Streaming SQL • Apache Apex • Apache Flink • Apache SamzaSQL • Apache StormSQL … Batch SQL • Apache Hive • Apache Drill • Apache Phoenix • Apache Kylin …
  • 30. Columnar Row oriented Mutable The deconstructed database’s storage Optimized for analytics Optimized for serving Immutable Query integration To be performant a query engine requires deep integration with the storage layer. Implementing push down and a vectorized reader producing data in an efficient representation (for example Apache Arrow).
  • 31. Storage: Push downs PROJECTION Read only what you need PREDICATE Filter AGGREGATION Avoid materializing intermediary data To reduce IO, aggregation can also be implemented during the scan to: • minimize materialization of intermediary data Evaluate filters during scan to: • Leverage storage properties (min/max stats, partitioning, sort, etc) • Avoid decoding skipped data. • Reduce IO. Read only the columns that are needed: • Columnar storage makes this efficient.
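The three push downs can be demonstrated over toy "row groups" carrying min/max statistics, the way Parquet footers do (the data layout here is purely illustrative):

```python
# Columnar "row groups" with min/max stats per column, as in Parquet footers.
row_groups = [
    {"stats": {"c": (0, 3)}, "cols": {"a": [1, 2], "b": [10, 20], "c": [0, 3]}},
    {"stats": {"c": (5, 9)}, "cols": {"a": [3, 4], "b": [30, 40], "c": [5, 9]}},
]

def scan(row_groups, projection, pred_col, pred_val):
    for rg in row_groups:
        lo, hi = rg["stats"][pred_col]
        if not (lo <= pred_val <= hi):
            continue                  # predicate push down: skip the whole group
        keep = [i for i, v in enumerate(rg["cols"][pred_col]) if v == pred_val]
        # Projection push down: decode only the requested columns.
        yield {col: [rg["cols"][col][i] for i in keep] for col in projection}

out = list(scan(row_groups, ["a"], "c", 5))
# out == [{"a": [3]}] — the first row group was never decoded at all
```

An aggregation push down would go one step further and fold the kept values during the scan instead of materializing them.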
  • 32. The deconstructed database interchange: Apache Arrow Scanner Scanner Scanner Parquet files projection push down read only a and b Partial Agg Partial Agg Partial Agg Agg Agg Agg Shuffle Arrow batches Result SELECT SUM(a) FROM t WHERE c = 5 GROUP BY b Projection and predicate push downs
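The partial-aggregation pipeline in this diagram (SUM(a) WHERE c = 5 GROUP BY b over columnar batches) can be mimicked with plain Python dicts standing in for Arrow record batches:

```python
from collections import Counter

# Two columnar batches, as might flow between operators as Arrow record batches.
batches = [
    {"a": [1, 2, 4], "b": ["x", "y", "x"], "c": [5, 5, 7]},
    {"a": [8, 1], "b": ["y", "x"], "c": [5, 5]},
]

def partial_agg(batch):
    """SUM(a) WHERE c = 5 GROUP BY b, over a single batch."""
    acc = Counter()
    for a, b, c in zip(batch["a"], batch["b"], batch["c"]):
        if c == 5:
            acc[b] += a
    return acc

final = Counter()
for batch in batches:       # the final agg merges partials after the shuffle
    final += partial_agg(batch)
# dict(final) == {"x": 2, "y": 10}
```

The win Arrow brings is that the batches cross operator (and process) boundaries in this columnar form with no serialization overhead.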
  • 33. Incumbent Interesting Storage: Stream persistence Open source projects Features • State of consumer: how do we recover from failure • Snapshot • Decoupling reads from writes • Parallel reads • Replication • Data isolation
  • 36. Big Data infrastructure blueprint Data Infra Stream persistence Stream processing Batch processing Real time dashboards Interactive analysis Periodic dashboards Analyst Real time publishing Batch publishing Data API Data API Schema registry Datasets metadata Scheduling/ Job deps Persistence Legend: Processing Metadata Monitoring / Observability Data Storage (S3, GCS, HDFS) (Parquet, Avro) Data API UI Eng
  • 38. Still Improving Better interoperability • More efficient interoperability: Continued Apache Arrow adoption A better data abstraction • A better metadata repository • A better table abstraction: Netflix/Iceberg • Common Push down implementations (filter, projection, aggregation) Better data governance • Global Data Lineage • Access control • Protect privacy • Record User consent
  • 40. A common access layer Distributed access service Centralizes: • Schema evolution • Access control/anonymization • Efficient push downs • Efficient interoperability Table abstraction layer (Schema, push downs, access control, anonymization) SQL (Impala, Drill, Presto, Hive, …) Batch (Spark, MR, Beam) ML (TensorFlow, …) File System (HDFS, S3, GCS) Other storage (HBase, Cassandra, Kudu, …)
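Such an access layer amounts to one scan interface that every engine (SQL, batch, ML) goes through, so schema evolution, access control, and push downs live in a single place. A hypothetical sketch — interface and class names are invented:

```python
from abc import ABC, abstractmethod
from typing import Callable, Iterator, Optional

class Table(ABC):
    """Hypothetical access-layer contract shared by all engines."""

    @abstractmethod
    def schema(self) -> dict: ...

    @abstractmethod
    def scan(self, projection: list,
             predicate: Optional[Callable] = None) -> Iterator[dict]:
        """Return only the projected columns of rows matching the predicate."""

class InMemoryTable(Table):
    def __init__(self, rows):
        self.rows = rows

    def schema(self):
        return {k: type(v).__name__ for k, v in self.rows[0].items()}

    def scan(self, projection, predicate=None):
        for row in self.rows:
            if predicate is None or predicate(row):
                yield {col: row[col] for col in projection}

t = InMemoryTable([{"a": 1, "b": 2}, {"a": 3, "b": 4}])
out = list(t.scan(["b"], predicate=lambda r: r["a"] > 1))
# out == [{"b": 4}]
```

A production version would also enforce access control and anonymization inside `scan`, which is exactly why centralizing it pays off.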
  • 41. A multi tiered streaming batch storage system Batch-Stream storage integration Convergence of Kafka and Kudu • Schema evolution • Access control/anonymization • Efficient push downs • Efficient interoperability Time based Ingestion API 1) In memory row oriented store 2) in memory column oriented store 3) On Disk Column oriented store Stream consumer API Projection, Predicate, Aggregation push downs Batch consumer API Projection, Predicate, Aggregation push downs Mainly reads from here Mainly reads from here
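The tiers on this slide can be caricatured as a single class: fresh writes land in a row-oriented buffer for low-latency ingest, then get pivoted to columns for scan-friendly reads (the on-disk tier is omitted for brevity; the class and thresholds are illustrative, not a real system):

```python
class TieredStore:
    """Sketch of a tiered stream/batch store: row-oriented ingest buffer,
    periodically pivoted into an in-memory columnar tier."""

    def __init__(self, row_limit=2):
        self.row_tier = []        # 1) in-memory, row oriented: fast ingest
        self.column_tier = {}     # 2) in-memory, column oriented: fast scans
        self.row_limit = row_limit

    def ingest(self, row):
        self.row_tier.append(row)
        if len(self.row_tier) >= self.row_limit:
            self._pivot()

    def _pivot(self):
        # Pivot buffered rows into columns; a real system would later
        # flush this tier to an on-disk columnar format.
        for row in self.row_tier:
            for col, val in row.items():
                self.column_tier.setdefault(col, []).append(val)
        self.row_tier = []

    def scan_column(self, col):
        # Batch readers mostly hit the columnar tier; stream readers
        # mostly read the fresh row tier — here we union both.
        return self.column_tier.get(col, []) + [r[col] for r in self.row_tier]

s = TieredStore()
for i in range(3):
    s.ingest({"a": i})
# s.scan_column("a") == [0, 1, 2]
```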
  • 44. We’re growing We’re hiring! • WeWork doubled in size last year and the year before • We expect to double in size this year as well • Broadened scope: WeWork Labs, Powered by We, Flatiron School, Meetup, WeLive, … WeWork in numbers Technology jobs Locations • San Francisco: Platform teams • New York • Tel Aviv • Montevideo • Singapore • WeWork has 242 locations, in 72 cities internationally • Over 210,000 members worldwide • More than 20,000 member companies Contact: [email protected]