Why you should care about
data layout in the file system
Cheng Lian, @liancheng
Vida Ha, @femineer
Spark Summit 2017
1
TEAM
About Databricks
Started Spark project (now Apache Spark) at UC Berkeley in 2009
2
PRODUCT
Unified Analytics Platform
MISSION
Making Big Data Simple
Apache Spark is a
powerful framework
with some temper
3
4
Just like
super mario
5
Serve him the
right ingredients
6
Powers up and
gets more efficient
7
Keep serving
8
He even knows
how to Spark!
9
However,
once served
a wrong dish...
10
Meh...
11
And sometimes...
12
It can be messy...
13
Secret sauces
we feed Spark
13
File Formats
14
Choosing a compression scheme
15
The obvious
• Compression ratio: the higher the better
• De/compression speed: the faster the better
Choosing a compression scheme
16
Splittable vs. non-splittable
• Affects parallelism, crucial for big data
• Common splittable compression schemes
• LZ4, Snappy, BZip2, LZO, etc.
• GZip is non-splittable
• Still common if file sizes are << 1GB
• Still applicable for Parquet
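For illustration, a minimal sketch of picking a codec per write, assuming a spark-shell style SparkSession named spark (paths and the toy DataFrame are placeholders):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("compression-demo").getOrCreate()
val df = spark.range(1000).toDF("id")

// Snappy (Spark's default for Parquet): fast de/compression
df.write.option("compression", "snappy").parquet("/tmp/albums_snappy")

// GZip: better ratio; fine inside Parquet (splitting happens per row group),
// but avoid it for large raw text files, which become non-splittable
df.write.option("compression", "gzip").parquet("/tmp/albums_gzip")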
Columnar formats
Smart, analytics-friendly, optimized for big data
• Support for nested data types
• Efficient data skipping
• Column pruning
• Min/max statistics based predicate push-down
• Nice interoperability
• Examples:
• Spark SQL built-in support: Apache Parquet and Apache ORC
• Newly emerging: Apache CarbonData and Spinach
17
Columnar formats
Parquet
• Apache Spark default output format
• Usually the best practice for Spark SQL
• Relatively heavy write path
• Worth the time to encode for repeated analytics scenarios
• Does not support fine grained appending
• Not ideal for, e.g., collecting logs
• Check out Parquet presentations for more details
18
Semi-structured text formats
Sort of structured but not self-describing
• Excellent write path performance but slow on the read path
• Good candidates for collecting raw data (e.g., logs)
• Subject to inconsistent and/or malformed records
• Schema inference provided by Spark (for JSON and CSV)
• Sampling-based
• Handy for exploratory scenarios but can be inaccurate
• Always specify an accurate schema in production
19
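A minimal sketch of specifying an explicit schema on the read path (field names and the path are hypothetical; spark is a SparkSession):

import org.apache.spark.sql.types._

// Hand-writing the schema avoids the sampling pass and
// guarantees consistent types across runs
val logSchema = StructType(Seq(
  StructField("timestamp", TimestampType),
  StructField("level", StringType),
  StructField("message", StringType)
))

val logs = spark.read.schema(logSchema).json("/data/raw/logs") // placeholder path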
Semi-structured text formats
JSON
• Supported by Apache Spark out of the box
• One JSON object per line for fast file splitting
• JSON object: map or struct?
• Spark schema inference always treats JSON objects as structs
• Watch out for arbitrary number of keys (may OOM executors)
• Specify an accurate schema if you decide to stick with maps
20
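A sketch of declaring such a column as a map instead of relying on struct inference (schema and path are hypothetical):

import org.apache.spark.sql.types._

// If "tags" carries arbitrary keys, inference would explode it into a struct
// with one field per distinct key; a map keeps the schema bounded
val eventSchema = StructType(Seq(
  StructField("id", LongType),
  StructField("tags", MapType(StringType, StringType))
))

val events = spark.read.schema(eventSchema).json("/data/raw/events") // placeholder path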
Semi-structured text formats
JSON
• Malformed records
• Bad records are collected into the column _corrupt_record
• All other columns are set to null
21
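A sketch of separating good and bad records under the default PERMISSIVE mode (path is a placeholder; spark and its implicits are assumed in scope):

import spark.implicits._

// PERMISSIVE is the default mode: bad records land in the corrupt-record
// column, and all other columns are null for that row.
// (With an explicit schema, include _corrupt_record in it yourself.)
val parsed = spark.read
  .option("mode", "PERMISSIVE")
  .option("columnNameOfCorruptRecord", "_corrupt_record")
  .json("/data/raw/events") // placeholder path

// Some Spark versions require caching before filtering on the corrupt column
parsed.cache()
val bad  = parsed.filter($"_corrupt_record".isNotNull)
val good = parsed.filter($"_corrupt_record".isNull).drop("_corrupt_record")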
Semi-structured text formats
CSV
• Supported by Spark 2.x out of the box
• Check out the spark-csv package for Spark 1.x
• Often used for handling legacy data providers & consumers
• Lacks a standard file specification
– Separator, escaping, quoting, etc.
• Lacks support for nested data types
22
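A sketch of pinning the CSV dialect down explicitly (option values and path are illustrative only):

// Because CSV has no real specification, spell out the dialect
val legacy = spark.read
  .option("header", "true")
  .option("sep", ";")
  .option("quote", "\"")
  .option("escape", "\\")
  .csv("/data/legacy/export") // placeholder path
// In production, pair this with an explicit .schema(...) as well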
Raw text files
23
Arbitrary line-based text files
• Splitting files into lines using spark.read.text()
• Keep your lines a reasonable size
• Keep file size < 1GB if compressed with a non-splittable
compression scheme (e.g., GZip)
• Handling inevitable malformed data
• Use a filter() transformation to drop bad lines, or
• Use a map() transformation to fix bad lines (see the sketch below)
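A minimal sketch of both approaches, assuming a SparkSession named spark (the path and the specific drop/fix logic are hypothetical):

import org.apache.spark.sql.functions.length
import spark.implicits._

val lines = spark.read.text("/data/raw/app.log") // placeholder path

// filter(): drop bad lines (blanks and comments, purely illustrative)
val kept = lines.filter(length($"value") > 0 && !$"value".startsWith("#"))

// map(): repair bad lines instead of dropping them (illustrative fix)
val fixed = lines.as[String].map(_.replace("\r", ""))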
Directory layout
24
Partitioning
(Diagram: a Hive-style partitioned directory tree; year=2017/ contains genre=classic/ and genre=folk/, each holding album files)
25
Overview
• Coarse-grained data skipping
• Available for both persisted
tables and raw directories
• Automatically discovers Hive-style partitioned directories
Partitioning
SQL
CREATE TABLE ratings
USING PARQUET
PARTITIONED BY (year, genre)
AS SELECT artist, rating, year, genre
FROM music
DataFrame API
spark
  .table("music")
  .select('artist, 'rating, 'year, 'genre)
  .write
  .format("parquet")
  .partitionBy("year", "genre")
  .saveAsTable("ratings")
26
Partitioning
(Diagram: the same partitioned tree, year=2017/ with genre=classic/ and genre=folk/ subdirectories; pruning scans only the directories that match the filter)
27
Filter predicates
Use simple filter predicates
containing partition columns to
leverage partition pruning
Partitioning
Filter predicates that leverage partition pruning:
• year = 2000 AND genre = 'folk'
• year > 2000 AND rating > 3
• year > 2000 OR genre <> 'rock'
28
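For instance, a sketch of queries shaped to benefit from pruning (table and columns follow the ratings example above):

import spark.implicits._

// Both predicates reference partition columns, so Spark prunes
// directories instead of listing and scanning everything
val folk2000 = spark.table("ratings")
  .where($"year" === 2000 && $"genre" === "folk")

// Same idea in SQL
spark.sql("SELECT * FROM ratings WHERE year > 2000 OR genre <> 'rock'")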
Partitioning
29
Filter predicates that defeat partition pruning:
• year > 2000 OR rating = 5
• year > rating
Partitioning
Avoid excessive partitions
• Stresses the metastore for persisted tables
• Stresses the file system when reading directly from the file system
• Suggestions
• Avoid using too many partition columns
• Avoid using partition columns with too many distinct values
– Try hashing the values
– E.g., partition by first letter of first name rather than first name
30
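A sketch of deriving such a low-cardinality partition column (table, column names, and path are hypothetical):

import org.apache.spark.sql.functions.{substring, upper}
import spark.implicits._

// Partition by the first letter of first_name instead of the
// full high-cardinality first_name column
val users = spark.table("users") // hypothetical table
users
  .withColumn("name_initial", upper(substring($"first_name", 1, 1)))
  .write
  .partitionBy("name_initial")
  .parquet("/data/users_by_initial") // placeholder path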
Partitioning
Scalable partition handling
Using persisted partitioned tables with Spark 2.1+
• Per-partition metadata gets persisted into the metastore
• Avoids unnecessary partition discovery (esp. valuable for S3)
Check out our blog post for more details
31
Bucketing
Overview
• Pre-shuffles and optionally pre-sorts the data while writing
• Layout information gets persisted in the metastore
• Avoids shuffling and sorting when joining large datasets
• Only available for persisted tables
32
Bucketing
SQL
CREATE TABLE ratings
USING PARQUET
PARTITIONED BY (year, genre)
CLUSTERED BY (rating) INTO 5 BUCKETS
SORTED BY (rating)
AS SELECT artist, rating, year, genre
FROM music
DataFrame
ratings
  .select('artist, 'rating, 'year, 'genre)
  .write
  .format("parquet")
  .partitionBy("year", "genre")
  .bucketBy(5, "rating")
  .sortBy("rating")
  .saveAsTable("ratings")
33
Bucketing
In combo with columnar formats
• Bucketing
• Per-bucket sorting
• Columnar formats
• Efficient data skipping based on min/max statistics
• Works best when the searched columns are sorted
34
Bucketing
35
Bucketing
36
(Diagram: per-file min/max statistics enabling data skipping: min=0, max=99; min=100, max=199; min=200, max=249)
Bucketing
In combo with columnar formats
Perfect combination, makes your Spark jobs FLY!
37
More tips
38
File size and compaction
Avoid small files
• Cause excessive parallelism
• Spark 2.x improves this by packing small files
• Cause extra file metadata operations
• Particularly bad when hosted on S3
39
File size and compaction
40
How to control output file sizes
• In general, one task in the output stage writes one file
• Tune parallelism of the output stage
• coalesce(N)
• Reduces parallelism for small jobs
• repartition(N)
• Increases parallelism for all jobs, or
• Reduces parallelism of the final output stage for large jobs
• Still preserves high parallelism for previous stages
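A sketch of both knobs (the DataFrame, file counts, and paths are placeholders):

val df = spark.table("music") // stand-in for any DataFrame

// coalesce(): narrows to fewer output files without a shuffle; cheap, but it
// also lowers the parallelism of the stage it merges into, so prefer small jobs
df.coalesce(8).write.parquet("/data/out_small")

// repartition(): a full shuffle; previous stages keep high parallelism and
// only the output stage is resized, so prefer it for large jobs
df.repartition(64).write.parquet("/data/out_large")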
True story
Customer
• Spark ORC read performance is much slower than Parquet
• The same query took
• 3 seconds on a Parquet dataset
• 4 minutes on an equivalent ORC dataset
41
True story
Me
• Ran a simple count(*), which took
• Seconds on the Parquet dataset with a handful of I/O requests
• 35 minutes on the ORC dataset with tens of thousands of I/O requests
• Most task execution threads are reading ORC stripe footers
42
True story
43
True story
44
import org.apache.hadoop.hive.ql.io.orc._
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

val conf = new Configuration

// Counts the stripes in a single ORC file by reading the
// stripe-level statistics stored in the file footer
def countStripes(file: String): Int = {
  val path = new Path(file)
  val reader = OrcFile.createReader(path, OrcFile.readerOptions(conf))
  val metadata = reader.getMetadata
  metadata.getStripeStatistics.size
}
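A hypothetical way to apply countStripes across a few files (paths are placeholders):

val files = Seq(
  "s3a://bucket/table/part-00000.orc",
  "s3a://bucket/table/part-00001.orc"
)
files
  .map(f => f -> countStripes(f))
  .sortBy { case (_, n) => -n }       // worst offenders first
  .foreach { case (f, n) => println(s"$f: $n stripes") }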
True story
45
Maximum file size: ~15 MB
Maximum ORC stripe count: ~1,400
True story
46
Root cause
Malformed (but not corrupted) ORC dataset
• ORC readers read the footer first before consuming a stripe
• ~1,400 stripes within a single file as small as 15 MB
• ~1,400 x 2 read requests issued to S3 for merely 15 MB of data
Much worse than even CSV, not to mention Parquet
True story
48
Why?
• Tiny ORC files (~10 KB) generated by Streaming jobs
• Resulting in one tiny ORC stripe inside each ORC file
• The footers might take even more space than the actual data!
True story
49
Why?
Tiny files got compacted into larger ones using
ALTER TABLE ... PARTITION (...) CONCATENATE;
The CONCATENATE command just, well, concatenated those tiny
stripes and produced larger (~15 MB) files with a huge number of
tiny stripes.
True story
50
Lessons learned
Again, avoid writing small files in columnar formats
• Output files using CSV or JSON for Streaming jobs
• For better write path performance
• Compact small files into large chunks of columnar files later
• For better read path performance
True story
51
The cure
Simply read the ORC dataset and write it back using
spark.read.orc(input).write.orc(output)
So that stripes are adjusted into more reasonable sizes.
Schema evolution
Columns come and go
• Never ever change the data type of a published column
• Columns with the same name should have the same data type
• If you really dislike the data type of some column
• Add a new column with a new name and the right data type
• Deprecate the old one
• Optionally, drop it after updating all downstream consumers
52
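A sketch of such a migration (column names are hypothetical):

import org.apache.spark.sql.functions.col

// Publish rating_double alongside rating rather than
// changing rating's type in place
val ratings = spark.table("ratings")
val migrated = ratings.withColumn("rating_double", col("rating").cast("double"))
// ...once every consumer reads rating_double, the old column can be dropped:
val retired = migrated.drop("rating")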
Schema evolution
Columns come and go
Spark built-in data sources that support schema evolution
• JSON
• Parquet
• ORC
53
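For Parquet specifically, a sketch of reading a directory whose files have compatible but evolving schemas (path is a placeholder; mergeSchema is the standard Spark read option):

// Parquet can reconcile files that differ by added or removed columns
val merged = spark.read
  .option("mergeSchema", "true")
  .parquet("/data/ratings") // placeholder path
merged.printSchema()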
Schema evolution
Common columnar formats are less tolerant of data type
mismatch. E.g.:
• INT cannot be promoted to LONG
• FLOAT cannot be promoted to DOUBLE
JSON is more tolerant, though
• LONG → DOUBLE → STRING
54
True story
Customer
Parquet dataset corrupted!!! HALP!!!
55
True story
What happened?
Original schema
• {col1: DECIMAL(19, 4), col2: INT}
Accidentally appended data with schema
• {col1: DOUBLE, col2: DOUBLE}
All files written into the same directory
56
True story
What happened?
Common columnar formats are less tolerant of data type
mismatch. E.g.:
• INT cannot be promoted to LONG
• FLOAT cannot be promoted to DOUBLE
Parquet considered these schemas incompatible and refused to merge them.
57
True story
BTW
JSON schema inference is more tolerant
• LONG → DOUBLE → STRING
However
• JSON is NOT suitable for analytics scenarios
• Schema inference is unreliable and not suitable for production
58
True story
The cure
Correct the schema
• Filter out all the files with the wrong schema
• Rewrite those files using the correct schema
Exhausting, because all files were appended into a single directory
59
True story
Lessons learned
• Be very careful on the write path
• Consider partitioning when possible
• Better read path performance
• Easier to fix the data when something goes wrong
60
Recap
File formats
• Compression schemes
• Columnar (Parquet, ORC)
• Semi-structured (JSON, CSV)
• Raw text format
Directory layout
• Partitioning
• Bucketing
Other tips
• File sizes and compaction
• Schema evolution
61
UNIFIED ANALYTICS PLATFORM
Try Apache Spark in Databricks!
• Collaborative cloud environment
• Free version (community edition)
62
DATABRICKS RUNTIME 3.0
• Apache Spark - optimized for the cloud
• Caching and optimization layer - DBIO
• Enterprise security - DBES
Try for free today.
databricks.com
Early draft available
for free today!
go.databricks.com/book
63
Thank you
Q & A
64