Agile Data
Lake?
An Oxymoron?
Agenda
1. Part 1 - Data Lake Overview
2. Part 2 - Technology Deep Dive
Please interrupt with questions/comments so I know which slides to focus on.
Part 1 - Data Lake Overview
Data Lake - Definition
Agile data lake? An oxymoron?
Data Lake - Definition - Martin Kleppman
"Hadoop opened up the possibility of indiscriminately dumping data into HDFS, and only later figuring out how to process it further.
By contrast, MPP databases typically require careful up-front modeling of the data and query patterns before importing the data
into the database's proprietary storage format.
From a purist's point of view, it may seem that this careful modeling and import is desirable, because it means users of the
database have better-quality data to work with. However, in practice, it appears that simply making data available quickly - even if it
is in a quirky, difficult-to-use, raw format [/schema] - is often more valuable than trying to decide on the ideal data model up front.
The idea is similar to a data warehouse: simply bringing data from various parts of a large organization together in one place is
valuable, because it enables joins across datasets that were previously disparate. The careful schema design required by an MPP
database slows down that centralised data collection; collecting data in its raw form, and worrying about schema design later,
allows the data collection to be speeded up (a concept sometimes known as a "data lake" or "enterprise data hub").
... the interpretation of the data becomes the consumer's problem (the schema-on-read approach). ... There may not even be one
ideal data model, but rather different views onto the data that are suitable for different purposes. Simply dumping data in its raw
form allows for several such transformations. This approach has been dubbed the sushi principle: "raw data is better".
Page 415, 416, Martin Kleppmann - Designing Data-Intensive Applications
Data Lake - Definition - Martin Fowler
> The idea [Data Lake] is to have a single store for all of the raw data that anyone in an organization might
need to analyze.
...
> But there is a vital distinction between the data lake and the data warehouse. The data lake stores raw
data, in whatever form the data source provides. There are no assumptions about the schema of the data,
each data source can use whatever schema it likes. It's up to the consumers of that data to make sense of
that data for their own purposes.
> data put into the lake is immutable
> The data lake is “schemaless” [or Schema-on-Read]
> storage is oriented around the notion of a large schemaless structure … HDFS
https://martinfowler.com/bliki/DataLake.html ← MUST READ!
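To make schema-on-read concrete, here is a minimal sketch (hypothetical data and field names, no real lake involved): the lake stores raw lines exactly as the source produced them, and each consumer applies its own interpretation at read time.

```scala
// Illustrative schema-on-read sketch: the same raw lines, two different
// consumer-side schemas. All names and data here are made up.
object SchemaOnRead {
  // Raw, "quirky" data exactly as the source system produced it.
  val rawLines: Seq[String] = Seq(
    "2019-06-01,alice,42.0",
    "2019-06-02,bob,13.5"
  )

  // Consumer A only cares about (user, amount).
  case class Spend(user: String, amount: Double)
  val spends: Seq[Spend] = rawLines.map { line =>
    val cols = line.split(',')
    Spend(cols(1), cols(2).toDouble)
  }

  // Consumer B reads the very same lines as (date, user) activity events.
  case class Activity(date: String, user: String)
  val activity: Seq[Activity] = rawLines.map { line =>
    val cols = line.split(',')
    Activity(cols(0), cols(1))
  }
}
```

Neither consumer's view is "the" schema; both are read-time interpretations of the same immutable raw data.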
Agile data lake? An oxymoron?
Prefer this: individuals and interactions / engineers

Over this: pretty GUI-based tools, deskilling, centralisation and processes
Trends (Sample = 11 Data Lakes)

| Good | Bad |
|---|---|
| Cost < £1m, business value in weeks/months | Cost many millions, years before business value |
| Schema on Read; documentation as code (internal Open Source) | Schema on Write; metastores, data dictionaries, Confluence |
| Cloud, PaaS, e.g. EMR, Dataproc; S3 for long-term storage | On-prem, Cloudera/Hortonworks; HDFS for long-term storage |
| Scala/Java apps; Jenkins, CircleCI, etc with Bash for deployment and lightweight scheduling | Hive/Pig scripts, manual releases, heavyweight scheduling (Oozie, Airflow, Control-M, Luigi, etc) |
| High ratio (80%+) of committed developers/engineers that write code; small, highly skilled in-house teams | High ratio of involved people who do not commit code; large, low-skilled offshore teams |
| Flat structure, cross-functional teams; Agile | Hierarchical, authoritarian; Waterfall |
Trends (Sample = 11 Data Lakes)

| Success | Failure |
|---|---|
| XP, KISS, YAGNI | BDUF, tools, processes, governance, complexity, documentation |
| Cross-functional individuals (can architect, code & do analysis) form a team that can deliver end-to-end business value, right from source to consumption | Co-dependent component teams; no one team can deliver an end-to-end solution |
| Clear focus on one business problem: solve it, then solve a 2nd business problem, solve it, then deduplicate (DRY) | No clear business focus, too many goals, lofty overly ambitious ideas, silver bullets, big hammers |
| Motivation (the WHY): satisfaction from solving problems & automation | Motivation (the WHY): deskilling & centralisation of power |
Silver Bullets & Big Hammers: Hive, Impala, Drill, Dremio, Presto, Delta, Athena, Kylo, Hudi, Ab Initio, etc
- Often built to demo/pitch to architects (that don't code) & non-technical/non-engineers
- Consequently have well-polished UIs but often lack quality under the hood
- Generally only handle the happy cases
- Tend to assume all use cases are the same; your use case will probably invalidate their assumptions
- The devil is in the details, and they obscure those details
- Generally make performance problems more complicated due to inherited and obscured complexity
- Often commercially motivated
- Few engineers/data scientists would recommend them, as they know what it's really like to build a Data Lake and know that most of these tools won't work
- Often aim at deskilling, literally claiming that engineers/data scientists are not necessary. They are necessary, but now you have to pay a vendor/consultancy for those skills
  - at a very high markup
  - with a lot of lost-in-translation issues and communication bottlenecks
  - with long delays in implementation
- Generally appeal to non-technical people that want centralisation and power; some tools literally refer to their users as "power users"
Note that there are exceptions, for example Uber’s Hudi seems to be built to solve real internal PB data problems, then later Open Sourced. There may be other exceptions.
Data Lake Principles
1. Immutability & Reproducibility - datasets should be immutable; any queries/jobs run on the Data Lake should be reproducible
2. A Dataset corresponds to a directory and all the files in that directory, not individual files - Big Data is too big to fit into single files. Avoid appending to a directory, as this is just like mutating it, thus violating 1.
3. An easy way to identify when new data has arrived - no scanning, no joining, and no complex event notification systems should be necessary. Simply partition by landed date and let consumers keep track of their own offsets (as in Kafka)
4. Schema on Read - Parquet headers plus directory structure form self-describing metadata (more next!)
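Principle 3 can be sketched in a few lines of plain Scala (the directory layout and names are hypothetical, and a real lake would list partitions from storage rather than hard-code them): each consumer's "offset" is simply the last landed-date partition it has processed.

```scala
// Sketch of Principle 3: datasets partitioned by landed date, with each
// consumer tracking its own offset -- like a Kafka consumer. Illustrative only.
object LandedDateOffsets {
  // In a real lake these would come from listing a path such as
  // s3://lake/events/ ; here we hard-code them for the sketch.
  val partitions: Seq[String] = Seq(
    "landed_date=2019-06-01",
    "landed_date=2019-06-02",
    "landed_date=2019-06-03"
  )

  // New data = every partition strictly after the consumer's offset.
  // ISO dates sort correctly as strings, so no parsing is needed.
  def newPartitions(lastProcessed: Option[String]): Seq[String] =
    lastProcessed match {
      case None         => partitions                    // first run: read everything
      case Some(offset) => partitions.filter(_ > offset) // lexicographic date order
    }
}
```

Because partitions are never appended to once landed, the same offset always yields the same answer, which is what makes queries over the lake reproducible (Principle 1).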
Metadata
- Schema-on-read - parquet header has the schema
- Add lineage fields to data at each stage of a pipeline, especially later stages
- Internal Open Source via Monorepo
- Code is unambiguous
- Invest in high quality code control - Stop here!
- Analogy:
- An enterprise investing large amounts in meta-data services is like a restaurant investing large
amounts in menus
- In the best restaurants chefs write the specials of the day on a blackboard
- In the best enterprises innovation and code is created every day
- etc
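The "add lineage fields at each stage" point can be sketched as follows (a minimal illustration with made-up stage and field names): every pipeline stage stamps the record with its name and run id, so a record's history can be read straight off the data, with no external metadata service.

```scala
// Hypothetical sketch of carrying lineage in the data itself.
object LineageDemo {
  case class Lineage(stage: String, runId: String)
  case class Record(payload: String, lineage: List[Lineage])

  // Each stage appends one lineage entry; the data stays immutable
  // because stamping returns a new Record.
  def stamp(stage: String, runId: String)(r: Record): Record =
    r.copy(lineage = r.lineage :+ Lineage(stage, runId))

  val raw     = Record("""{"user": 42}""", Nil)
  val landed  = stamp("land",  "run-001")(raw)
  val cleaned = stamp("clean", "run-007")(landed)
  // cleaned.lineage now records the full path: land -> clean
}
```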
Technology Choices - Analytics
| Requirement | Recommendation |
|---|---|
| SQL interface to Parquet & databases (via JDBC) | Apache Zeppelin |
| Spark, Scala, Java, Python, R, CLI, and more | Apache Zeppelin |
| Charts, graphs & visualisations (inc. JS, D3, etc) | Apache Zeppelin |
| Free and open source | Apache Zeppelin |
| Integrated into infra (EMR, HDInsight) out of the box - NoOps | Apache Zeppelin |
| Lightweight scheduling and job management | Apache Zeppelin |
| Basic source control & JSON exports | Apache Zeppelin |
| In-memory compute (via Spark) | Apache Zeppelin |
| Quickly implement dashboards & reports via WYSIWYG or JS | Apache Zeppelin |
| Active Directory & ACL integration | Apache Zeppelin |
Technology Choices - Software
| Requirement | Recommendation |
|---|---|
| Parquet format | Spark |
| Production-quality, stable Spark APIs | Scala/Java |
| Streaming architecture on Kafka | Scala/Java |
| Quick development & dev cycle | Statically typed languages |
| Production-quality software / low bug density | Statically typed languages |
| Huge market of low-skilled cheap resource, where speed of delivery, software quality and data quality are not important (please read The Mythical Man-Month!) | Python |

https://insights.stackoverflow.com/survey/2019#top-paying-technologies
Agile data lake? An oxymoron?
Conclusion
Yes, you can build a Data Lake in an Agile way.
● Code first
● Upskill, not deskill
● Do not trust all vendor marketing literature & blogs
● Avoid most big tools, especially proprietary ones
Part 2 - Technology Deep Dive
Brief History of Spark
| Version & Date | Notes |
|---|---|
| Up to 0.5, 2010-2012 | Spark created as a better Hadoop MapReduce: awesome functional typed Scala API (RDD); in-memory caching; broadcast variables; Mesos support |
| 0.6, 14/10/2012 | Java API (anticipating Java 8 & Scala 2.12 interop!) |
| 0.7, 27/02/2013 | Python API: PySpark; Spark "Streaming" alpha |
| 0.8, 25/09/2013 | Apache Incubator in June 2013; September 2013, Databricks raises $13.9 million; MLlib (nice idea, poor API), see https://github.com/samthebest/sceval/blob/master/README.md |
| 0.9 - 1.6, 02/02/2014 - 04/01/2016 | The hype years! February 2014, Spark becomes a Top-Level Apache Project; SparkSQL, and more shiny (GraphX, DataFrame API, SparkR); covariant RDDs requested: https://issues.apache.org/jira/browse/SPARK-1296 |
Key: Good idea - technically motivated,
Not so good idea (probably commercially motivated?)
Brief History of Spark
| Version & Date | Notes |
|---|---|
| 2.0, 26/07/2016 | Datasets API (nice idea, poor design): typed, semi-declarative class-based API; improved serialisation; no way to inject custom serialisation. StructuredStreaming API: uses the same API as Datasets (so what are .cache, .filter, .mapPartitions supposed to do? How do we branch? How do we access a microbatch? How do we control microbatch sizes? etc) |
| 2.3, 28/02/2018 | StructuredStreaming trying to play catch-up with Kafka Streams, Akka Streams, etc |
| ???, 2500? | Increase parallelism without shuffling: https://issues.apache.org/jira/browse/SPARK-5997; num partitions no longer respects num files: https://issues.apache.org/jira/browse/SPARK-24425; multiple SparkContexts: https://issues.apache.org/jira/browse/SPARK-2243; closure cleaner bugs: https://issues.apache.org/jira/browse/SPARK-26534; Spores: https://docs.scala-lang.org/sips/spores.html; RDD covariance: https://issues.apache.org/jira/browse/SPARK-1296; Frameless to become native? https://github.com/typelevel/frameless; Datasets to offer injectable custom serialisation based on https://typelevel.org/frameless/Injection.html |
Spark APIs - RDD
- Motivated by true Open Source & Unix philosophy - solve a specific real
problem well, simply and flexibly
- Oldest most stable API, has very few bugs
- Boils down to two functions that neatly correspond to MapReduce paradigm:
- `mapPartitions`
- `combineByKey`
- Simple flexible API design
- Can customise serialisation (using `mapPartitions` and byte arrays)
- Can customise reading and writing (e.g. `binaryFiles`)
- Fairly functional, but does mutate state (e.g. `.cache()`)
- Advised API for experienced developers / data engineers, especially in the Big
Data space
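The claim that the RDD API "boils down to" `mapPartitions` and `combineByKey` can be illustrated with a small Spark-free sketch (names and the hypothetical `CombineByKeySketch` object are illustrative, not Spark's implementation): a map-side combine within each partition, then a shuffle-style merge of partial combiners across partitions.

```scala
// Pure-Scala simulation (no Spark dependency) of combineByKey semantics,
// showing how it corresponds to the MapReduce paradigm.
object CombineByKeySketch {
  def combineByKey[K, V, C](
      partitions: Seq[Seq[(K, V)]]       // the "RDD": data split into partitions
  )(createCombiner: V => C,              // first value for a key in a partition
    mergeValue: (C, V) => C,             // map-side combine within a partition
    mergeCombiners: (C, C) => C          // reduce-side merge across partitions
  ): Map[K, C] = {
    // Map side: combine within each partition independently (cf. mapPartitions).
    val perPartition: Seq[Map[K, C]] = partitions.map { part =>
      part.foldLeft(Map.empty[K, C]) { case (acc, (k, v)) =>
        acc.updated(k, acc.get(k).fold(createCombiner(v))(mergeValue(_, v)))
      }
    }
    // Reduce side: merge the partial combiners across partitions (cf. shuffle).
    perPartition.flatten.groupBy(_._1).map { case (k, kvs) =>
      k -> kvs.map(_._2).reduce(mergeCombiners)
    }
  }
}
```

For example, a word count over two partitions is just `combineByKey(...)((v: Int) => v, _ + _, _ + _)`; only the already-combined per-partition maps need to cross the "shuffle" boundary, which is exactly why the real `combineByKey` is efficient.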
Spark APIs - Dataset / Dataframe
- Motivated by increasing market size for vendors by targeting non-developers,
e.g. Analysts, Data Scientists and Architects
- Very buggy, e.g. (bugs I found in the last couple of months)
- Reading: nulling entire rows, parsing timestamps incorrectly, not handling escaping properly, etc
- Non-optional reference types are treated as nullable
- Closure cleaner seems more buggy with Datasets (unnecessarily serialises)
- API Design inflexible
- cannot inject custom serialisation
- No functional `combineByKey` API counterpart, have to instantiate an Aggregator
- Declarative API breaks MapReduce semantics E.g.
- A call to `groupBy` may not actually cause a groupby operation
- Advised API for those new to Big Data and generally trying to solve
"little/middle data" problems (i.e. where extensive optimisations are not necessary),
and where data quality and application stability are less important (e.g. POCs).
Spark APIs - SparkSQL
- Buggy, unstable, unpredictable
- SQL optimiser is quite immature
- MapReduce is a functional paradigm while SQL is declarative, consequently these
two don’t get along very well
- All the usual problems with SQL: hard to test, no compiler, not Turing-complete, etc
- Advised API for interactive analytical use only - never use for production
applications!
Frameless - Awesome!
- All of the benefits of Datasets without string literals
scala> fds.filter(fds('i) === 10).select(fds('x))
<console>:24: error: No column Symbol with shapeless.tag.Tagged[String("x")] of type A in Foo
fds.filter(fds('i) === 10).select(fds('x))
^
- Custom serialisation https://typelevel.org/frameless/Injection.html
- Cats integration, e.g. can join RDDs using `|+|`
- Advised API for both experienced Big Data Engineers and people new to Big
Data
Alternatives to Spark
- Kafka: see Kafka Streams & Akka Streams
- Flink
Simran112433
 
Principles of information security Chapter 5.ppt
Principles of information security Chapter 5.pptPrinciples of information security Chapter 5.ppt
Principles of information security Chapter 5.ppt
EstherBaguma
 
183409-christina-rossetti.pdfdsfsdasggsag
183409-christina-rossetti.pdfdsfsdasggsag183409-christina-rossetti.pdfdsfsdasggsag
183409-christina-rossetti.pdfdsfsdasggsag
fardin123rahman07
 
Thingyan is now a global treasure! See how people around the world are search...
Thingyan is now a global treasure! See how people around the world are search...Thingyan is now a global treasure! See how people around the world are search...
Thingyan is now a global treasure! See how people around the world are search...
Pixellion
 
CTS EXCEPTIONSPrediction of Aluminium wire rod physical properties through AI...
CTS EXCEPTIONSPrediction of Aluminium wire rod physical properties through AI...CTS EXCEPTIONSPrediction of Aluminium wire rod physical properties through AI...
CTS EXCEPTIONSPrediction of Aluminium wire rod physical properties through AI...
ThanushsaranS
 
computer organization and assembly language.docx
computer organization and assembly language.docxcomputer organization and assembly language.docx
computer organization and assembly language.docx
alisoftwareengineer1
 
AI Competitor Analysis: How to Monitor and Outperform Your Competitors
AI Competitor Analysis: How to Monitor and Outperform Your CompetitorsAI Competitor Analysis: How to Monitor and Outperform Your Competitors
AI Competitor Analysis: How to Monitor and Outperform Your Competitors
Contify
 
Adobe Analytics NOAM Central User Group April 2025 Agent AI: Uncovering the S...
Adobe Analytics NOAM Central User Group April 2025 Agent AI: Uncovering the S...Adobe Analytics NOAM Central User Group April 2025 Agent AI: Uncovering the S...
Adobe Analytics NOAM Central User Group April 2025 Agent AI: Uncovering the S...
gmuir1066
 
Classification_in_Machinee_Learning.pptx
Classification_in_Machinee_Learning.pptxClassification_in_Machinee_Learning.pptx
Classification_in_Machinee_Learning.pptx
wencyjorda88
 
Geometry maths presentation for begginers
Geometry maths presentation for begginersGeometry maths presentation for begginers
Geometry maths presentation for begginers
zrjacob283
 
IAS-slides2-ia-aaaaaaaaaaain-business.pdf
IAS-slides2-ia-aaaaaaaaaaain-business.pdfIAS-slides2-ia-aaaaaaaaaaain-business.pdf
IAS-slides2-ia-aaaaaaaaaaain-business.pdf
mcgardenlevi9
 
FPET_Implementation_2_MA to 360 Engage Direct.pptx
FPET_Implementation_2_MA to 360 Engage Direct.pptxFPET_Implementation_2_MA to 360 Engage Direct.pptx
FPET_Implementation_2_MA to 360 Engage Direct.pptx
ssuser4ef83d
 
Conic Sectionfaggavahabaayhahahahahs.pptx
Conic Sectionfaggavahabaayhahahahahs.pptxConic Sectionfaggavahabaayhahahahahs.pptx
Conic Sectionfaggavahabaayhahahahahs.pptx
taiwanesechetan
 
LLM finetuning for multiple choice google bert
LLM finetuning for multiple choice google bertLLM finetuning for multiple choice google bert
LLM finetuning for multiple choice google bert
ChadapornK
 
Molecular methods diagnostic and monitoring of infection - Repaired.pptx
Molecular methods diagnostic and monitoring of infection  -  Repaired.pptxMolecular methods diagnostic and monitoring of infection  -  Repaired.pptx
Molecular methods diagnostic and monitoring of infection - Repaired.pptx
7tzn7x5kky
 
Just-In-Timeasdfffffffghhhhhhhhhhj Systems.ppt
Just-In-Timeasdfffffffghhhhhhhhhhj Systems.pptJust-In-Timeasdfffffffghhhhhhhhhhj Systems.ppt
Just-In-Timeasdfffffffghhhhhhhhhhj Systems.ppt
ssuser5f8f49
 
Cleaned_Lecture 6666666_Simulation_I.pdf
Cleaned_Lecture 6666666_Simulation_I.pdfCleaned_Lecture 6666666_Simulation_I.pdf
Cleaned_Lecture 6666666_Simulation_I.pdf
alcinialbob1234
 
Calories_Prediction_using_Linear_Regression.pptx
Calories_Prediction_using_Linear_Regression.pptxCalories_Prediction_using_Linear_Regression.pptx
Calories_Prediction_using_Linear_Regression.pptx
TijiLMAHESHWARI
 
Ad

Agile data lake? An oxymoron?

  • 2. Agenda 1. Part 1 - Data Lake Overview 2. Part 2 - Technology Deep Dive Please interrupt with questions/comments so I know which slides to focus on.
  • 3. Part 1 - Data Lake Overview
  • 4. Data Lake - Definition
  • 6. Data Lake - Definition - Martin Kleppmann
    "Hadoop opened up the possibility of indiscriminately dumping data into HDFS, and only later figuring out how to process it further. By contrast, MPP databases typically require careful up-front modeling of the data and query patterns before importing the data into the database's proprietary storage format.
    From a purist's point of view, it may seem that this careful modeling and import is desirable, because it means users of the database have better-quality data to work with. However, in practice, it appears that simply making data available quickly - even if it is in a quirky, difficult-to-use, raw format [/schema] - is often more valuable than trying to decide on the ideal data model up front.
    The idea is similar to a data warehouse: simply bringing data from various parts of a large organization together in one place is valuable, because it enables joins across datasets that were previously disparate. The careful schema design required by an MPP database slows down that centralised data collection; collecting data in its raw form, and worrying about schema design later, allows the data collection to be speeded up (a concept sometimes known as a "data lake" or "enterprise data hub").
    ... the interpretation of the data becomes the consumer's problem (the schema-on-read approach). ... There may not even be one ideal data model, but rather different views onto the data that are suitable for different purposes. Simply dumping data in its raw form allows for several such transformations. This approach has been dubbed the sushi principle: "raw data is better".
    Pages 415-416, Martin Kleppmann - Designing Data-Intensive Applications
  • 7. Data Lake - Definition - Martin Fowler
    > The idea [Data Lake] is to have a single store for all of the raw data that anyone in an organization might need to analyze. ...
    > But there is a vital distinction between the data lake and the data warehouse. The data lake stores raw data, in whatever form the data source provides. There are no assumptions about the schema of the data; each data source can use whatever schema it likes. It's up to the consumers of that data to make sense of it for their own purposes.
    > data put into the lake is immutable
    > The data lake is "schemaless" [or Schema-on-Read]
    > storage is oriented around the notion of a large schemaless structure ... HDFS
    https://martinfowler.com/bliki/DataLake.html ← MUST READ!
  • 10. Over this: pretty GUI-based tools, deskilling, centralisation and processes
  • 11. Trends (Sample = 11 Data Lakes)
    Good:
    - Cost < £1m, business value in weeks/months
    - Schema on Read; documentation as code - Internal Open Source
    - Cloud, PaaS (e.g. EMR, Dataproc); S3 for long-term storage
    - Scala/Java apps; Jenkins, CircleCI, etc. with Bash for deployment and lightweight scheduling
    - High ratio (80%+) of committed developers/engineers who write code; small, highly skilled in-house teams
    - Flat structure, cross-functional teams; Agile
    Bad:
    - Cost many millions, years before business value
    - Schema on Write; metastores, data dictionaries, Confluence
    - On-prem Cloudera/Hortonworks; HDFS for long-term storage
    - Hive/Pig scripts; manual releases; heavyweight scheduling (Oozie, Airflow, Control-M, Luigi, etc.)
    - High ratio of involved people who do not commit code; large, low-skilled offshore teams
    - Hierarchical, authoritarian; Waterfall
  • 12. Trends (Sample = 11 Data Lakes)
    Success:
    - XP, KISS, YAGNI
    - Cross-functional individuals (who can architect, code & do analysis) form a team that can deliver end-to-end business value right from source to consumption
    - Clear focus on one business problem: solve it, then solve the 2nd business problem, solve it, then deduplicate (DRY)
    - Motivation - the WHY: satisfaction from solving problems & automation
    Failure:
    - BUFD, tools, processes, governance, complexity, documentation
    - Co-dependent component teams; no one team can deliver an end-to-end solution
    - No clear business focus, too many goals, lofty overly ambitious ideas, silver bullets, big hammers
    - Motivation - the WHY: deskilling & centralisation of power
  • 13. Hive, Impala, Drill, Dremio, Presto, Delta, Athena, Kylo, Hudi, Ab Initio, etc
  • 14. Silver Bullets & Big Hammers
    - Often built to demo/pitch to Architects (who don't code) and non-technical/non-engineers
    - Consequently have well-polished UIs but often lack quality under the hood
    - Generally only handle happy cases
    - Tend to assume all use cases are the same; your use case will probably invalidate their assumptions
    - The devil is in the details, and they obscure those details
    - Generally make performance problems more complicated due to inherited and obscured complexity
    - Often commercially motivated
    - Few engineers/data scientists would recommend them, as they know what it's really like to build a Data Lake and know that most of these tools won't work
    - Often aim at deskilling, literally claiming that Engineers/Data Scientists are not necessary. They are necessary, but now you have to pay a vendor/consultancy for those skills - at very high markup, with a lot of lost-in-translation issues, communication bottlenecks and long delays in implementation
    - Generally appeal to non-technical people who want centralisation and power; some tools literally refer to users as "power users"
    Note that there are exceptions; for example, Uber's Hudi seems to have been built to solve real internal PB-scale data problems, then later open sourced. There may be other exceptions.
  • 15. Data Lake Principles
    1. Immutability & Reproducibility - datasets should be immutable, and any queries/jobs run on the Data Lake should be reproducible
    2. A Dataset corresponds to a directory and all the files in that directory, not individual files - Big Data is too big to fit into single files. Avoid appending to a directory, as this is just like mutating it, thus violating 1.
    3. An easy way to identify when new data has arrived - no scanning, joining, or complex event notification systems should be necessary. Simply partition by landed date and let consumers keep track of their own offsets (like in Kafka)
    4. Schema on Read - Parquet headers plus directory structure form self-describing metadata (more next!)
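    Principles 2 and 3 above can be sketched in a few lines of Scala. This is a hypothetical illustration, not code from the talk - the names (`LandedDatePartitions`, `partitionPath`, the `s3://lake` root) are made up for the example. A dataset is a directory of landed-date partitions, and each consumer finds new data simply by comparing the available partitions against the last date it processed.

```scala
// Sketch of principles 2 & 3: a dataset is a directory, each load lands in a
// new date partition, and consumers track their own offsets (like in Kafka).
// All names here are illustrative assumptions, not from the talk.
object LandedDatePartitions {
  // Each load lands in a fresh partition, so existing partitions are never
  // mutated or appended to (principles 1 & 2).
  def partitionPath(lakeRoot: String, dataset: String, landedDate: String): String =
    s"$lakeRoot/$dataset/landed_date=$landedDate"

  // A consumer identifies new data by comparing available partition dates with
  // its own offset - no scanning of file contents, joins, or event systems.
  // ISO dates (yyyy-MM-dd) compare correctly as plain strings.
  def newPartitions(available: Seq[String], consumerOffset: String): Seq[String] =
    available.filter(_ > consumerOffset).sorted
}
```

    For example, a consumer whose offset is "2019-06-01" sees `Seq("2019-06-02", "2019-06-03")` as new when those partitions have landed, and advances its offset once it has processed them.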
  • 16. Metadata
    - Schema-on-read - the Parquet header has the schema
    - Add lineage fields to data at each stage of a pipeline, especially later stages
    - Internal Open Source via monorepo - code is unambiguous
    - Invest in high-quality code control - stop here!
    - Analogy:
      - An enterprise investing large amounts in metadata services is like a restaurant investing large amounts in menus
      - In the best restaurants, chefs write the specials of the day on a blackboard
      - In the best enterprises, innovation and code are created every day
      - etc.
  • 17. Technology Choices - Analytics
    Requirement → Recommendation:
    - SQL interface to Parquet & databases (via JDBC) → Apache Zeppelin
    - Spark, Scala, Java, Python, R, CLI, and more → Apache Zeppelin
    - Charts, graphs & visualisations (inc. JS, D3, etc.) → Apache Zeppelin
    - Free and open source → Apache Zeppelin
    - Integrated into infra (EMR, HDInsight) out of the box - NoOps → Apache Zeppelin
    - Lightweight scheduling and job management → Apache Zeppelin
    - Basic source control & JSON exports → Apache Zeppelin
    - In-memory compute (via Spark) → Apache Zeppelin
    - Quickly implement dashboards & reports via WYSIWYG or JS → Apache Zeppelin
    - Active Directory & ACL integration → Apache Zeppelin
  • 18. Technology Choices - Software
    Requirement → Recommendation:
    - Parquet format → Spark
    - Production-quality, stable Spark APIs → Scala/Java
    - Streaming architecture on Kafka → Scala/Java
    - Quick development & dev cycle → statically typed languages
    - Production-quality software / low bug density → statically typed languages
    - Huge market of low-skilled cheap resource, where speed of delivery, software quality and data quality are not important (please read The Mythical Man-Month!) → Python
    https://insights.stackoverflow.com/survey/2019#top-paying-technologies
  • 21. Conclusion
    Yes, you can build a Data Lake in an Agile way.
    ● Code first
    ● Upskill, not deskill
    ● Do not trust all vendor marketing literature & blogs
    ● Avoid most big tools, especially proprietary ones
  • 22. Part 2 - Technology Deep Dive
  • 23. Brief History of Spark
    Up to 0.5, 2010-2012: Spark created as a better Hadoop MapReduce
    - Awesome functional typed Scala API (RDD)
    - In-memory caching
    - Broadcast variables
    - Mesos support
    0.6, 14/10/2012:
    - Java API (anticipating Java 8 & Scala 2.12 interop!)
    0.7, 27/02/2013:
    - Python API: PySpark
    - Spark "Streaming" Alpha
    0.8, 25/09/2013:
    - Apache Incubator in June 2013
    - September 2013, Databricks raises $13.9 million
    - MLlib (nice idea, poor API), see https://github.com/samthebest/sceval/blob/master/README.md
    0.9-1.6, 02/02/2014 - 04/01/2016: Hype years!
    - February 2014, Spark becomes a Top-Level Apache Project
    - SparkSQL, and more Shiny (GraphX, DataFrame API, SparkR)
    - Covariant RDDs requested: https://issues.apache.org/jira/browse/SPARK-1296
    Key: good idea - technically motivated; not so good idea (probably commercially motivated?)
  • 24. Brief History of Spark (continued)
    2.0, 26/07/2016: Datasets API (nice idea, poor design)
    - Typed, semi-declarative class-based API
    - Improved serialisation
    - No way to inject custom serialisation
    StructuredStreaming API:
    - Uses the same API as Datasets (so what are .cache, .filter, .mapPartitions supposed to do? How do we branch? How do we access a microbatch? How do we control microbatch sizes? etc.)
    2.3, 28/02/2018: StructuredStreaming trying to play catch-up with Kafka Streams, Akka Streams, etc.
    ???, 2500?
    - Increase parallelism without shuffling: https://issues.apache.org/jira/browse/SPARK-5997
    - Num partitions no longer respects num files: https://issues.apache.org/jira/browse/SPARK-24425
    - Multiple SparkContexts: https://issues.apache.org/jira/browse/SPARK-2243
    - Closure cleaner bugs: https://issues.apache.org/jira/browse/SPARK-26534
    - Spores: https://docs.scala-lang.org/sips/spores.html
    - RDD covariance: https://issues.apache.org/jira/browse/SPARK-1296
    - Frameless to become native? https://github.com/typelevel/frameless
    - Datasets to offer injectable custom serialisation based on https://typelevel.org/frameless/Injection.html
    Key: good idea - technically motivated; not so good idea (probably commercially motivated?)
  • 25. Spark APIs - RDD
    - Motivated by true Open Source & Unix philosophy - solve a specific real problem well, simply and flexibly
    - Oldest, most stable API; has very few bugs
    - Boils down to two functions that neatly correspond to the MapReduce paradigm: `mapPartitions` and `combineByKey`
    - Simple, flexible API design
    - Can customise serialisation (using `mapPartitions` and byte arrays)
    - Can customise reading and writing (e.g. `binaryFiles`)
    - Fairly functional, but does mutate state (e.g. `.cache()`)
    - Advised API for experienced developers / data engineers, especially in the Big Data space
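    The claim that the RDD API boils down to `mapPartitions` and `combineByKey` can be illustrated without Spark at all. The sketch below is a hypothetical, dependency-free emulation of the `combineByKey` contract - createCombiner, mergeValue, mergeCombiners - over a handful of in-memory "partitions", showing how it corresponds to the map and reduce sides of MapReduce.

```scala
// Pure-Scala sketch (no Spark dependency) of the combineByKey contract.
// Spark's real combineByKey applies the same three functions, first within
// each partition (map side), then across partitions (the shuffle/reduce side).
object CombineByKeySketch {
  def combineByKey[K, V, C](partitions: Seq[Seq[(K, V)]])(
      createCombiner: V => C,          // start a combiner from the first value
      mergeValue: (C, V) => C,         // fold further values into a combiner
      mergeCombiners: (C, C) => C      // merge combiners from different partitions
  ): Map[K, C] = {
    // Map side: fold each partition's values into per-key combiners.
    val perPartition: Seq[Map[K, C]] = partitions.map { part =>
      part.foldLeft(Map.empty[K, C]) { case (acc, (k, v)) =>
        acc.updated(k, acc.get(k).fold(createCombiner(v))(mergeValue(_, v)))
      }
    }
    // Reduce side: merge the per-partition combiners (what the shuffle does).
    perPartition.foldLeft(Map.empty[K, C]) { (acc, m) =>
      m.foldLeft(acc) { case (a, (k, c)) =>
        a.updated(k, a.get(k).fold(c)(mergeCombiners(_, c)))
      }
    }
  }
}
```

    A per-key count, for instance, is `combineByKey(parts)((v: Int) => v, _ + _, _ + _)`; a `groupByKey`, a `reduceByKey` or an average are just different choices of the three functions.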
  • 26. Spark APIs - Dataset / Dataframe
    - Motivated by increasing the market size for vendors by targeting non-developers, e.g. Analysts, Data Scientists and Architects
    - Very buggy, e.g. (bugs I found in the last couple of months):
      - Reading: nulling entire rows, parsing timestamps incorrectly, not handling escaping properly, etc.
      - Non-optional reference types are treated as nullable
      - Closure cleaner seems more buggy with Datasets (unnecessarily serialises)
    - API design inflexible - cannot inject custom serialisation
    - No functional `combineByKey` API counterpart; have to instantiate an Aggregator
    - Declarative API breaks MapReduce semantics, e.g. a call to `groupBy` may not actually cause a group-by operation
    - Advised API for those new to Big Data, generally for "little/middle data" problems (i.e. where extensive optimisations are not necessary), and where data quality and application stability are less important (e.g. POCs)
  • 27. Spark APIs - SparkSQL
    - Buggy, unstable, unpredictable
    - SQL optimiser is quite immature
    - MapReduce is a functional paradigm while SQL is declarative; consequently the two don't get along very well
    - All the usual problems with SQL: hard to test, no compiler, not Turing complete, etc.
    - Advised API for interactive analytical use only - never use for production applications!
  • 28. Frameless - Awesome!
    - All of the benefits of Datasets without string literals:
      scala> fds.filter(fds('i) === 10).select(fds('x))
      <console>:24: error: No column Symbol with shapeless.tag.Tagged[String("x")] of type A in Foo
             fds.filter(fds('i) === 10).select(fds('x))
                                                   ^
    - Custom serialisation: https://typelevel.org/frameless/Injection.html
    - Cats integration, e.g. can join RDDs using `|+|`
    - Advised API for both experienced Big Data Engineers and people new to Big Data
  • 29. Alternatives to Spark
    - Kafka: see Kafka Streams & Akka Streams
    - Flink