QUICK GUIDE TO REFRESH SPARK SKILLS
1. limitations of spark
A. Small files problem, expensive (heavy memory usage), and only near-real-time
processing (micro-batching rather than true streaming).
2. Difference b/w map, flat-map
A. flatMap() : transforms a dataset of length N into another
dataset of length M, where M can be greater than, less than or equal to N.
Example - 1
data : ["aa bb,cc", "k,m", "dd"]
data.flatMap(line => line.split(",")) // M > N
[["aa bb","cc"],["k","m"],["dd"]] ->flat->
["aa bb","cc","k","m","dd"]
Example - 2
data.flatMap(line => if (line.contains(",")) Some(line) else None) // M < N
[Some("aa bb,cc"), Some("k,m"), None] ->flat-> ["aa bb,cc","k,m"]
map() : transforms a dataset of length N into another
dataset of length N
Example
data.map(line => line.split(","))
[["aa bb","cc"],["k","m"],["dd"]]
3. Difference b/w Map, HashMap
A. A HashMap is a collection of key-value pairs which
are stored internally using a hash table data structure.
HashMap is an implementation of Map: as you can see
from their definitions, HashMap is a class and Map is a trait.
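A minimal Scala sketch of the distinction (the values are illustrative only):
import scala.collection.immutable.HashMap
// Map is the trait (interface); HashMap is one concrete implementation of it.
val m: Map[String, Int] = Map("a" -> 1, "b" -> 2)          // factory returns a Map
val hm: HashMap[String, Int] = HashMap("a" -> 1, "b" -> 2) // explicit HashMap
println(m("a"))      // 1
println(hm.get("b")) // Some(2)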
4. What is a case class
A. Case classes are special classes in Scala that provide a
boilerplate implementation of the constructor,
getters (accessors), equals and hashCode, and
implement Serializable. Case classes work really well to
encapsulate data as objects. Readers familiar with Java
can relate them to plain old Java objects (POJOs) or Java
beans.
Spark 1.6 (Scala 2.10): case classes support only 22 fields
Spark 2.0 (Scala 2.11+): any number of fields
Used to apply a schema over an RDD to convert it to a
DataFrame, and to convert a DataFrame to a Dataset.
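A hedged spark-shell-style sketch of how a case class drives the schema; the Person class and its data are hypothetical:
import org.apache.spark.sql.SparkSession
case class Person(name: String, age: Int)
val spark = SparkSession.builder().appName("case-class-demo").getOrCreate()
import spark.implicits._
// RDD -> DataFrame -> Dataset, with the schema taken from the case class
val rdd = spark.sparkContext.parallelize(Seq(Person("Ann", 30), Person("Bob", 25)))
val df  = rdd.toDF()      // schema inferred from the case class fields
val ds  = df.as[Person]   // DataFrame -> typed Dataset[Person]
ds.filter(_.age > 26).show()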
5. What is a trait
A. Traits in Scala are very similar to Java interfaces,
but traits can have methods with implementations. Traits
cannot have constructor parameters (in Scala 2).
Advantage - one big advantage of traits is that a class can
mix in multiple traits using the with clause, whereas it can
extend only one abstract class.
https://stackoverflow.com/questions/1229743/what-
are-the-pros-of-using-traits-over-abstract-classes
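A minimal sketch, assuming Scala 2, of a trait with a concrete method and a class mixing in two traits:
// A trait can carry implemented methods but takes no constructor parameters.
trait Greeter {
  def name: String                      // abstract member
  def greet(): String = s"Hello, $name" // concrete implementation
}
trait Logger {
  def log(msg: String): Unit = println(s"[LOG] $msg")
}
// A class can mix in multiple traits with extends ... with ...
class Service(val name: String) extends Greeter with Logger
val s = new Service("Spark")
s.log(s.greet()) // [LOG] Hello, Spark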
6. spark 1.6 vs spark 2.0
A. i. Single entry point with SparkSession instead of
SQLContext, HiveContext and SparkContext.
ii. The Dataset API and DataFrame API are unified. In Scala,
DataFrame = Dataset[Row], while Java API users must
replace DataFrame with Dataset<Row>.
iii. Dataset and DataFrame API unionAll has been
deprecated and replaced by union.
iv. Dataset and DataFrame API explode has been
deprecated; alternatively, use functions.explode() with
select or flatMap.
v. Dataset and DataFrame API registerTempTable has
been deprecated and replaced by
createOrReplaceTempView.
vi. From Spark 2.0, CREATE TABLE ... LOCATION is
equivalent to CREATE EXTERNAL TABLE ... LOCATION, in
order to prevent accidentally dropping existing data
in user-provided locations. That means a Hive table
created in Spark SQL with a user-specified location is
always a Hive external table.
vii. unified API for both batch and streaming
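A minimal Spark 2.x sketch showing the single entry point and a couple of the renamed APIs (the app name and view name are arbitrary):
import org.apache.spark.sql.SparkSession
// One SparkSession replaces SQLContext/HiveContext and wraps SparkContext.
val spark = SparkSession.builder()
  .appName("entry-point-demo")
  .enableHiveSupport()        // optional; replaces the old HiveContext
  .getOrCreate()
val sc = spark.sparkContext   // still available when RDDs are needed
val df = spark.range(5).toDF("id")
df.createOrReplaceTempView("ids")               // replaces registerTempTable
spark.sql("SELECT * FROM ids").union(df).show() // union replaces unionAll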
7. repartition vs coalesce
A. coalesce uses existing partitions to minimize the
amount of data that's shuffled. repartition creates new
partitions and does a full shuffle, so coalesce can result in
partitions with different amounts of data.
coalesce may run faster than repartition, but unequal-
sized partitions are generally slower to work with than
equal-sized partitions. You'll usually need to repartition
datasets after filtering a large data set; repartition is
often faster overall because Spark is built to
work with equal-sized partitions.
The repartition algorithm doesn't distribute data evenly
for very small data sets.
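A short sketch of the two calls, assuming an existing SparkSession spark and a large DataFrame df with a hypothetical status column:
import spark.implicits._
val filtered = df.filter($"status" === "ACTIVE")
val coalesced     = filtered.coalesce(10)     // merges existing partitions, no full shuffle
val repartitioned = filtered.repartition(10)  // full shuffle, roughly equal-sized partitions
println(coalesced.rdd.getNumPartitions)       // 10 (sizes may be uneven)
println(repartitioned.rdd.getNumPartitions)   // 10 (sizes roughly equal)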
8. Spark joins
A. Broadcast Hash Join - broadcasts the small table so no
shuffle is needed.
Shuffle Hash Join: used if the average size of a single partition
is small enough to build a hash table from it.
Sort Merge Join: used if the matching join keys are sortable.
Shuffle Hash Join is not part of 1.6, but is part of Spark 2.2
and 2.3.
Precedence order in 2.0: Broadcast, Shuffle Hash, then Sort
Merge.
https://sujithjay.com/spark-sql/2018/06/28/Shuffle-
Hash-and-Sort-Merge-Joins-in-Apache-Spark/
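A hedged sketch of forcing a broadcast hash join, assuming a large orders DataFrame and a small countries DataFrame that share a country_code column:
import org.apache.spark.sql.functions.broadcast
// Broadcasting the small side avoids shuffling the large one.
val joined = orders.join(broadcast(countries), Seq("country_code"))
// Without the hint, Spark decides based on spark.sql.autoBroadcastJoinThreshold
// (10 MB by default); explain() shows which strategy was chosen.
joined.explain() // look for BroadcastHashJoin vs SortMergeJoin in the plan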
9. Spark joins optimizations
A. There aren't strict rules for optimization:
i. Analyze the data sizes
ii. Analyze the keys
iii. Based on that, use a broadcast hash join or filters
Potential causes of poor join performance:
key skew in a Shuffle Hash Join;
uneven sharding and limited parallelism - one table is
small, one table is big, and there are few distinct join keys.
Special cases
Cartesian join (cross join) - what to do - may need to enable
cross joins in Spark (spark.sql.crossJoin.enabled)
One-to-many joins - what to do - use the Parquet format
Theta joins - what to do - create buckets on the keys
https://databricks.com/session/optimizing-apache-
spark-sql-joins
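One common mitigation for key skew is salting the join key. A minimal sketch, assuming DataFrames bigDf and smallDf joined on a hypothetical join_key column and an existing SparkSession spark:
import org.apache.spark.sql.functions._
val numSalts = 16
// Big side: attach a random salt to each row.
val bigSalted = bigDf.withColumn("salt", (rand() * numSalts).cast("long"))
// Small side: replicate every row once per salt value so each bucket can still match.
val smallSalted = smallDf.crossJoin(spark.range(numSalts).toDF("salt"))
// Join on the original key plus the salt; a hot key's work is now split 16 ways.
val joined = bigSalted.join(smallSalted, Seq("join_key", "salt"))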
10. RDD vs DataFrame vs DataSet
A. The underlying API for both DataFrame and Dataset is the RDD.
RDD - returns an Iterator[T]. The functionality underneath the
Iterator is opaque, so optimization is
difficult; even T, the data itself, is opaque, so an RDD doesn't know
which columns are important and which aren't. Only
serializers like Kryo and compression codecs like zlib are used for optimization.
DataFrame : a DataFrame is an abstraction which gives a
schema view of the data; a DataFrame is like a table in a
database. It offers custom memory management using
Tungsten, so data is stored in off-heap memory in binary
format. This saves a lot of memory space, and there is
no garbage-collection overhead involved. By knowing
the schema of the data in advance and storing it efficiently in
binary format, expensive Java serialization is also
avoided. Execution plans are optimized by the Catalyst
optimizer.
Dataset : an extension of the DataFrame API, the latest
abstraction, which tries to provide the best of both RDD and
DataFrame: compile-time safety like an RDD as well as the
performance-boosting features of a DataFrame. Where a Dataset
scores over a DataFrame is an additional feature it has:
Encoders - Encoders generate bytecode to interact with
off-heap data and provide on-demand access to
individual attributes without having to deserialize an
entire object.
https://www.linkedin.com/pulse/apache-spark-rdd-vs-
dataframe-dataset-chandan-prakash
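A spark-shell-style sketch of the three abstractions side by side; the Click case class and its data are purely illustrative:
case class Click(userId: String, count: Long)
import spark.implicits._
// RDD: opaque objects, no schema information for the optimizer
val rdd = spark.sparkContext.parallelize(Seq(Click("u1", 3L), Click("u2", 7L)))
// DataFrame: schema known, Catalyst/Tungsten optimizations, but rows are untyped
val df = rdd.toDF()
df.filter($"count" > 5).show()
// Dataset: same optimizations plus compile-time type safety via Encoders
val ds = df.as[Click]
ds.filter(_.count > 5).show()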
11. Drawbacks of spark streaming
A. Spark Streaming is built on the discretized stream (DStream)
abstraction, which is represented as a sequence of RDDs.
Micro-batch systems have an inherent problem with
backpressure. If processing of a batch takes longer in
downstream operations (because of computational
complexity or just a slow sink) than in the batching
operator (usually the source), the micro-batch will take
longer than configured. This leads either to more and
more batches queueing up, or to a growing micro-batch
size.
With micro-batching, Spark Streaming can achieve high
throughput and exactly-once guarantees, but gives up low
latency, flow control and the native streaming
programming model.
12. How checkpoints are useful
A. Checkpoints freeze the content of your data frames
before you do something else. They're essential to
keeping track of your data frames.
Spark has offered checkpoints on streaming since its
earlier versions (at least v1.2.0), but checkpoints on data
frames are a different beast.
Metadata Checkpointing – Metadata means data
about data. It refers to saving the metadata to fault-
tolerant storage like HDFS. Metadata includes
configurations, DStream operations, and incomplete
batches. Configuration refers to the configuration used
to create the streaming application; DStream operations are the
operations which define the streaming application. Incomplete
batches are batches which are in the queue but are not
yet complete.
Data Checkpointing – refers to saving the RDDs to
reliable storage, which is needed in some of the
stateful transformations: when an
upcoming RDD depends on the RDDs of previous
batches, the dependency chain keeps
growing with time. To avoid such an increase in
recovery time, the intermediate RDDs are periodically
checkpointed to some reliable storage, which
cuts down the dependency chain.
Eager Checkpoint
An eager checkpoint will cut the lineage from previous
data frames and will allow you to start “fresh” from this
point on. Concretely, Spark will dump your data frame to a
file under the directory set by setCheckpointDir() and will start a
fresh new data frame from it. You will also need to wait for
completion of the operation.
Non-Eager Checkpoint
On the other hand, a non-eager checkpoint will keep the
lineage from previous operations in the data frame.
There are several differences between checkpoint and
persist in Spark. Let's discuss them one by one:
Persist vs Checkpoint
When we persist an RDD with the DISK_ONLY storage level, the
RDD gets stored in a location from which subsequent uses
of that RDD do not have to recompute
the lineage.
After persist() is called, Spark still remembers the lineage of
the RDD even though it doesn't recompute it.
Secondly, after the job run is complete, the cache is
cleared and the files are destroyed.
Checkpointing
Checkpointing stores the RDD in HDFS and deletes the
lineage which created it.
On completion of the job run, unlike the cache, the checkpoint
file is not deleted.
Checkpointing an RDD results in double
computation: the operation first calls a cache before
accomplishing the actual job of computing, and then the RDD
is written to the checkpointing directory.
https://data-flair.training/blogs/apache-spark-
streaming-checkpoint/
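A minimal DataFrame-checkpoint sketch (Spark 2.1+), assuming an existing SparkSession spark and DataFrame df; the checkpoint directory is a placeholder:
spark.sparkContext.setCheckpointDir("hdfs:///tmp/checkpoints")
val eager    = df.checkpoint()              // materializes immediately and cuts the lineage
val nonEager = df.checkpoint(eager = false) // lineage kept; checkpointed lazily on the next action
// Contrast with persist: lineage is kept and the cached files go away when the job ends.
import org.apache.spark.storage.StorageLevel
val persisted = df.persist(StorageLevel.DISK_ONLY)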
13. Unit testing frameworks for Spark
A. The spark-fast-tests library – my preference;
also ScalaTest (FunSuite) and spark-testing-base's DataFrameSuiteBase.
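A hedged sketch of such a test using ScalaTest's FunSuite with spark-fast-tests' DataFrameComparer (package and class names depend on the library versions in use):
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.upper
import org.scalatest.FunSuite
import com.github.mrpowers.spark.fast.tests.DataFrameComparer
class UpperCaseSpec extends FunSuite with DataFrameComparer {
  val spark = SparkSession.builder().master("local[2]").appName("test").getOrCreate()
  import spark.implicits._
  test("upper-cases the name column") {
    val actual   = Seq("ann", "bob").toDF("name").select(upper($"name").as("name"))
    val expected = Seq("ANN", "BOB").toDF("name")
    assertSmallDataFrameEquality(actual, expected) // fast comparison of small DataFrames
  }
}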
14. Various challenges faced while coding spark app
A. Heap space issues or out of memory – resolved by
increasing executor memory.
Running indefinitely long – check the Spark UI for any
data-skew problems due to duplicates in join keys or
unequal partitions, and tune by repartitioning the data
for more parallelism.
Nulls in join conditions – avoid nulls in join keys or use
null-safe joins (the <=> operator / eqNullSafe).
Ambiguous column reference issues – occur when a
derived DF is used in a join with the source DF; for an equi-
join the condition can be expressed as Seq(join_columns),
otherwise rename the DF2 columns before joining.
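Two small sketches of the last two points, assuming DataFrames df1 and df2 that share a key column:
// Null-safe equality join: <=> in SQL, eqNullSafe in the DataFrame API.
val nullSafe = df1.join(df2, df1("key") <=> df2("key"))
// Equi-join expressed with Seq(...) keeps a single, unambiguous key column in the output.
val equi = df1.join(df2, Seq("key"))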
15. Spark streaming diff in 2.0 and 2.3
A. Structured Streaming in Apache Spark 2.0 decoupled
micro-batch processing from its high-level APIs for a
couple of reasons. First, it made the developer experience
with the APIs simpler: the APIs do not have to account
for micro-batches. Second, it allowed developers to
treat a stream as an infinite table to which they could
issue queries as they would against a static table.
However, to provide developers with different modes of
stream processing, Spark 2.3 adds a new millisecond-level
low-latency mode of streaming: continuous mode.
Structured Streaming in Spark 2.0 supported joins
between a streaming DataFrame/Dataset and a static
one.
Spark 2.3 supports stream-to-stream joins, both inner
and outer, for numerous real-time use cases. The
canonical use case for joining two streams is ad
monetization, e.g. joining ad impressions (views) with ad
clicks.
https://databricks.com/blog/2018/02/28/introducing-
apache-spark-2-3.html
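A hedged Spark 2.3 stream-to-stream join sketch in the spirit of the ad-monetization example; the Kafka broker, topic names and watermark intervals are placeholders:
import org.apache.spark.sql.functions.expr
val impressions = spark.readStream.format("kafka")
  .option("kafka.bootstrap.servers", "host:9092")
  .option("subscribe", "impressions").load()
  .selectExpr("CAST(key AS STRING) AS impAdId", "timestamp AS impressionTime")
  .withWatermark("impressionTime", "10 minutes")
val clicks = spark.readStream.format("kafka")
  .option("kafka.bootstrap.servers", "host:9092")
  .option("subscribe", "clicks").load()
  .selectExpr("CAST(key AS STRING) AS clickAdId", "timestamp AS clickTime")
  .withWatermark("clickTime", "20 minutes")
// Inner join with a time-range condition so Spark can bound the state it keeps.
val joined = impressions.join(
  clicks,
  expr("clickAdId = impAdId AND " +
       "clickTime BETWEEN impressionTime AND impressionTime + interval 1 hour"))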