Spark Theory
1. Why is Spark processing faster than MapReduce jobs?
2. Spark vs MapReduce
3. Why was Spark developed?
4. What is Spark?
5. What is PySpark?
6. What are the characteristics of PySpark?
7. Features of Spark; advantages & disadvantages of PySpark?
8. What is the Spark driver?
9. PySpark architecture?
10. PySpark modules & packages
11. Spark components?
12. What is SparkContext?
13. What is SparkSession?
14. SparkContext vs SparkSession
15. repartition() vs coalesce()?
16. Difference between cache and persist?
17. What is unpersist?
18. What is the difference between a broadcast variable and an accumulator variable?
19. What is shuffling in Spark?
20. Difference between groupByKey() vs reduceByKey() vs aggregateByKey() vs sortBy() vs sortByKey()
21. What is an RDD?
22. How to create an RDD?
23. Types of RDD?
24. When to use RDDs?
25. What are RDD operations: transformations and actions?
26. map() vs flatMap() vs filter()?
27. collect() vs collectAsList() vs select()
28. Why is a DataFrame faster than an RDD?
29. RDDs vs DataFrames vs Datasets?
30. Pivot and unpivot a Spark DataFrame?
31. What is a Spark schema?
32. groupBy() clause?
33. What is a Spark SQL DataFrame?
34. Why DataFrame?
35. Is PySpark faster than pandas?
36. What is a DAG, a lineage graph, and RDD lineage?
37. What is a paired RDD?
38. What is skewness?
39. How to mitigate skewed data?
40. Optimization techniques in Spark
41. How to read a CSV file using delimiter ','?
42. What are star schema & snowflake schema? Differentiate star & snowflake.
43. What is data skewness?
44. What is the Catalyst optimizer?
45. Explain serialization and deserialization?
46. What are PySpark serializers?
47. Salting techniques?
48. Explain mapPartitions() in Spark?
49. How to track failed jobs in Spark?
50. What is a broadcast join?
51. Deployment modes: cluster mode and client mode
52. Spark-submit command?
53. ORC vs Parquet vs CSV vs JSON
54. Deal with bad data
55. Why do out-of-memory issues occur?
56. How to remove duplicate rows?
57. Create SparkContext & SparkSession
58. How to create an RDD
59. Create (read) a Spark DataFrame from CSV, TXT, JSON, XML
-----------------------------------------------------------------------------------
1. Why is Spark processing faster than MapReduce jobs?
⦁ Spark performs computation in memory (RAM) and keeps intermediate data there.
⦁ Hadoop MapReduce has to persist data back to disk after every Map or Reduce step.
2. Spark vs MapReduce
1. Spark: an open-source distributed system for handling big-data workloads.
⦁ It improves query-processing performance on varying data sizes by using efficient query execution and in-memory caching.
2. Hadoop MapReduce: a Java-based distributed computing programming model within the Hadoop framework. [HDFS: Hadoop Distributed File System; processing is done by Mapper and Reducer tasks.]
⦁ MapReduce can also be used for large data sets that don't fit in memory.
⦁ The Mapper processes the input data into intermediate key/value pairs, while the Reducer aggregates them and turns them into smaller result sets.
5. What is PySpark?
⦁ PySpark is a Spark library written in Python that lets you run Python applications using Apache Spark capabilities; with PySpark we can run applications in parallel on a distributed cluster (multiple nodes).
⦁ Python + Spark = PySpark
⦁ Spark itself is primarily written in Scala.
⦁ Later, due to industry adoption, its Python API (PySpark) was released, built on Py4J.
⦁ Py4J is a Java library that is integrated within PySpark and allows Python to dynamically interface with JVM objects; hence, to run PySpark you also need Java installed along with Python and Apache Spark.
9. PySpark Architecture?
⦁ Apache Spark is a unified analytics engine for large-scale data processing.
⦁ It is an in-memory computation engine that processes data in parallel.
⦁ Apache Spark works in a master-slave architecture where the master is called the "Driver" and the slaves are called "Workers".
⦁ The Spark architecture depends upon two abstractions:
1. RDD: a group of data items that can be stored in memory on worker nodes.
[RDD: Resilient: restores data on failure. Distributed: data is distributed among different nodes. Dataset: a group of data.]
2. Directed Acyclic Graph (DAG): a finite directed graph that performs a sequence of computations on the data.
1. Driver:
⦁ The driver runs the application's main() function, creates the SparkContext/SparkSession, and converts the user code into jobs and tasks.
2. Cluster Manager:
⦁ The role of the cluster manager is to allocate resources across applications.
⦁ Spark is capable of running on a large number of clusters.
⦁ There are various types of cluster managers, such as Hadoop YARN, Apache Mesos and the Standalone Scheduler.
⦁ The Standalone Scheduler is Spark's own cluster manager, which makes it possible to install Spark on an empty set of machines.
3. Worker Node:
⦁ A worker node is a slave node; its role is to run the application code in the cluster.
4. Executor:
⦁ An executor is a process launched for an application on a worker node.
⦁ It runs tasks and keeps data in memory or disk storage across them.
⦁ It reads and writes data to and from external sources.
⦁ Every application has its own executors.
5. Task:
⦁ A unit of work that is sent to one executor.
SparkSession.builder:
⦁ Returns a SparkSession.Builder object, which is a builder for SparkSession.
⦁ master(), appName() and getOrCreate() are methods of SparkSession.Builder.
master() – If you are running on a cluster, pass your master URL as the argument to master(); usually it is yarn or mesos, depending on your cluster setup.
Use local[x] when running in local mode. x should be an integer greater than 0; it represents how many partitions are created by default when using RDDs, DataFrames and Datasets. Ideally, x should be the number of CPU cores you have.
For a standalone cluster use spark://master:7077.
appName() – Sets a name for the Spark application, which is shown in the Spark web UI. If no application name is set, a random name is used.
getOrCreate() – Returns an existing SparkSession if one already exists, otherwise creates a new one.
# check UI : http://localhost:4040/jobs/
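Putting the three builder methods together, a minimal sketch (the master value and app name are illustrative; a fuller snippet appears near the end of these notes):
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .master("local[1]") \
    .appName("Spark") \
    .getOrCreate()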
SparkContext vs SparkSession:
⦁ SparkContext has been available since Spark 1.x; SparkSession was introduced in version 2.0.
⦁ SparkContext is the entry point to Spark programming with RDDs and for connecting to the Spark cluster; SparkSession became the entry point to start programming with DataFrames and Datasets.
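The two are also related in code: a SparkSession wraps a SparkContext, which remains accessible from the session (assuming an existing SparkSession named spark):
sc = spark.sparkContext   # the underlying SparkContext exposed by the SparkSession
print(sc.appName)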
sortBy(): used to sort RDD data by a computed value; it is a method available on RDDs and takes a lambda expression that picks the sort key from each element.
sortByKey(): used to sort a pair RDD (key/value pairs) by key, in ascending or descending order.
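A minimal sketch of both on a small pair RDD (the data is illustrative; assumes an existing SparkSession named spark):
rdd = spark.sparkContext.parallelize([("b", 2), ("a", 1), ("c", 3)])
print(rdd.sortBy(lambda x: x[1], ascending=False).collect())   # sort by value: [('c', 3), ('b', 2), ('a', 1)]
print(rdd.sortByKey().collect())                               # sort by key:   [('a', 1), ('b', 2), ('c', 3)]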
Benefits of RDDs:
1. In-Memory Processing: loads data from disk, processes it in memory and keeps the data in memory.
2. Immutability: once an RDD is created you cannot modify it; when you apply a transformation, PySpark creates a new RDD.
3. Fault Tolerance: if any RDD operation fails, Spark automatically recomputes the lost partitions from the lineage.
4. Lazy Evaluation: Spark does not start executing transformations until an ACTION is called (see the sketch below).
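A minimal sketch of lazy evaluation (assumes an existing SparkSession named spark):
rdd = spark.sparkContext.parallelize(range(10))
doubled = rdd.map(lambda x: x * 2)   # transformation only: nothing executes yet
print(doubled.count())               # action: triggers the actual computation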
Limitations:
1. No built-in input optimization engine.
2. The execution process does not start instantly (everything waits for an action).
3. RDDs can run short of storage memory when the data does not fit in RAM.
4. Run-time type safety is absent in RDDs.
34. Why DataFrame?
⦁ The DataFrame is one step ahead of the RDD.
⦁ There is no built-in optimization engine in RDDs, and RDDs cannot handle structured data well.
⦁ DataFrames provide memory management and an optimized execution plan.
⦁ DataFrames can process data of different sizes, from kilobytes on a single-node cluster to petabytes on a large cluster.
⦁ A PySpark DataFrame can be created from data sources like TXT, CSV, JSON, ORC, Avro, Parquet and XML formats by reading from HDFS, S3, DBFS, Azure Blob file systems, etc.
Lineage graph: all the dependencies between RDDs are logged in a graph, rather than the actual data; this graph is called the lineage graph.
RDD lineage: RDDs are immutable, so transformations always create a new RDD without updating the existing one; this chain of transformations forms the RDD lineage.
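The lineage of an RDD can be inspected with toDebugString(); a small sketch (assumes an existing SparkSession named spark):
rdd = spark.sparkContext.parallelize(range(10))
mapped = rdd.map(lambda x: x * 2).filter(lambda x: x > 5)
print(mapped.toDebugString().decode("utf-8"))   # prints the chain of parent RDDs (the lineage)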
38. What is skewness?
⦁ The state of partitions where data is unevenly distributed.
⦁ It is a common problem with big data after shuffling.
⦁ The key distribution is not uniform (highly skewed), causing some partitions to be very large and preventing Spark from processing the data evenly in parallel.
42. What is Star Schema & Snowflake Schema? Differentiate Star & Snowflake.
Snowflake schema:
A snowflake schema is an extension of a star schema that adds additional dimensions; the dimension tables are normalized, which splits the data into additional tables.
Star schema:
In a data warehouse, the centre of the star has one fact table and a number of associated dimension tables.
Star Schema vs Snowflake Schema:
⦁ A star schema contains fact tables and dimension tables; a snowflake schema contains fact tables, dimension tables and sub-dimension tables.
⦁ The star schema is a top-down model; the snowflake schema is a bottom-up model.
⦁ The star schema has high data redundancy; the snowflake schema has low data redundancy.
47. Salting Techniques?
⦁ The idea is to split larger (skewed) partitions into smaller ones using a "salt" (an extra column we create ourselves); a side effect is that already small partitions get divided into even smaller ones.
⦁ This strategy is mainly about guaranteeing that all tasks finish (avoiding OOM errors), NOT about giving every task a uniform duration.
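A rough sketch of salting a skewed join; large_df, small_df and the column key are hypothetical names, and num_salts is a tuning choice:
from pyspark.sql import functions as F

num_salts = 10   # illustrative number of salt buckets
# big side: append a random salt to the skewed key
large_salted = large_df.withColumn("salt", (F.rand() * num_salts).cast("int")) \
    .withColumn("salted_key", F.concat_ws("_", F.col("key").cast("string"), F.col("salt").cast("string")))
# small side: replicate every row once per salt value so each salted key finds a match
salts = spark.range(num_salts).withColumnRenamed("id", "salt")
small_salted = small_df.crossJoin(salts) \
    .withColumn("salted_key", F.concat_ws("_", F.col("key").cast("string"), F.col("salt").cast("string")))
joined = large_salted.join(small_salted, "salted_key")   # the salted key is now far less skewed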
48. Explain mapPartitions() in Spark?
⦁ It works like map(); the difference is that mapPartitions() lets you do heavy initialization (for example, a database connection) once per partition instead of on every row. This helps job performance when you are dealing with heavyweight initialization on larger datasets.
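A minimal sketch of the pattern (the column access and per-row work are illustrative; the heavy setup is only a comment):
def process_partition(rows):
    # heavy, once-per-partition initialization would go here (e.g. opening a DB connection)
    for row in rows:
        yield row[0], row[1] * 2   # illustrative per-row work reusing the initialized resource

result = df.rdd.mapPartitions(process_partition).collect()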
51. Deployment Mode: Cluster Mode vs Client Mode
⦁ In cluster mode, the driver and the executors both run inside the cluster; in client mode, the driver runs outside the cluster while the executors run inside the cluster.
⦁ In cluster mode, the Spark driver runs inside an application master process managed by YARN on the cluster, and the client can go away after initiating the application; in client mode, only the driver runs locally and all tasks run on the cluster worker nodes.
⦁ Cluster mode is the approach used in production use cases; client mode is mainly used for interactive and debugging purposes.
spark-submit --deploy-mode cluster --driver-memory xxxx ........
spark-submit --deploy-mode client --driver-memory xxxx ......
./bin/spark-submit \
--master <master-url> \
--deploy-mode <deploy-mode> \
python_file_code.py
54. Deal with bad data:
⦁ To separate bad and good records, add an extra _corrupt_record column to the schema (using StructType/StructField) and read in PERMISSIVE mode; df.show() then displays the whole data, bad and good records together.
df.filter("_corrupt_record is null").show()       # show only valid data
df.filter("_corrupt_record is not null").show()   # show only bad data
df.drop("_corrupt_record")                        # drop the corrupt-record column
55. Why do out-of-memory issues occur?
⦁ A common cause is pulling too much data back to the driver (for example with collect()); spark.driver.maxResultSize caps the total size of serialized results returned to the driver, and the job fails once it is exceeded.
56. How to Remove Duplicate Rows?
Duplicate rows can be removed (dropped) from a Spark SQL DataFrame using the distinct() and dropDuplicates() functions:
distinct() removes rows that have the same values in all columns.
dropDuplicates() removes rows that have the same values in selected columns.
DF = dataframe.distinct()
print("Distinct count: ",DF.count())
DF.show()
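A matching sketch for dropDuplicates() on selected columns (the column names are illustrative):
DF2 = dataframe.dropDuplicates(["department", "salary"])
print("Distinct count of department & salary: ", DF2.count())
DF2.show()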
-----------------------------------------------------------------------------------
Create SparkContext:
from pyspark import SparkContext
sc = SparkContext("local", "Spark")
print(sc.appName)

# stop the SparkContext
sc.stop()

from pyspark import SparkConf, SparkContext
conf = SparkConf()
conf.setMaster("local").setAppName("Spark")
sc = SparkContext.getOrCreate(conf)
print(sc.appName)

Create SparkSession from builder:
from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
from pyspark.sql import Window
from pyspark.sql.types import *

import pyspark
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[1]") \
    .appName("Spark") \
    .getOrCreate()
⦁ How to Create DF :
Spark Create DataFrame from RDD, using toDF():
df = rdd.toDF()
Txtdf3 = spark.read.text(r"/path/filename.txt")
Txtdf3.show()
JSONdf4 = spark.read.json(r"/path/filename.json")
JSONdf4.show()
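A matching sketch for CSV, which also covers reading with the ',' delimiter (the path and header option are illustrative):
CSVdf5 = spark.read.option("header", True) \
    .option("delimiter", ",") \
    .csv(r"/path/filename.csv")
CSVdf5.show()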
withColumn(): a transformation function of DataFrame which is used to change a column's value, convert the datatype of an existing column, create a new column, and more.
1. Change DataType using PySpark withColumn(): use the cast() function along with withColumn().
df.withColumn("salary",col("salary").cast("Integer")).show()
3. Update Column Based on Condition : updates gender column with value Male for M, Female for F
from pyspark.sql.functions import when
df3 = df.withColumn("gender", when(df.gender == "M","Male") \
.when(df.gender == "F","Female") \
.otherwise(df.gender))
df3.show()
4.Update DataFrame Column Data Type : updates salary column to String type.
df4=df.withColumn("salary",df.salary.cast("String"))
df4.printSchema()
root
|-- firstname: string (nullable = true)
|-- lastname: string (nullable = true)
|-- gender: string (nullable = true)
|-- salary: string (nullable = true)
df.filter("state is NULL").show()
df.filter(col("state").isNull()).show()
2. Filter Rows with NULL on Multiple Columns
df.filter("state IS NULL AND gender IS NULL").show()
3. Filter Rows with NOT NULL Values
df.filter(df.state.isNotNull()).show()
df.filter(col("state").isNotNull()).show()
print(df.filter(col("FIRST_NAME").isNotNull()).count())
df.filter(! col("name").startsWith("James")).show()
df.filter( col("name").startsWith("James") === false).show()
+---+---------------+
| id| name|
+---+---------------+
| 2| Michael Rose|
| 3|Robert Williams|
| 4| Rames Rose|
| 5| Rames rose|
+---+---------------+
df.filter(! col("name").endsWith("Rose")).show()
df.filter(col("name").endsWith("Rose")==false).show()
+---+---------------+
| id| name|
+---+---------------+
| 1| James Smith|
| 3|Robert Williams|
| 5| Rames rose|
+---+---------------+
df.drop("FIRST_NAME","LAST_NAME","EMAIL").printSchema()
# Custom transformation 1 :
from pyspark.sql.functions import upper
def to_upper_str_columns(DF):
    return DF.withColumn("name", upper(DF.name))
# Custom transformation 2 :
def reduce_price(df, reduceBy):
    return df.withColumn("newsalary", df.salary - reduceBy)
ex: empDF.transform(reduce_price, 1000).show()
# Custom transformation 3 :
def apply_discount(df):
    return df.withColumn("discounted_fee", \
        df.new_fee - (df.new_fee * df.discount) / 100)
ex: df.transform(apply_discount).show()
from pyspark.sql.functions import spark_partition_id

repartition:
df1 = df.repartition(4).withColumn("partition_id", spark_partition_id())
df.rdd.getNumPartitions()    # partitions before repartition
df1.rdd.getNumPartitions()   # partitions after repartition (4)

coalesce:
df2 = df1.rdd.coalesce(4, True).toDF().withColumn("partition_id", spark_partition_id())
df2.rdd.getNumPartitions()

df1.rdd.glom().collect()     # shows the elements grouped per partition