
PySpark

Introduction to Big Data and PySpark
• The Birth of Big Data (The Beginning)
• The Rise of Distributed Computing (The Strategy)
• Enter Apache Spark (The Hero)
• The Hero's Journey (History and Evolution)
• Features and Use Cases (The Hero's Powers)
• Spark vs. Hadoop (The Rivalry)
• Setting up PySpark
DataFrames
• Introduction to DataFrames
• Differences between RDD and DataFrame.
• Creating DataFrames
• From RDDs, files, and external sources (e.g., databases).
• DataFrame Operations
• Selecting, filtering, and sorting data.
• Aggregations and groupBy operations.
• Joining DataFrames.
• Handling missing data.
• DataFrame Functions
• Built-in functions (e.g., col, lit, when, etc.).
• User-defined functions (UDFs).
• Window functions.
Difference Between RDD and DataFrame in PySpark

What does Spark do?


• Simply put, Spark executes operations on distributed data, so the operations themselves also need to be distributed. Some operations are simple, such as filtering out all items that don't satisfy some rule. Others are more complex, such as groupBy, which needs to move data around, and join, which needs to associate items from two or more datasets.
• Another important fact is that input and output are stored in various formats; Spark has connectors to read and write them, but that means serializing and deserializing the data. While transparent to the user, serialization is often the most expensive operation.
• Finally, Spark tries to keep data in memory for processing, but it will serialize and deserialize data locally on each worker when it doesn't fit in memory. Once again, this is done transparently, but it can be costly.
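A minimal sketch of the kinds of operations described above (a simple filter, a groupBy that shuffles data, and a join across two datasets). The session settings, dataset contents, and column names are illustrative assumptions, not part of the original slides:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Local session for illustration only; real cluster configuration is assumed elsewhere.
spark = SparkSession.builder.appName("what-spark-does").master("local[*]").getOrCreate()

# Hypothetical in-memory datasets standing in for external sources.
orders = spark.createDataFrame(
    [(1, "A", 10.0), (2, "B", 5.0), (3, "A", 7.5)],
    ["order_id", "product", "amount"],
)
products = spark.createDataFrame(
    [("A", "Widget"), ("B", "Gadget")],
    ["product", "name"],
)

# Simple operation: filter out items that don't satisfy a rule.
big_orders = orders.filter(F.col("amount") > 6.0)

# More complex operations: groupBy moves data around (a shuffle),
# and join associates items from two datasets.
revenue = orders.groupBy("product").agg(F.sum("amount").alias("revenue"))
revenue.join(products, on="product").show()
```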
Difference: RDD vs DataFrame
• RDD: It is the first API provided by Spark. To put it simply, it is an unordered sequence of Scala/Java objects distributed over a cluster. All operations executed on it are JVM methods (passed to map, flatMap, groupBy, ...) that need to be serialized, sent to all workers, and applied to the JVM objects there. This is pretty much the same as using a Scala Seq, but distributed. It is strongly typed, meaning that "if it compiles then it works" (as long as you don't cheat). However, many distribution issues can arise, especially if Spark does not know how to serialize or deserialize the JVM classes and methods.
• DataFrame: It came later and is semantically very different from an RDD. The data is treated as a table, and SQL-like operations can be applied to it. It is not statically typed, so errors can arise at any time during execution. However, it has two main advantages: (1) many people are used to table/SQL semantics and operations, and (2) Spark does not need to deserialize a whole row to process one of its columns, provided the data format offers suitable column access. Many formats do, such as Parquet, the most commonly used.
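A small sketch of the column-access point above. The dataset, column names, and output path are assumptions made for illustration; the idea is that with a columnar format such as Parquet, selecting a single column avoids deserializing the rest of each row:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("columnar-access").master("local[*]").getOrCreate()

# Hypothetical sales dataset written once to a columnar format.
sales = spark.createDataFrame(
    [("A", 10.0, "2024-01-01"), ("B", 5.0, "2024-01-02")],
    ["product", "amount", "date"],
)
sales.write.mode("overwrite").parquet("/tmp/sales_parquet")  # assumed path

# Because Parquet stores data by column, this query only needs to read
# the 'amount' column; the other columns are never deserialized.
spark.read.parquet("/tmp/sales_parquet").select("amount").show()
```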
SparkContext and SparkSession
• SparkContext:
• SparkContext is the traditional entry point to any Spark functionality. It represents the connection to a Spark cluster, is where the user configures the common properties for the entire application, and acts as the gateway to creating Resilient Distributed Datasets (RDDs). RDDs are the fundamental data structure in Spark, providing fault-tolerant and parallelized data processing. SparkContext is designed for low-level programming and fine-grained control over Spark operations. However, it requires explicit management, only one can be active in a Spark application, and it must be created before any RDDs or a SQLContext.
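A minimal sketch of the SparkContext workflow described above: configure the application, create the single context, and use it for low-level RDD processing. The app name and master setting are placeholder values:

```python
from pyspark import SparkConf, SparkContext

# Configure common properties for the whole application, then create
# the single SparkContext that represents the connection to the cluster.
conf = SparkConf().setAppName("sparkcontext-demo").setMaster("local[*]")
sc = SparkContext(conf=conf)

# Low-level, RDD-based processing through the context.
rdd = sc.parallelize([1, 2, 3, 4, 5])
print(rdd.map(lambda x: x * x).sum())  # 55.0

sc.stop()  # explicit management: the user must stop the context
```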
SparkSession
• SparkSession, introduced in Spark 2.0, is a unified interface that combines Spark’s various functionalities into a single entry point. SparkSession integrates SparkContext and provides a higher-level API for working with structured data through Spark SQL, streaming data with Spark Streaming, and performing machine learning tasks with MLlib. It simplifies application development by automatically creating a SparkContext and providing a seamless experience across different Spark modules. With SparkSession, developers can leverage Spark’s capabilities without explicitly managing multiple contexts.
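A short sketch of the unified entry point described above, using placeholder names and a small in-memory table. Building a SparkSession also creates the underlying SparkContext, so no separate context management is needed:

```python
from pyspark.sql import SparkSession

# A single entry point for the application.
spark = (
    SparkSession.builder
    .appName("sparksession-demo")
    .master("local[*]")
    .getOrCreate()
)

# Structured data through Spark SQL ...
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "label"])
df.createOrReplaceTempView("items")
spark.sql("SELECT COUNT(*) AS n FROM items").show()

# ... while the underlying SparkContext is still reachable for low-level APIs.
print(spark.sparkContext.appName)

spark.stop()
```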
RDD vs DataFrame
• RDD: You have a sales dataset, and you need to calculate the total revenue per product (see the RDD sketch after this list).
• DataFrame: You have a sales dataset, and you need to calculate the total revenue per product (see the DataFrame sketch after this list).
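The slide does not include the code itself, so the following is a minimal sketch of both approaches under assumed data and column names (product, amount):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("rdd-vs-dataframe").master("local[*]").getOrCreate()

# Hypothetical sales data: (product, amount) pairs.
sales = [("A", 10.0), ("B", 5.0), ("A", 7.5), ("B", 2.5)]

# RDD approach: key the records by product and reduce by key.
sales_rdd = spark.sparkContext.parallelize(sales)
revenue_rdd = sales_rdd.reduceByKey(lambda a, b: a + b)
print(revenue_rdd.collect())  # e.g. [('A', 17.5), ('B', 7.5)]

# DataFrame approach: table semantics with a groupBy/aggregate.
sales_df = spark.createDataFrame(sales, ["product", "amount"])
sales_df.groupBy("product").agg(F.sum("amount").alias("revenue")).show()
```

The RDD version requires the user to shape the data as key/value pairs and pick the right reduction, while the DataFrame version expresses the same intent declaratively and lets Spark optimize the aggregation.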
PySpark Basics
• Creating DataFrames
• From RDDs, files, and external sources (e.g., databases).
• DataFrame Operations
• Selecting, filtering, and sorting data.
• Aggregations and groupBy operations.
• Joining DataFrames.
• Handling missing data.
• DataFrame Functions (see the sketch after this list).
• Built-in functions (e.g., col, lit, when, etc.).
• User-defined functions (UDFs).
• Window functions.
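A compact sketch touching several of the items listed above: selecting, filtering and sorting, handling missing data, the built-in col/lit/when functions, a user-defined function, and a window function. The DataFrame contents and column names are assumptions for illustration:

```python
from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName("pyspark-basics").master("local[*]").getOrCreate()

# Hypothetical sales DataFrame with a missing value.
sales = spark.createDataFrame(
    [("A", "north", 10.0), ("B", "north", None), ("A", "south", 7.5)],
    ["product", "region", "amount"],
)

# Selecting, filtering, and sorting.
sales.select("product", "amount").filter(F.col("amount") > 5).orderBy("amount").show()

# Handling missing data: fill nulls in a numeric column.
clean = sales.fillna({"amount": 0.0})

# Built-in functions: col, lit, when.
labelled = clean.withColumn(
    "size", F.when(F.col("amount") >= 10, F.lit("big")).otherwise(F.lit("small"))
)

# A user-defined function (UDF).
shout = F.udf(lambda s: s.upper(), StringType())
labelled = labelled.withColumn("region_uc", shout(F.col("region")))

# A window function: rank products by amount within each region.
w = Window.partitionBy("region").orderBy(F.col("amount").desc())
labelled.withColumn("rank", F.rank().over(w)).show()
```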
