The document provides an introduction to Big Data and PySpark, covering its history, features, and setup. It explains the differences between RDDs and DataFrames, detailing DataFrame operations and functions. Additionally, it discusses SparkContext and SparkSession as entry points for Spark functionalities, highlighting their roles in managing data processing tasks.
PySpark
Introduction to Big Data and PySpark
• The Birth of Big Data (The Beginning)
• The Rise of Distributed Computing (The Strategy)
• Enter Apache Spark (The Hero)
• The Hero's Journey (History and Evolution)
• Features and Use Cases (The Hero's Powers)
• Spark vs. Hadoop (The Rivalry)
• Setting up PySpark

DataFrames
• Introduction to DataFrames
• Differences between RDD and DataFrame
• Creating DataFrames
  • From RDDs, files, and external sources (e.g., databases)
• DataFrame Operations
  • Selecting, filtering, and sorting data
  • Aggregations and groupBy operations
  • Joining DataFrames
  • Handling missing data
• DataFrame Functions
  • Built-in functions (e.g., col, lit, when, etc.)
  • User-defined functions (UDFs)
  • Window functions

Difference Between RDD and DataFrame in PySpark
What does Spark do?
• Simply put, Spark executes operations on distributed data, so the operations themselves must also be distributed. Some operations are simple, such as filtering out all items that do not satisfy a rule. Others are more complex, such as groupBy, which needs to move data around, and join, which needs to associate items from two or more datasets.
• Another important point is that input and output are stored in various formats; Spark has connectors to read and write them, which means serializing and deserializing the data. Although transparent, serialization is often the most expensive operation.
• Finally, Spark tries to keep data in memory for processing, but it will serialize and deserialize data locally on each worker when it does not fit in memory. Again, this happens transparently but can be costly.

Difference: RDD vs DataFrame
• RDD: The first API provided by Spark. Put simply, it is an unordered collection of Scala/Java objects distributed over a cluster. All operations executed on it are JVM methods (passed to map, flatMap, groupBy, ...) that must be serialized, sent to all workers, and applied to the JVM objects there. This is much like using a Scala Seq, but distributed. It is strongly typed, meaning "if it compiles, then it works" (if you don't cheat). However, many distribution issues can arise, especially when Spark does not know how to serialize or deserialize the JVM classes and methods involved.
• DataFrame: Introduced later, it is semantically very different from an RDD. The data are treated as tables, and SQL-style operations can be applied to them. It is not statically typed, so errors can surface at any point during execution. It has two main advantages: (1) many people are familiar with table/SQL semantics and operations, and (2) Spark does not need to deserialize an entire row to process one of its columns, provided the data format offers suitable column access. Many formats do, such as Parquet, the most commonly used file format.

Spark Context and Spark Session
• SparkContext: SparkContext is the traditional entry point to any Spark functionality. It represents the connection to a Spark cluster, is where the user configures the common properties for the entire application, and acts as the gateway to creating Resilient Distributed Datasets (RDDs). RDDs are the fundamental data structure in Spark, providing fault-tolerant, parallelized data processing. SparkContext is designed for low-level programming and fine-grained control over Spark operations. However, it requires explicit management, only one can be active in a Spark application, and it must be created before any RDDs or SQLContext.

Spark Session
• SparkSession: Introduced in Spark 2.0, SparkSession is a unified interface that combines Spark's various functionalities into a single entry point. It integrates SparkContext and provides a higher-level API for working with structured data through Spark SQL, streaming data with Spark Streaming, and machine learning with MLlib. It simplifies application development by automatically creating a SparkContext and providing a seamless experience across the different Spark modules. With SparkSession, developers can leverage Spark's capabilities without explicitly managing multiple contexts.
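A minimal sketch of the two entry points described above, assuming a local standalone run; the application names, the local[*] master URL, and the toy data are illustrative assumptions, not part of the slides.

```python
from pyspark import SparkConf, SparkContext
from pyspark.sql import SparkSession

# Low-level entry point: configure and create a SparkContext explicitly.
conf = SparkConf().setAppName("sc-demo").setMaster("local[*]")
sc = SparkContext(conf=conf)
rdd = sc.parallelize([1, 2, 3, 4])
print(rdd.sum())   # 10
sc.stop()          # only one SparkContext may be active per application

# Unified entry point (Spark 2.0+): SparkSession creates a SparkContext for us.
spark = (SparkSession.builder
         .appName("session-demo")
         .master("local[*]")
         .getOrCreate())
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])
df.show()
print(spark.sparkContext.defaultParallelism)  # the underlying SparkContext
spark.stop()
```

In practice, on Spark 2.0+ only the SparkSession part is needed, since it creates and exposes the SparkContext itself.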
RDD vs DataFrame: a worked example
• You have a sales dataset, and you need to calculate the total revenue per product. The same task can be expressed with either the RDD API or the DataFrame API; a sketch of both approaches follows below.
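A sketch of this example with both APIs, assuming a tiny in-memory dataset; the product names, prices, and column names are made up for illustration.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (SparkSession.builder
         .appName("rdd-vs-df")
         .master("local[*]")
         .getOrCreate())

sales = [("laptop", 1200.0), ("phone", 800.0), ("laptop", 1100.0), ("phone", 750.0)]

# RDD approach: a pair RDD of (product, revenue), reduced by key on the workers.
rdd = spark.sparkContext.parallelize(sales)
revenue_rdd = rdd.reduceByKey(lambda a, b: a + b)
print(revenue_rdd.collect())   # [('laptop', 2300.0), ('phone', 1550.0)] (order may vary)

# DataFrame approach: a declarative groupBy/agg that Spark can optimize.
df = spark.createDataFrame(sales, ["product", "revenue"])
revenue_df = df.groupBy("product").agg(F.sum("revenue").alias("total_revenue"))
revenue_df.show()

spark.stop()
```

The RDD version spells out how the data is combined (reduceByKey over JVM/Python objects), while the DataFrame version states what result is wanted and lets Spark plan the aggregation.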
PySpark Basics
• Creating DataFrames (sketched below)
  • From RDDs, files, and external sources (e.g., databases)
• DataFrame Operations (sketched below)
  • Selecting, filtering, and sorting data
  • Aggregations and groupBy operations
  • Joining DataFrames
  • Handling missing data
• DataFrame Functions (sketched below)
  • Built-in functions (e.g., col, lit, when, etc.)
  • User-defined functions (UDFs)
  • Window functions
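A sketch of the three creation paths listed above; the file paths, JDBC URL, table name, and credentials are placeholders, not real resources.

```python
from pyspark.sql import SparkSession, Row

spark = SparkSession.builder.appName("create-df").master("local[*]").getOrCreate()

# From an RDD of Row objects
rdd = spark.sparkContext.parallelize([Row(id=1, name="Alice"), Row(id=2, name="Bob")])
df_from_rdd = spark.createDataFrame(rdd)

# From files (schema is inferred for CSV/JSON; Parquet carries its own schema)
df_csv = spark.read.csv("data/sales.csv", header=True, inferSchema=True)
df_json = spark.read.json("data/events.json")
df_parquet = spark.read.parquet("data/sales.parquet")

# From an external database over JDBC (the JDBC driver must be on the classpath)
df_jdbc = (spark.read.format("jdbc")
           .option("url", "jdbc:postgresql://localhost:5432/shop")
           .option("dbtable", "public.sales")
           .option("user", "reader")
           .option("password", "secret")
           .load())
```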
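A sketch of the listed operations on a small, assumed pair of DataFrames; the column names and values are invented for illustration.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("df-ops").master("local[*]").getOrCreate()

sales = spark.createDataFrame(
    [("laptop", "EU", 1200.0), ("phone", "EU", None), ("laptop", "US", 1100.0)],
    ["product", "region", "revenue"],
)
products = spark.createDataFrame(
    [("laptop", "electronics"), ("phone", "electronics")],
    ["product", "category"],
)

# Selecting, filtering, and sorting
(sales.select("product", "revenue")
      .filter(F.col("revenue") > 1000)
      .orderBy(F.desc("revenue"))
      .show())

# Aggregations and groupBy
sales.groupBy("product").agg(F.sum("revenue").alias("total"), F.count("*").alias("n")).show()

# Joining DataFrames
sales.join(products, on="product", how="left").show()

# Handling missing data
sales.na.drop(subset=["revenue"]).show()   # drop rows with a null revenue
sales.na.fill({"revenue": 0.0}).show()     # or fill nulls with a default value
```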
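A sketch of the listed function categories (built-ins such as col, lit, and when; a UDF; a window function) on the same kind of assumed data.

```python
from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName("df-funcs").master("local[*]").getOrCreate()

sales = spark.createDataFrame(
    [("laptop", "EU", 1200.0), ("phone", "EU", 800.0), ("laptop", "US", 1100.0)],
    ["product", "region", "revenue"],
)

# Built-in functions: lit adds a constant column, when/col add a conditional column
flagged = (sales
           .withColumn("currency", F.lit("USD"))
           .withColumn("tier", F.when(F.col("revenue") >= 1000, "high").otherwise("low")))

# User-defined function (runs as plain Python, so it bypasses Spark's optimizer)
shout = F.udf(lambda s: s.upper(), StringType())
flagged = flagged.withColumn("product_uc", shout(F.col("product")))

# Window function: rank rows by revenue within each region
w = Window.partitionBy("region").orderBy(F.desc("revenue"))
flagged.withColumn("rank_in_region", F.rank().over(w)).show()
```

UDFs are best kept as a last resort: built-in functions and window functions stay inside Spark's optimized execution, while a Python UDF forces row-by-row serialization to the Python worker.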