
Cleaning Data with Apache Spark in Python
In today's world of high-volume, high-velocity data, Apache Spark, an open-source big data processing framework, is a common choice because it allows parallel and distributed processing of data. Cleaning such data is an important step, and Apache Spark provides a variety of tools and methods for doing it. In this article, we will see how to clean data with Apache Spark in Python. The steps involved are as follows:
Loading the data into a Spark DataFrame: The SparkSession.read method allows us to read data from various sources such as CSV, JSON, Parquet, etc. (a short loading sketch follows the installation command below).
Handling missing values: The DataFrame.dropna and DataFrame.fillna methods allow us to drop rows with missing values or fill the missing values with a particular value, respectively.
Handling duplicates: Data often contains duplicate entries. The DataFrame.dropDuplicates method allows us to remove duplicates from the DataFrame.
Handling outliers: The DataFrame.filter method allows us to remove rows that contain outliers.
Converting data types: To convert the data type of a column, we can use the cast method on that column, typically together with DataFrame.withColumn.
Before going further and seeing how to clean data with pyspark, we must install the pyspark library by running the following command in the terminal:
pip install pyspark
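With pyspark installed, the loading step can be illustrated with a minimal sketch. The file names data.csv and data.json below are placeholders, not files used elsewhere in this tutorial:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("LoadData").getOrCreate()

# Read a CSV file, treating the first line as a header and inferring column types
df_csv = spark.read.csv("data.csv", header=True, inferSchema=True)  # hypothetical path

# Read a JSON file (one JSON object per line by default)
df_json = spark.read.json("data.json")  # hypothetical path

df_csv.show()
spark.stop()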
Handling Missing Values
Handling missing values in Apache Spark involves identifying and dealing with missing or incomplete data in a data set stored in a Spark DataFrame. There are several methods for handling missing values in Spark, including:
Dropping rows: This involves removing records that contain missing values from the DataFrame.
Imputing missing values: This involves replacing missing values with a calculated value such as the mean, median, or mode of the data in the column.
Filling missing values: This involves replacing missing values with a specific value, such as zero or a default value.
Interpolating missing values: This involves using mathematical methods such as linear interpolation or spline interpolation to estimate missing values.
The method used for handling missing values depends on the specific requirements and goals of the data analysis. It is important to handle missing values in a consistent and repeatable manner to ensure the integrity of the data and the accuracy of the results.
In Apache Spark, the methods provided by the pyspark.sql.DataFrame and pyspark.sql.DataFrameNaFunctions classes can be used to handle missing values. These methods include dropna, fillna, and replace.
Example
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("MissingData").getOrCreate()

# Create a sample data frame
data = [("John", 25, None), ("Jane", 30, 35.5), ("Jim", None, 40.0), ("Joan", 32, None)]
columns = ["Name", "Age", "Salary"]
df = spark.createDataFrame(data, columns)

# Display the original data frame
print("Original Data Frame:")
df.show()

# Replacing Missing Values with Mean
from pyspark.sql.functions import mean
mean_age = df.agg(mean(df["Age"])).first()[0]
mean_salary = df.agg(mean(df["Salary"])).first()[0]
df = df.fillna({"Age": mean_age, "Salary": mean_salary})

# Display the cleaned data frame
print("Cleaned Data Frame:")
df.show()

spark.stop()
Output
Original Data Frame:
+----+----+------+
|Name| Age|Salary|
+----+----+------+
|John|  25|  null|
|Jane|  30|  35.5|
| Jim|null|  40.0|
|Joan|  32|  null|
+----+----+------+

Cleaned Data Frame:
+----+---+------+
|Name|Age|Salary|
+----+---+------+
|John| 25| 37.75|
|Jane| 30|  35.5|
| Jim| 29|  40.0|
|Joan| 32| 37.75|
+----+---+------+
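If dropping incomplete rows is preferable to imputing them, DataFrame.dropna can be applied to the same sample DataFrame. This is a minimal sketch, assuming it runs before the fillna step and spark.stop() call above:

# Drop rows that contain a null in any column
df_no_nulls = df.dropna()

# Drop rows only when the Salary column is null
df_salary_known = df.dropna(subset=["Salary"])

# Alternatively, fill remaining nulls with a fixed default value
df_defaults = df.fillna({"Salary": 0.0})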
Handling Duplicates
Handling duplicates in Apache Spark involves identifying and dealing with duplicate records in a data set stored in a Spark DataFrame. There are several methods for handling duplicates in Spark, including:
Removing duplicates: This involves identifying and removing duplicate records from the DataFrame. The dropDuplicates function can be used to remove duplicate records in Spark.
Retaining duplicates: This involves retaining all instances of duplicate records in the DataFrame, usually by adding a unique identifier or index to each record.
Marking duplicates: This involves marking duplicate records in the DataFrame, but not removing them, so that they can be further analyzed or processed.
The method used for handling duplicates depends on the specific requirements and goals of the data analysis. It is important to handle duplicates in a consistent and repeatable manner to ensure the integrity of the data and the accuracy of the results.
In Apache Spark, the dropDuplicates function can be used to remove duplicate records from a DataFrame. Called without arguments, it removes rows that are identical across all columns; it can also take one or more column names as input, in which case it removes all records where the values in the specified columns are identical. The dropDuplicates function returns a new DataFrame with the duplicate records removed.
Example
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("DuplicateData").getOrCreate()

# Create a sample data frame
data = [("John", 25, 90.0), ("Jane", 30, 35.5), ("Jim", 20, 200.0),
        ("Joan", 32, 50.0), ("John", 25, 90.0), ("Jim", 20, 200.0)]
columns = ["Name", "Age", "Salary"]
df = spark.createDataFrame(data, columns)

# Display the original data frame
print("Original Data Frame:")
df.show()

# Remove duplicates
df = df.dropDuplicates()

# Display the cleaned data frame
print("Cleaned Data Frame:")
df.show()

spark.stop()
Output
Original Data Frame:
+----+---+------+
|Name|Age|Salary|
+----+---+------+
|John| 25|  90.0|
|Jane| 30|  35.5|
| Jim| 20| 200.0|
|Joan| 32|  50.0|
|John| 25|  90.0|
| Jim| 20| 200.0|
+----+---+------+

Cleaned Data Frame:
+----+---+------+
|Name|Age|Salary|
+----+---+------+
|Jane| 30|  35.5|
|John| 25|  90.0|
| Jim| 20| 200.0|
|Joan| 32|  50.0|
+----+---+------+
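The example above treats a row as a duplicate only when every column matches. To deduplicate on a subset of columns instead, the column names can be passed to dropDuplicates. A minimal sketch on the same sample DataFrame (before spark.stop() is called):

# Keep the first occurrence of each (Name, Age) combination, regardless of Salary
df_unique = df.dropDuplicates(["Name", "Age"])
df_unique.show()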
Handling Outliers
Handling outliers in Apache Spark refers to the process of identifying and either removing or transforming values in a dataset that are considered extreme or outside of the normal range of values. Outliers can have a significant impact on the results of statistical analyses, so it is often necessary to handle them in some way.
There are several common approaches to handling outliers in Apache Spark, including:
Removing records containing outliers: This involves filtering out records where the value of a particular column is outside of a specified range or falls outside of a certain number of standard deviations from the mean.
Replacing outliers with the mean or median: This involves replacing values that are considered outliers with the mean or median of the remaining values in the column.
Winsorizing outliers: This involves replacing values that are outliers with a specified percentile value, such as the 5th or 95th percentile.
Clipping outliers: This involves replacing values that are outliers with a specified maximum or minimum value.
To handle outliers in Apache Spark, you can use the built-in functions available in the pyspark.sql.functions module to calculate statistics such as the mean and standard deviation, and then use the filter or withColumn methods to remove or replace outliers as desired.
Example
from pyspark.sql import SparkSession
from pyspark.sql.functions import mean, stddev, abs

spark = SparkSession.builder.appName("OutlierHandlingExample").getOrCreate()

# Create a sample data frame
data = [("John", 25, 90.0), ("Jane", 30, 35.5), ("Jim", 20, 200.0), ("Joan", 32, 50.0)]
columns = ["Name", "Age", "Salary"]
df = spark.createDataFrame(data, columns)

# Display the original data frame
print("Original Data Frame:")
df.show()

# Calculate mean and standard deviation
mean_salary = df.agg(mean(df["Salary"])).first()[0]
stddev_salary = df.agg(stddev(df["Salary"])).first()[0]

# Identify and filter out outliers
df = df.filter(abs(df["Salary"] - mean_salary) < stddev_salary)

# Display the cleaned data frame
print("Cleaned Data Frame:")
df.show()

spark.stop()
Output
Original Data Frame:
+----+---+------+
|Name|Age|Salary|
+----+---+------+
|John| 25|  90.0|
|Jane| 30|  35.5|
| Jim| 20| 200.0|
|Joan| 32|  50.0|
+----+---+------+

Cleaned Data Frame:
+----+---+------+
|Name|Age|Salary|
+----+---+------+
|John| 25|  90.0|
|Jane| 30|  35.5|
|Joan| 32|  50.0|
+----+---+------+
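The example above removes outlier rows entirely. The clipping approach mentioned earlier replaces extreme values with a boundary value instead. A minimal sketch using DataFrame.approxQuantile and when/otherwise on the same sample DataFrame, before it is filtered and the session stopped (the 5th/95th percentile bounds are illustrative choices):

from pyspark.sql.functions import when

# Approximate the 5th and 95th percentiles of Salary (relativeError 0.0 computes exact quantiles)
lower, upper = df.approxQuantile("Salary", [0.05, 0.95], 0.0)

# Replace values below the lower bound or above the upper bound with the bound itself
df_clipped = df.withColumn(
    "Salary",
    when(df["Salary"] < lower, lower)
    .when(df["Salary"] > upper, upper)
    .otherwise(df["Salary"])
)
df_clipped.show()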
Converting Data Types
Converting data types refers to the process of changing the representation of data from one data type to another. In data processing and analysis, it is common to encounter data in different formats that are not suitable for the desired analysis. In such cases, converting the data type to a suitable format is necessary to perform the analysis correctly.
For example, in a data frame, a column may have the data type of string, but the values in that column are numbers. In this case, the data type of the column needs to be changed to integer or float, depending on the requirements of the analysis. Similarly, a column may hold dates stored as strings; in this case, the data type of the column needs to be changed to a date type.
Converting data types is an important step in data cleaning and preprocessing, as it ensures that the data is in the correct format for the analysis.
Example
from pyspark.sql import SparkSession
from pyspark.sql.types import IntegerType, FloatType

spark = SparkSession.builder.appName("DataTypeConversion").getOrCreate()

# Create a sample data frame
data = [("John", "25", "90"), ("Jane", "30", "35"), ("Jim", "20", "200"), ("Joan", "32", "50")]
columns = ["Name", "Age", "Salary"]
df = spark.createDataFrame(data, columns)

# Display the original data frame
print("Original Data Frame:")
df.show()

# Convert the data type of the 'Age' column to integer
df = df.withColumn("Age", df["Age"].cast(IntegerType()))

# Convert the data type of the 'Salary' column to float
df = df.withColumn("Salary", df["Salary"].cast(FloatType()))

# Display the converted data frame
print("Converted Data Frame:")
df.show()

spark.stop()
Output
Original Data Frame:
+----+---+------+
|Name|Age|Salary|
+----+---+------+
|John| 25|    90|
|Jane| 30|    35|
| Jim| 20|   200|
|Joan| 32|    50|
+----+---+------+

Converted Data Frame:
+----+---+------+
|Name|Age|Salary|
+----+---+------+
|John| 25|  90.0|
|Jane| 30|  35.0|
| Jim| 20| 200.0|
|Joan| 32|  50.0|
+----+---+------+
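For the date-string case mentioned earlier, pyspark.sql.functions.to_date converts a string column to a date type. This is a minimal sketch; the JoinDate column, its sample values, and the date format are illustrative and not part of the example above:

from pyspark.sql import SparkSession
from pyspark.sql.functions import to_date

spark = SparkSession.builder.appName("DateConversion").getOrCreate()

# Sample data with dates stored as strings (hypothetical)
data = [("John", "2023-01-15"), ("Jane", "2023-02-20")]
df = spark.createDataFrame(data, ["Name", "JoinDate"])

# Convert the string column to a date type using an explicit format
df = df.withColumn("JoinDate", to_date(df["JoinDate"], "yyyy-MM-dd"))
df.printSchema()
spark.stop()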
Conclusion
Cleaning data in Apache Spark is an essential part of the data preparation process. Apache Spark provides a powerful and efficient platform for processing large datasets while performing data cleaning tasks such as handling missing values, duplicates, and outliers. The pyspark.sql.functions module provides a large number of functions which, combined with the ability to perform complex operations in a distributed environment, make Apache Spark a strong choice for data cleaning and preparation. By using the right tools and techniques, we can prepare our data for analysis, machine learning, or any other data-driven application with a higher chance of accurate results.