Loading and Saving Data

The document discusses loading and saving data from JSON and CSV files using PySpark. It shows how to read CSV and JSON files into DataFrames, manipulate the DataFrames by selecting columns, filtering rows, adding new columns, and sorting data. It also demonstrates how to write the modified DataFrames back to CSV and JSON files.

Loading and saving data from JSON and CSV files


Reading from a CSV file into a DataFrame
import findspark
findspark.init()  # locate the local Spark installation before importing pyspark

from pyspark.sql import SparkSession

# Create a SparkSession
spark = SparkSession.builder.appName("CSVToDataFrame").getOrCreate()

# Specify the path to the CSV file
csv_file_path = "bollywood.csv"

# Read the CSV file into a DataFrame
df = spark.read.csv(csv_file_path, header=True, inferSchema=True)

# Show the first few rows of the DataFrame
df.show()

# Stop the SparkSession
spark.stop()
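
Letting Spark infer the schema requires an extra pass over the data. Where the column types are known up front, a schema can be declared explicitly instead. The sketch below reuses spark and csv_file_path from the block above (so it would run before spark.stop()); the column names match those used later in these slides, but the chosen types are an assumption.

from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# Declare the schema up front instead of using inferSchema
# (the column types here are assumed, not taken from the slides)
schema = StructType([
    StructField("Release Date", StringType(), True),
    StructField("MovieName", StringType(), True),
    StructField("Budget", IntegerType(), True),
])

# Read with the explicit schema; run this before spark.stop()
df = spark.read.csv(csv_file_path, header=True, schema=schema)
df.printSchema()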
Reading, manipulating and writing a modified DataFrame to another CSV
import findspark
findspark.init()

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("CSVDataManipulation").getOrCreate()
csv_file_path = "bollywood.csv"

# Read the CSV file into a DataFrame
df = spark.read.csv(csv_file_path, header=True, inferSchema=True)

# Show the first few rows of the DataFrame
print("Initial DataFrame:")
df.show()

# Selecting specific columns
selected_columns = df.select("Release Date", "MovieName")
print("Selected Columns:")
selected_columns.show()

# Filtering data based on a condition
filtered_data = df.filter(df.Budget > 10)
print("Filtered Data:")
filtered_data.show()

# Sorting data
sorted_data = df.orderBy("Budget", ascending=False)
print("Sorted Data:")
sorted_data.show()

# Adding a new column
df_with_new_column = df.withColumn("BudgetPlusTen", df.Budget + 10)
print("DataFrame with a New Column:")
df_with_new_column.show()

# Save the final DataFrame into another CSV file
output_csv_file_path = "new1.csv"
df_with_new_column.toPandas().to_csv(output_csv_file_path, header=True, index=False)

# Stop the SparkSession
spark.stop()
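
toPandas() collects the entire DataFrame onto the driver, which only works while the data fits in driver memory. A minimal alternative sketch using Spark's own CSV writer (run before spark.stop()); note that Spark writes a directory of part files rather than a single file, and the output path new1_dir is a placeholder:

# Write directly from Spark; the output is a directory of part files
(df_with_new_column
    .coalesce(1)                      # optional: produce a single part file
    .write
    .mode("overwrite")
    .option("header", True)
    .csv("new1_dir"))                 # placeholder output directory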
Reading from a JSON file into a DataFrame
import findspark
findspark.init()

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("JSONToDataFrame").getOrCreate()

# Specify the path to the JSON file
json_file_path = "inp.json"

# Read the JSON file into a DataFrame
df = spark.read.json(json_file_path)

# Show the DataFrame
df.show()

# Stop the SparkSession
spark.stop()
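
spark.read.json expects JSON Lines by default, i.e. one JSON object per line. If inp.json were instead a single JSON array or a pretty-printed document, the multiLine option would be needed; a minimal sketch under that assumption (run before spark.stop()):

# Only needed when the file is one JSON document rather than JSON Lines
df = spark.read.option("multiLine", True).json(json_file_path)
df.show()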
JSON read, modify and write
import findspark
findspark.init()

import json
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("JSONToDataFrame").getOrCreate()
json_file_path = "inp.json"

df = spark.read.json(json_file_path)
df.show()

# Add a new column derived from the existing age column
df_with_new_column = df.withColumn("age modified", df.age + 10)
print("DataFrame with a New Column:")
df_with_new_column.show()

# Collect each row as a JSON string and parse it back into a Python dict
json_data = [json.loads(row) for row in df_with_new_column.toJSON().collect()]

# Write the rows as a single JSON array to a file
with open("new22.json", "w") as json_file:
    json.dump(json_data, json_file, indent=4)

# Stop the SparkSession
spark.stop()
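
As with the CSV example, toJSON().collect() pulls every row to the driver. For larger data, Spark's own JSON writer avoids that, at the cost of producing a directory of JSON Lines part files; the output path new22_dir is a placeholder, and the call would run before spark.stop():

# Write JSON Lines directly from Spark; the output is a directory of part files
(df_with_new_column
    .write
    .mode("overwrite")
    .json("new22_dir"))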
