Loading and Saving Data
The document discusses loading and saving data from JSON and CSV files using PySpark. It shows how to read CSV and JSON files into DataFrames, manipulate the DataFrames by selecting columns, filtering rows, adding new columns, and sorting data. It also demonstrates how to write the modified DataFrames back to CSV and JSON files.
Reading, manipulating and writing a modified DataFrame to another CSV

from pyspark.sql import SparkSession
import findspark

findspark.init()

spark = SparkSession.builder.appName("CSVDataManipulation").getOrCreate()
csv_file_path = "bollywood.csv"

# Read the CSV file into a DataFrame
df = spark.read.csv(csv_file_path, header=True, inferSchema=True)

# Show the first few rows of the DataFrame
print("Initial DataFrame:")
df.show()

# Selecting specific columns
selected_columns = df.select("Release Date", "MovieName")
print("Selected Columns:")
selected_columns.show()

# Filtering data based on a condition
filtered_data = df.filter(df.Budget > 10)
print("Filtered Data:")
filtered_data.show()

# Sorting data
sorted_data = df.orderBy("Budget", ascending=False)
print("Sorted Data:")
sorted_data.show()

# Adding a new column
df_with_new_column = df.withColumn("BudgetPlusTen", df.Budget + 10)
print("DataFrame with a New Column:")
df_with_new_column.show()

# Save the final DataFrame into another CSV file
output_csv_file_path = "new1.csv"
df_with_new_column.toPandas().to_csv(output_csv_file_path, header=True, index=False)

# Stop the SparkSession
spark.stop()

JSON read()

from pyspark.sql import SparkSession
import findspark

findspark.init()

spark = SparkSession.builder.appName("JSONToDataFrame").getOrCreate()

# Specify the path to the JSON file
json_file_path = "inp.json"

# Read the JSON file into a DataFrame
df = spark.read.json(json_file_path)

# Show the DataFrame
df.show()

# Stop the SparkSession
spark.stop()

JSON read, modify and write

from pyspark.sql import SparkSession
import findspark
import json

findspark.init()

spark = SparkSession.builder.appName("JSONToDataFrame").getOrCreate()

json_file_path = "inp.json"
df = spark.read.json(json_file_path)
df.show()

# Adding a new column
df_with_new_column = df.withColumn("age modified", df.age + 10)
print("DataFrame with a New Column:")
df_with_new_column.show()

# Convert each row of the DataFrame to a JSON string
json_data = df_with_new_column.toJSON().collect()

# Write the collected JSON strings as a list to a file
with open("new22.json", "w") as json_file:
    json.dump(json_data, json_file, indent=4)

# Stop the SparkSession
spark.stop()
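The examples above save results by converting to pandas (toPandas().to_csv) or by collecting JSON strings and dumping them with the json module. For comparison, here is a minimal sketch, not part of the original slides, of the same save step using Spark's built-in DataFrameWriter. The output paths output_csv_dir and output_json_dir are placeholder names, and Spark writes each as a directory of part files rather than a single file.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("NativeWriteExample").getOrCreate()

# Same read and transform as in the CSV example above
df = spark.read.csv("bollywood.csv", header=True, inferSchema=True)
df_with_new_column = df.withColumn("BudgetPlusTen", df.Budget + 10)

# Write with Spark's DataFrameWriter instead of converting to pandas.
# Note: each call produces a directory of part files, not a single CSV/JSON file.
df_with_new_column.write.csv("output_csv_dir", header=True, mode="overwrite")
df_with_new_column.write.json("output_json_dir", mode="overwrite")

spark.stop()

This keeps the write distributed, which matters once the DataFrame is too large to collect on the driver; toPandas() and collect() pull all rows into driver memory first.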