Spark Setup

This document provides a step-by-step guide to set up an Apache Spark cluster, including system preparation, master node configuration, and worker node setup. It details the installation of Java, downloading Spark, configuring environment variables, and writing a Spark script to analyze student data. Finally, it covers starting the cluster, submitting a Spark job, viewing results, and stopping the cluster.


Step 1: Prepare All Systems (Master + Workers)

Perform these steps on every system in the cluster:

1. Update the System:


o Open the terminal on each machine and run:
o sudo apt update && sudo apt upgrade -y
2. Install Java:
o Install Java on each machine:
o sudo apt install openjdk-11-jdk -y
o Verify the installation:
o java -version
The output should show that Java 11 is installed.
3. Download and Install Apache Spark:
o On each machine, download the latest Spark version:
o wget https://dlcdn.apache.org/spark/spark-<version>/spark-<version>-bin-hadoop3.tgz

Replace <version> with the latest version, e.g., 3.5.1.

o Extract the downloaded file:


o tar -xvzf spark-<version>-bin-hadoop3.tgz
o Move the folder to /opt:
o sudo mv spark-<version>-bin-hadoop3 /opt/spark
4. Set Environment Variables:
o Edit the .bashrc file:
o nano ~/.bashrc
o Add these lines:
o export SPARK_HOME=/opt/spark
o export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin
o export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
o Save and exit:
Press Ctrl+O, then Enter, then Ctrl+X.
o Apply changes:
o source ~/.bashrc
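
To confirm the variables took effect in the current shell, you can run a quick check (the expected values assume the paths used above):

echo $SPARK_HOME         # should print /opt/spark
spark-submit --version   # should print the Spark version you installed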

Step 2: Configure the Master Node

Do these steps only on the master node:

1. Configure Spark for the Master:


o Navigate to the Spark conf directory:
o cd /opt/spark/conf
o Copy the template files (in Spark 3.x the worker list file is named workers; older releases called it slaves):
o cp spark-env.sh.template spark-env.sh
o cp workers.template workers
o Edit spark-env.sh:
o nano spark-env.sh

Add this line:


SPARK_MASTER_HOST='your-master-ip'

Replace 'your-master-ip' with the actual IP address of the master node.

o Edit workers:
o nano workers

Add the IP addresses of all worker nodes (including the master node if it's also acting as a worker), e.g.:

master-ip
worker1-ip
worker2-ip
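
Optionally, spark-env.sh can also limit the resources each worker offers to the cluster. The variable names below are standard Spark standalone settings; the values are only illustrative:

SPARK_WORKER_CORES=2     # number of CPU cores each worker offers
SPARK_WORKER_MEMORY=2g   # amount of memory each worker offers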

2. Prepare the Data File:


o Create a CSV file with student data:
o nano students.csv

Add this sample data:

student_id,department,cgpa,semester
101,CSE,8.5,1
102,ECE,7.8,1
103,CSE,9.0,2
104,CSE,7.5,3
105,CSE,8.1,4

o Save and exit.


3. Write the Spark Script:
o Create the Python file:
o nano analyze_cse_students.py
o Add this code:

from pyspark.sql import SparkSession

# Initialize Spark session
spark = SparkSession.builder \
    .appName("Analyze CSE Students") \
    .getOrCreate()

# Load the student database
data_file = "students.csv"
students_df = spark.read.csv(data_file, header=True, inferSchema=True)

# Filter CSE students with CGPA > 8
filtered_df = students_df.filter(
    (students_df['department'] == 'CSE') & (students_df['cgpa'] > 8)
)

# Show the results
filtered_df.show()

# Save results to a file
filtered_df.write.csv("output/cse_top_students.csv", header=True)

# Stop Spark session
spark.stop()

o Save and exit.
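
If you want to extend the analysis, the same DataFrame API supports aggregations. A minimal sketch, assuming it is added to analyze_cse_students.py before spark.stop() (avg is imported from pyspark.sql.functions):

from pyspark.sql.functions import avg

# Average CGPA per department, computed from the students_df loaded above
avg_df = students_df.groupBy("department").agg(avg("cgpa").alias("avg_cgpa"))
avg_df.show()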

Step 3: Start the Cluster

Master Node:

1. Start the Spark master process:
o /opt/spark/sbin/start-master.sh
o The master URL, e.g., spark://192.168.1.100:7077, is shown in the master log and at the top of the web UI at http://<master-ip>:8080. Copy this URL.

Worker Nodes:

1. On each worker node, connect to the master:
o /opt/spark/sbin/start-worker.sh spark://<master-ip>:7077

Replace <master-ip> with the IP address of the master node.

2. Verify workers are connected:
o Open a browser on the master node and go to:
o http://<master-ip>:8080
o You should see all the workers listed there.
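
As an extra check, the jps tool that ships with the JDK lists running Java processes; on the master node the list should include a Master process, and on each worker node a Worker process:

jps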

Step 4: Submit the Spark Job

1. Run the Spark job on the master node, from the directory containing analyze_cse_students.py and students.csv:
o /opt/spark/bin/spark-submit --master spark://<master-ip>:7077 analyze_cse_students.py
2. The script will:
o Read the students.csv file.
o Filter CSE students with CGPA > 8.
o Save the filtered data to the output directory.

Note: because the script uses local file paths, students.csv and the output directory must be accessible at the same path on every node of a multi-node cluster (e.g., via a shared filesystem); otherwise copy students.csv to the same location on each worker.
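
If you want to limit how much of the cluster the job uses, spark-submit accepts standard resource flags (the values below are only examples):

/opt/spark/bin/spark-submit --master spark://<master-ip>:7077 \
  --executor-memory 1G \
  --total-executor-cores 2 \
  analyze_cse_students.py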

Step 5: View the Results

1. Navigate to the output directory:
o cd output/
2. Spark writes the results as a directory of part files, so view them with:
o cat cse_top_students.csv/part-*.csv
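
Because each Spark partition writes its own part file, the output directory may contain several part-*.csv files. If you prefer a single file, one option is to coalesce the DataFrame before writing (fine for a dataset this small, since it funnels all rows through one task); in analyze_cse_students.py the write line would become:

filtered_df.coalesce(1).write.csv("output/cse_top_students.csv", header=True)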

Step 6: Stop the Cluster

Worker Nodes:
1. Stop the worker process on each worker:
o /opt/spark/sbin/stop-worker.sh

Master Node:

1. Stop the master process:
o /opt/spark/sbin/stop-master.sh
