Spark Setup

This document provides a step-by-step guide to set up an Apache Spark cluster, including system preparation, master node configuration, and worker node setup. It details the installation of Java, downloading Spark, configuring environment variables, and writing a Spark script to analyze student data. Finally, it covers starting the cluster, submitting a Spark job, viewing results, and stopping the cluster.


Step 1: Prepare All Systems (Master + Workers)

Perform these steps on every system in the cluster:

1. Update the System:


o Open the terminal on each machine and run:
o sudo apt update && sudo apt upgrade -y
2. Install Java:
o Install Java on each machine:
o sudo apt install openjdk-11-jdk -y
o Verify the installation:
o java -version
The output should show that Java 11 is installed.
3. Download and Install Apache Spark:
o On each machine, download the latest Spark version:
o wget https://dlcdn.apache.org/spark/spark-<version>/spark-<version>-bin-hadoop3.tgz

Replace <version> with the latest version, e.g., 3.5.1.

o Extract the downloaded file:


o tar -xvzf spark-<version>-bin-hadoop3.tgz
o Move the folder to /opt:
o sudo mv spark-<version>-bin-hadoop3 /opt/spark
4. Set Environment Variables:
o Edit the .bashrc file:
o nano ~/.bashrc
o Add these lines:
o export SPARK_HOME=/opt/spark
o export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin
o export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
o Save and exit:
Press Ctrl+O, then Enter, then Ctrl+X.
o Apply changes:
o source ~/.bashrc
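
To confirm the variables took effect in the current shell, you can run a quick check (the expected values assume the paths used above):

echo $SPARK_HOME         # should print /opt/spark
spark-submit --version   # should print the Spark version you installed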

Step 2: Configure the Master Node

Do these steps only on the master node:

1. Configure Spark for the Master:


o Navigate to the Spark conf directory:
o cd /opt/spark/conf
o Copy the template files (in Spark 3.x the worker list file is named workers; older releases called it slaves):
o cp spark-env.sh.template spark-env.sh
o cp workers.template workers
o Edit spark-env.sh:
o nano spark-env.sh

Add this line:


SPARK_MASTER_HOST='your-master-ip'

Replace 'your-master-ip' with the actual IP address of the master node.

o Edit workers:
o nano workers

Add the IP addresses of all worker nodes (including the master node if it's also acting as a worker), e.g.:

master-ip
worker1-ip
worker2-ip
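
Optionally, spark-env.sh can also limit the resources each worker offers to the cluster. The variable names below are standard Spark standalone settings; the values are only illustrative:

SPARK_WORKER_CORES=2     # number of CPU cores each worker offers
SPARK_WORKER_MEMORY=2g   # amount of memory each worker offers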

2. Prepare the Data File:


o Create a CSV file with student data:
o nano students.csv

Add this sample data:

student_id,department,cgpa,semester
101,CSE,8.5,1
102,ECE,7.8,1
103,CSE,9.0,2
104,CSE,7.5,3
105,CSE,8.1,4

o Save and exit.


3. Write the Spark Script:
o Create the Python file:
o nano analyze_cse_students.py
o Add this code:

from pyspark.sql import SparkSession

# Initialize Spark session
spark = SparkSession.builder \
    .appName("Analyze CSE Students") \
    .getOrCreate()

# Load the student database
data_file = "students.csv"
students_df = spark.read.csv(data_file, header=True, inferSchema=True)

# Filter CSE students with CGPA > 8
filtered_df = students_df.filter(
    (students_df['department'] == 'CSE') & (students_df['cgpa'] > 8)
)

# Show the results
filtered_df.show()

# Save results to a file
filtered_df.write.csv("output/cse_top_students.csv", header=True)

# Stop Spark session
spark.stop()

o Save and exit.
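
If you want to extend the analysis, the same DataFrame API supports aggregations. A minimal sketch, assuming it is added to analyze_cse_students.py before spark.stop() (avg is imported from pyspark.sql.functions):

from pyspark.sql.functions import avg

# Average CGPA per department, computed from the students_df loaded above
avg_df = students_df.groupBy("department").agg(avg("cgpa").alias("avg_cgpa"))
avg_df.show()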

Step 3: Start the Cluster

Master Node:

1. Start the Spark master process:
o /opt/spark/sbin/start-master.sh
o The master URL, e.g., spark://192.168.1.100:7077, is shown in the master log and at the top of the web UI at http://<master-ip>:8080. Copy this URL.

Worker Nodes:

1. On each worker node, connect to the master:
o /opt/spark/sbin/start-worker.sh spark://<master-ip>:7077

Replace <master-ip> with the IP address of the master node.

2. Verify workers are connected:
o Open a browser on the master node and go to:
o http://<master-ip>:8080
o You should see all the workers listed there.
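
As an extra check, the jps tool that ships with the JDK lists running Java processes; on the master node the list should include a Master process, and on each worker node a Worker process:

jps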

Step 4: Submit the Spark Job

1. Run the Spark job on the master node, from the directory containing analyze_cse_students.py and students.csv:
o /opt/spark/bin/spark-submit --master spark://<master-ip>:7077 analyze_cse_students.py
2. The script will:
o Read the students.csv file.
o Filter CSE students with CGPA > 8.
o Save the filtered data to the output directory.

Note: because the script uses local file paths, students.csv and the output directory must be accessible at the same path on every node of a multi-node cluster (e.g., via a shared filesystem); otherwise copy students.csv to the same location on each worker.
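
If you want to limit how much of the cluster the job uses, spark-submit accepts standard resource flags (the values below are only examples):

/opt/spark/bin/spark-submit --master spark://<master-ip>:7077 \
  --executor-memory 1G \
  --total-executor-cores 2 \
  analyze_cse_students.py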

Step 5: View the Results

1. Navigate to the output directory:
o cd output/
2. Spark writes the results as a directory of part files, so view them with:
o cat cse_top_students.csv/part-*.csv
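
Because each Spark partition writes its own part file, the output directory may contain several part-*.csv files. If you prefer a single file, one option is to coalesce the DataFrame before writing (fine for a dataset this small, since it funnels all rows through one task); in analyze_cse_students.py the write line would become:

filtered_df.coalesce(1).write.csv("output/cse_top_students.csv", header=True)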

Step 6: Stop the Cluster

Worker Nodes:
1. Stop the worker process on each worker:
o /opt/spark/sbin/stop-worker.sh

Master Node:

1. Stop the master process:
o /opt/spark/sbin/stop-master.sh
