Apache Spark Tutorial
Farzad Nozarian
4/18/15 @AUT
Purpose
This tutorial provides a quick introduction to using Spark. We will first
introduce the API through Spark’s interactive shell, then show how to write
applications in Scala.
To follow along with this guide, first download a packaged release of Spark
from the Spark website.
Interactive Analysis with the Spark Shell - Basics
• Spark’s shell provides a simple way to learn the API, as well as a powerful tool
to analyze data interactively.
• It is available in either Scala or Python.
• Start it by running the following in the Spark directory:
• RDDs can be created from Hadoop InputFormats (such as HDFS files) or by
transforming other RDDs.
• Let’s make a new RDD from the text of the README file in the Spark source
directory:
./bin/spark-shell
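• The shell creates a SparkContext for you, available as the variable sc (used below). As an aside not on the original slide, the Python shell starts analogously:
./bin/pyspark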
scala> val textFile = sc.textFile("README.md")
textFile: spark.RDD[String] = spark.MappedRDD@2ee9b6e3
Interactive Analysis with the Spark Shell - Basics
• RDDs have actions, which return values, and transformations, which return
pointers to new RDDs. Let’s start with a few actions:
• Now let’s use a transformation:
• We will use the filter transformation to return a new RDD with a subset of the
items in the file.
scala> textFile.count() // Number of items in this RDD
res0: Long = 126
scala> textFile.first() // First item in this RDD
res1: String = # Apache Spark
scala> val linesWithSpark = textFile.filter(line => line.contains("Spark"))
linesWithSpark: spark.RDD[String] = spark.FilteredRDD@7dd4af09
Interactive Analysis with the Spark Shell - More on RDD Operations
• We can chain together transformations and actions:
• RDD actions and transformations can be used for more complex computations.
• Let’s say we want to find the line with the most words:
• The arguments to map and reduce are Scala function literals (closures), and can
use any language feature or Scala/Java library.
scala> textFile.filter(line => line.contains("Spark")).count() // How many lines contain "Spark"?
res3: Long = 15
scala> textFile.map(line => line.split(" ").size).reduce((a, b) => if (a > b) a else b)
res4: Int = 15
Interactive Analysis with the Spark Shell - More on RDD Operations
• We can easily call functions declared elsewhere.
• We’ll use the Math.max() function to make the previous code easier to understand:
• One common data flow pattern is MapReduce, as popularized by Hadoop.
• Spark can implement MapReduce flows easily:
scala> import java.lang.Math
import java.lang.Math
scala> textFile.map(line => line.split(" ").size).reduce((a, b) => Math.max(a, b))
res5: Int = 15
scala> val wordCounts = textFile.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey((a, b) => a + b)
wordCounts: spark.RDD[(String, Int)] = spark.ShuffledAggregatedRDD@71f027b8
Interactive Analysis with the Spark Shell - More on RDD Operations
• Here, we combined the flatMap, map and reduceByKey transformations to
compute the per-word counts in the file as an RDD of (String, Int) pairs.
• To collect the word counts in our shell, we can use the collect action:
scala> wordCounts.collect()
res6: Array[(String, Int)] = Array((means,1), (under,2), (this,3),
(Because,1), (Python,2), (agree,1), (cluster.,1), ...)
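• Note that collect() brings the entire result back to the driver. On a larger dataset we might instead sort by count and take only the top entries; a minimal sketch not on the original slides (output omitted):
scala> wordCounts.sortBy(_._2, ascending = false).take(3)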
Interactive Analysis with the Spark Shell - Caching
• Spark also supports pulling data sets into a cluster-wide in-memory cache.
• This is very useful when data is accessed repeatedly:
• Querying a small “hot” dataset.
• Running an iterative algorithm like PageRank.
• Let’s mark our linesWithSpark dataset to be cached:
scala> linesWithSpark.cache()
res7: spark.RDD[String] = spark.FilteredRDD@17e51082
scala> linesWithSpark.count()
res8: Long = 15
scala> linesWithSpark.count()
res9: Long = 15
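• Once the cached data is no longer needed, it can be dropped from the cache again; a small addition beyond the original slide:
scala> linesWithSpark.unpersist()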
Self-Contained Applications
/* SimpleApp.scala */
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf

object SimpleApp {
  def main(args: Array[String]) {
    val logFile = "YOUR_SPARK_HOME/README.md" // Should be some file on your system
    val conf = new SparkConf().setAppName("Simple Application")
    val sc = new SparkContext(conf)
    val logData = sc.textFile(logFile, 2).cache()
    val numAs = logData.filter(line => line.contains("a")).count()
    val numBs = logData.filter(line => line.contains("b")).count()
    println("Lines with a: %s, Lines with b: %s".format(numAs, numBs))
  }
}
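• As a small addition (not on the original slide), it is good practice to end main() with sc.stop() so the context shuts down cleanly:
sc.stop() // releases the application's resources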
Self-Contained Applications (Cont.)
• This program just counts the number of lines containing ‘a’ and the
number containing ‘b’ in the Spark README.
• Note that you’ll need to replace YOUR_SPARK_HOME with the location
where Spark is installed.
• Note that applications should define a main() method instead of
extending scala.App. Subclasses of scala.App may not work correctly.
• Unlike the earlier examples with the Spark shell, which initializes its own
SparkContext, we initialize a SparkContext as part of the program.
• We pass the SparkContext constructor a SparkConf object which
contains information about our application.
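• For illustration only (not on the original slide), SparkConf can also set the master URL and other options directly, though with spark-submit the master is normally given on the command line instead:
val conf = new SparkConf()
  .setAppName("Simple Application")
  .setMaster("local[2]") // illustrative; usually supplied via spark-submit --master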
Self-Contained Applications (Cont.)
• Our application depends on the Spark API, so we’ll also include an sbt configuration file, simple.sbt, which declares Spark as a dependency.
• For sbt to work correctly, we’ll need to lay out SimpleApp.scala and simple.sbt according to the typical directory structure.
• Then we can create a JAR package containing the application’s code and
use the spark-submit script to run our program.
name := "Simple Project"

version := "1.0"

scalaVersion := "2.10.4"

libraryDependencies += "org.apache.spark" %% "spark-core" % "1.3.1"
Self-Contained Applications (Cont.)
# Your directory layout should look like this
$ find .
.
./simple.sbt
./src
./src/main
./src/main/scala
./src/main/scala/SimpleApp.scala
# Package a jar containing your application
$ sbt package
...
[info] Packaging {..}/{..}/target/scala-2.10/simple-project_2.10-1.0.jar
# Use spark-submit to run your application
$ YOUR_SPARK_HOME/bin/spark-submit \
  --class "SimpleApp" \
  --master local[4] \
  target/scala-2.10/simple-project_2.10-1.0.jar
...
Lines with a: 46, Lines with b: 23