Basics of RDD - More Operations
More Transformations

sample(withReplacement, fraction, [seed])
Sample an RDD, with or without replacement.

val seq = sc.parallelize(1 to 100, 5)
seq.sample(false, 0.1).collect()
[8, 19, 34, 37, 43, 51, 70, 83]
seq.sample(true, 0.1).collect()
[14, 26, 40, 47, 55, 67, 69, 69]
Please note that the result will be different on every run.
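To see why the result size varies run to run, here is a plain-Python sketch of the sampling semantics (not Spark code; the function name `sample` and its exact strategy are illustrative assumptions). Without replacement, each element is kept independently with probability `fraction`; with replacement, elements may be drawn more than once.

```python
import random

def sample(data, with_replacement, fraction, seed=None):
    # Plain-Python sketch of RDD.sample() semantics (not Spark code).
    rng = random.Random(seed)
    if with_replacement:
        # Draw round(fraction * len(data)) elements; repeats are possible.
        k = round(fraction * len(data))
        return rng.choices(data, k=k)
    # Without replacement: keep each element independently with
    # probability `fraction`, so the result size varies run to run.
    return [x for x in data if rng.random() < fraction]

subset = sample(list(range(1, 101)), False, 0.1, seed=42)
print(len(subset))  # roughly 10, but depends on the seed
```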
Common Transformations (continued..)

mapPartitions(f, preservesPartitioning=False)
Return a new RDD by applying a function to each partition of this RDD.

val rdd = sc.parallelize(1 to 50, 3)
def f(l: Iterator[Int]): Iterator[Int] = {
  var sum = 0
  while (l.hasNext) {
    sum = sum + l.next
  }
  List(sum).iterator
}
rdd.mapPartitions(f).collect()
Array(136, 425, 714)
The three partitions hold 16, 17 and 17 elements respectively.
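The per-partition sums above can be modeled without Spark. This is a minimal plain-Python sketch of the semantics; the helper names `parallelize` and `map_partitions` are illustrative, not Spark APIs.

```python
def parallelize(data, num_slices):
    # Mimic sc.parallelize's slicing: slice i covers
    # indices [i*n//num_slices, (i+1)*n//num_slices).
    n = len(data)
    return [data[i * n // num_slices:(i + 1) * n // num_slices]
            for i in range(num_slices)]

def map_partitions(partitions, f):
    # f consumes an iterator over one partition and yields results;
    # the outputs of all partitions are concatenated.
    return [x for part in partitions for x in f(iter(part))]

parts = parallelize(list(range(1, 51)), 3)
result = map_partitions(parts, lambda it: [sum(it)])
print(result)  # [136, 425, 714]
```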
Common Transformations (continued..)

sortBy(keyfunc, ascending=True, numPartitions=None)
Sorts this RDD by the key computed by the given keyfunc.

keyfunc: A function used to compute the sort key for each element.
ascending: A flag to indicate whether the sorting is ascending or descending.
numPartitions: Number of partitions to create.

scala> var tmp = List(('a', 1), ('b', 2), ('1', 3), ('d', 4), ('2', 5))
scala> var rdd = sc.parallelize(tmp)
scala> rdd.sortBy(x => x._1).collect()
[('1', 3), ('2', 5), ('a', 1), ('b', 2), ('d', 4)]
scala> rdd.sortBy(x => x._2).collect()
[('a', 1), ('b', 2), ('1', 3), ('d', 4), ('2', 5)]

var rdd = sc.parallelize(Array(10, 2, 3, 21, 4, 5))
var sortedrdd = rdd.sortBy(x => x)
sortedrdd.collect()
[2, 3, 4, 5, 10, 21]
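The key-function idea maps directly onto an ordinary sort. A plain-Python sketch of the sortBy semantics (the function name `sort_by` is illustrative, not a Spark API):

```python
def sort_by(data, keyfunc, ascending=True):
    # Plain-Python model of RDD.sortBy: compute a key per element,
    # then sort by that key.
    return sorted(data, key=keyfunc, reverse=not ascending)

tmp = [('a', 1), ('b', 2), ('1', 3), ('d', 4), ('2', 5)]
print(sort_by(tmp, lambda x: x[0]))
# [('1', 3), ('2', 5), ('a', 1), ('b', 2), ('d', 4)]
print(sort_by(tmp, lambda x: x[1]))
# [('a', 1), ('b', 2), ('1', 3), ('d', 4), ('2', 5)]
```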
Common Transformations (continued..)
Pseudo set operations
Though RDD is not really a set but still the set operations try to provide you utility set functions
Basics of RDD
distinct()
+ Give the set property to your rdd
+ Expensive as shuffling is required
Set operations (Pseudo)
Basics of RDD
union()
+ Simply appends one rdd to another
+ Is not same as mathematical function
+ It may have duplicates
Set operations (Pseudo)
Basics of RDD
subtract()
+ Returns values in first RDD and not second
+ Requires Shuffling like intersection()
Set operations (Pseudo)
Basics of RDD
intersection()
+ Finds common values in RDDs
+ Also removes duplicates
+ Requires shuffling
Set operations (Pseudo)
Basics of RDD
cartesian()
+ Returns all possible pairs of (a,b)
+ a is in source RDD and b is in other RDD
Set operations (Pseudo)
Basics of RDD
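The duplicate behavior of the pseudo set operations can be sketched with plain Python lists (a model of the semantics only, not Spark code):

```python
a = [1, 2, 2, 3, 4]
b = [3, 4, 5]

# distinct(): remove duplicates (dict.fromkeys keeps first occurrences)
distinct = list(dict.fromkeys(a))             # [1, 2, 3, 4]

# union(): simple append, duplicates kept
union = a + b                                 # [1, 2, 2, 3, 4, 3, 4, 5]

# subtract(): values in a that are not in b
subtract = [x for x in a if x not in set(b)]  # [1, 2, 2]

# intersection(): common values, duplicates removed
intersection = sorted(set(a) & set(b))        # [3, 4]

# cartesian(): all possible pairs (a, b)
cartesian = [(x, y) for x in [1, 2] for y in ['a', 'b']]
# [(1, 'a'), (1, 'b'), (2, 'a'), (2, 'b')]
```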
More Actions - fold()

fold(initial value)(func)
+ Very similar to reduce
+ Provides a little extra control over the initialisation
+ Lets us specify an initial value

Aggregates the elements of each partition, and then the results of all the
partitions, using a given associative and commutative function and a neutral
"zero value".
[Diagram: e.g. with elements 1, 7, 2 in Partition 1 and 4, 7, 6 in Partition 2,
the initial value seeds the fold inside each partition, and is applied once
more when the per-partition results are combined.]

Example: concatenating with initial value "_"
var myrdd = sc.parallelize(1 to 10, 2)
var myrdd1 = myrdd.map(_.toString)
def concat(s: String, n: String): String = s + n
var s = "_"
myrdd1.fold(s)(concat)
res1: String = __12345_678910
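Why does the result contain the "_" three times? Because the zero value seeds each of the two partitions and also the final combine step. A plain-Python sketch of these semantics (not Spark code; the partitioning is assumed to match `parallelize(1 to 10, 2)`):

```python
from functools import reduce

partitions = [[1, 2, 3, 4, 5], [6, 7, 8, 9, 10]]  # 1 to 10 in 2 partitions
zero = "_"

def concat(s, n):
    return s + str(n)

# Step 1: fold each partition, starting from the zero value.
per_partition = [reduce(concat, part, zero) for part in partitions]
# ['_12345', '_678910']

# Step 2: fold the per-partition results, again starting from zero.
result = reduce(lambda a, b: a + b, per_partition, zero)
print(result)  # __12345_678910
```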
More Actions - aggregate()

aggregate(initial value)(seqOp, combOp)
1. First, the values of each partition are merged with the initial value
   using seqOp()
2. Second, the per-partition results are combined using combOp()
3. Especially useful when the output is of a different data type than the input
[Diagram: elements 1, 2, 3 | 4, 5 | 6, 7 in three partitions; seqOp() runs
within each partition, then combOp() merges the partition results into the
output.]

var rdd = sc.parallelize(1 to 100)
var init = (0, 0) // (sum, count)
def seq(t: (Int, Int), i: Int): (Int, Int) = (t._1 + i, t._2 + 1)
def comb(t1: (Int, Int), t2: (Int, Int)): (Int, Int) = (t1._1 + t2._1, t1._2 + t2._2)
var d = rdd.aggregate(init)(seq, comb)
d: (Int, Int) = (5050,100)
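The two-step structure of aggregate() can be modeled without Spark. A plain-Python sketch of the (sum, count) example, assuming for illustration that the 100 elements land in two partitions:

```python
from functools import reduce

partitions = [list(range(1, 51)), list(range(51, 101))]  # 1..100 in 2 partitions
init = (0, 0)  # (sum, count)

def seq_op(acc, x):
    # Merges one element into a partition's accumulator.
    return (acc[0] + x, acc[1] + 1)

def comb_op(a, b):
    # Merges two per-partition accumulators.
    return (a[0] + b[0], a[1] + b[1])

per_partition = [reduce(seq_op, part, init) for part in partitions]
result = reduce(comb_op, per_partition, init)
print(result)  # (5050, 100)
```

Note that the input elements are Ints while the output is a (sum, count) pair, which is exactly the case where aggregate() is needed instead of reduce() or fold().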
More Actions: countByValue()
Returns the number of times each element occurs in the RDD.

var rdd = sc.parallelize(List(1, 2, 3, 3, 5, 5, 5))
var dict = rdd.countByValue()
Map(1 -> 1, 5 -> 3, 2 -> 1, 3 -> 2)
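The same counting semantics in plain Python (a sketch, not Spark code):

```python
from collections import Counter

# Count how often each element occurs, like RDD.countByValue().
counts = Counter([1, 2, 3, 3, 5, 5, 5])
print(dict(counts))  # {1: 1, 2: 1, 3: 2, 5: 3}
```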
More Actions: top(n)
Sorts and gets the maximum n values.

var a = sc.parallelize(List(4, 4, 8, 1, 2, 3, 10, 9))
a.top(6)
Array(10, 9, 8, 4, 4, 3)
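In plain Python the same result comes from a largest-n selection (a sketch of the semantics, not Spark code):

```python
import heapq

data = [4, 4, 8, 1, 2, 3, 10, 9]
# Like RDD.top(6): the 6 largest values, in descending order.
print(heapq.nlargest(6, data))  # [10, 9, 8, 4, 4, 3]
```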
Basics of RDD
sc.parallelize(List(10, 1, 2, 9, 3, 4, 5, 6, 7)).takeOrdered(6)
var l = List((10, "SG"), (1, "AS"), (2, "AB"), (9, "AA"), (3, "SS"), (4, "RG"), (5, "AU"), (6, "DD"), (7, "ZZ"))
var r = sc.parallelize(l)
r.takeOrdered(6)(Ordering[Int].reverse.on(x => x._1))
(10,SG), (9,AA), (7,ZZ), (6,DD), (5,AU), (4,RG)
r.takeOrdered(6)(Ordering[String].reverse.on(x => x._2))
(7,ZZ), (3,SS), (10,SG), (4,RG), (6,DD), (5,AU)
r.takeOrdered(6)(Ordering[String].on(x => x._2))
(9,AA), (2,AB), (1,AS), (5,AU), (6,DD), (4,RG)
Get the N elements from an RDD ordered in ascending order or as specified by
the optional key function.
More Actions: takeOrdered()
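The key-function variants map onto a smallest-n selection in plain Python (a sketch of the semantics, not Spark code):

```python
import heapq

pairs = [(10, "SG"), (1, "AS"), (2, "AB"), (9, "AA"), (3, "SS"),
         (4, "RG"), (5, "AU"), (6, "DD"), (7, "ZZ")]

# Like r.takeOrdered(6)(Ordering[String].on(x => x._2)):
# the 6 pairs with the smallest second element, in ascending order.
result = heapq.nsmallest(6, pairs, key=lambda x: x[1])
print(result)
# [(9, 'AA'), (2, 'AB'), (1, 'AS'), (5, 'AU'), (6, 'DD'), (4, 'RG')]
```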
Basics of RDD
Applies a function to all elements of this RDD.
More Actions: foreach()
>>> def f(x:Int)= println(s"Save $x to DB")
>>> sc.parallelize(1 to 5).foreach(f)
Save 2 to DB
Save 1 to DB
Save 4 to DB
Save 5 to DB
Basics of RDD
Differences from map()
More Actions: foreach()
1. Use foreach if you don't expect any result. For example
saving to database.
2. Foreach is an action. Map is transformation
Basics of RDD
More Actions: foreachPartition(f)
Applies a function to each partition of this RDD.

def partitionSum(itr: Iterator[Int]) =
  println("The sum of the partition is " + itr.sum.toString)
sc.parallelize(1 to 40, 4).foreachPartition(partitionSum)
The sum of the partition is 155
The sum of the partition is 55
The sum of the partition is 355
The sum of the partition is 255
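The per-partition side-effect pattern can be sketched in plain Python (not Spark code; the helper name `for_each_partition` is illustrative, and in real Spark the function runs on the executors, so the order of the printed sums varies):

```python
def for_each_partition(partitions, f):
    # Applies f once per partition, for its side effects; returns nothing.
    for part in partitions:
        f(iter(part))

# 1..40 split into 4 partitions of 10 elements each (assumed layout).
parts = [list(range(i * 10 + 1, i * 10 + 11)) for i in range(4)]
sums = []
for_each_partition(parts, lambda it: sums.append(sum(it)))
print(sums)  # [55, 155, 255, 355] -- in Spark the order is not guaranteed
```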
Thank you!
Introduction to Deep Learning | CloudxLabIntroduction to Deep Learning | CloudxLab
Introduction to Deep Learning | CloudxLab
CloudxLab
 
Dimensionality Reduction | Machine Learning | CloudxLab
Dimensionality Reduction | Machine Learning | CloudxLabDimensionality Reduction | Machine Learning | CloudxLab
Dimensionality Reduction | Machine Learning | CloudxLab
CloudxLab
 
Ensemble Learning and Random Forests
Ensemble Learning and Random ForestsEnsemble Learning and Random Forests
Ensemble Learning and Random Forests
CloudxLab
 
Decision Trees
Decision TreesDecision Trees
Decision Trees
CloudxLab
 
Ad

Recently uploaded (20)

Generative Artificial Intelligence (GenAI) in Business
Generative Artificial Intelligence (GenAI) in BusinessGenerative Artificial Intelligence (GenAI) in Business
Generative Artificial Intelligence (GenAI) in Business
Dr. Tathagat Varma
 
HCL Nomad Web – Best Practices and Managing Multiuser Environments
HCL Nomad Web – Best Practices and Managing Multiuser EnvironmentsHCL Nomad Web – Best Practices and Managing Multiuser Environments
HCL Nomad Web – Best Practices and Managing Multiuser Environments
panagenda
 
Transcript: #StandardsGoals for 2025: Standards & certification roundup - Tec...
Transcript: #StandardsGoals for 2025: Standards & certification roundup - Tec...Transcript: #StandardsGoals for 2025: Standards & certification roundup - Tec...
Transcript: #StandardsGoals for 2025: Standards & certification roundup - Tec...
BookNet Canada
 
Into The Box Conference Keynote Day 1 (ITB2025)
Into The Box Conference Keynote Day 1 (ITB2025)Into The Box Conference Keynote Day 1 (ITB2025)
Into The Box Conference Keynote Day 1 (ITB2025)
Ortus Solutions, Corp
 
How Can I use the AI Hype in my Business Context?
How Can I use the AI Hype in my Business Context?How Can I use the AI Hype in my Business Context?
How Can I use the AI Hype in my Business Context?
Daniel Lehner
 
AI Changes Everything – Talk at Cardiff Metropolitan University, 29th April 2...
AI Changes Everything – Talk at Cardiff Metropolitan University, 29th April 2...AI Changes Everything – Talk at Cardiff Metropolitan University, 29th April 2...
AI Changes Everything – Talk at Cardiff Metropolitan University, 29th April 2...
Alan Dix
 
Cyber Awareness overview for 2025 month of security
Cyber Awareness overview for 2025 month of securityCyber Awareness overview for 2025 month of security
Cyber Awareness overview for 2025 month of security
riccardosl1
 
Heap, Types of Heap, Insertion and Deletion
Heap, Types of Heap, Insertion and DeletionHeap, Types of Heap, Insertion and Deletion
Heap, Types of Heap, Insertion and Deletion
Jaydeep Kale
 
Complete Guide to Advanced Logistics Management Software in Riyadh.pdf
Complete Guide to Advanced Logistics Management Software in Riyadh.pdfComplete Guide to Advanced Logistics Management Software in Riyadh.pdf
Complete Guide to Advanced Logistics Management Software in Riyadh.pdf
Software Company
 
TrsLabs - Fintech Product & Business Consulting
TrsLabs - Fintech Product & Business ConsultingTrsLabs - Fintech Product & Business Consulting
TrsLabs - Fintech Product & Business Consulting
Trs Labs
 
What is Model Context Protocol(MCP) - The new technology for communication bw...
What is Model Context Protocol(MCP) - The new technology for communication bw...What is Model Context Protocol(MCP) - The new technology for communication bw...
What is Model Context Protocol(MCP) - The new technology for communication bw...
Vishnu Singh Chundawat
 
IEDM 2024 Tutorial2_Advances in CMOS Technologies and Future Directions for C...
IEDM 2024 Tutorial2_Advances in CMOS Technologies and Future Directions for C...IEDM 2024 Tutorial2_Advances in CMOS Technologies and Future Directions for C...
IEDM 2024 Tutorial2_Advances in CMOS Technologies and Future Directions for C...
organizerofv
 
Linux Support for SMARC: How Toradex Empowers Embedded Developers
Linux Support for SMARC: How Toradex Empowers Embedded DevelopersLinux Support for SMARC: How Toradex Empowers Embedded Developers
Linux Support for SMARC: How Toradex Empowers Embedded Developers
Toradex
 
Cybersecurity Identity and Access Solutions using Azure AD
Cybersecurity Identity and Access Solutions using Azure ADCybersecurity Identity and Access Solutions using Azure AD
Cybersecurity Identity and Access Solutions using Azure AD
VICTOR MAESTRE RAMIREZ
 
Mobile App Development Company in Saudi Arabia
Mobile App Development Company in Saudi ArabiaMobile App Development Company in Saudi Arabia
Mobile App Development Company in Saudi Arabia
Steve Jonas
 
tecnologias de las primeras civilizaciones.pdf
tecnologias de las primeras civilizaciones.pdftecnologias de las primeras civilizaciones.pdf
tecnologias de las primeras civilizaciones.pdf
fjgm517
 
Drupalcamp Finland – Measuring Front-end Energy Consumption
Drupalcamp Finland – Measuring Front-end Energy ConsumptionDrupalcamp Finland – Measuring Front-end Energy Consumption
Drupalcamp Finland – Measuring Front-end Energy Consumption
Exove
 
Greenhouse_Monitoring_Presentation.pptx.
Greenhouse_Monitoring_Presentation.pptx.Greenhouse_Monitoring_Presentation.pptx.
Greenhouse_Monitoring_Presentation.pptx.
hpbmnnxrvb
 
DevOpsDays Atlanta 2025 - Building 10x Development Organizations.pptx
DevOpsDays Atlanta 2025 - Building 10x Development Organizations.pptxDevOpsDays Atlanta 2025 - Building 10x Development Organizations.pptx
DevOpsDays Atlanta 2025 - Building 10x Development Organizations.pptx
Justin Reock
 
HCL Nomad Web – Best Practices und Verwaltung von Multiuser-Umgebungen
HCL Nomad Web – Best Practices und Verwaltung von Multiuser-UmgebungenHCL Nomad Web – Best Practices und Verwaltung von Multiuser-Umgebungen
HCL Nomad Web – Best Practices und Verwaltung von Multiuser-Umgebungen
panagenda
 
Generative Artificial Intelligence (GenAI) in Business
Generative Artificial Intelligence (GenAI) in BusinessGenerative Artificial Intelligence (GenAI) in Business
Generative Artificial Intelligence (GenAI) in Business
Dr. Tathagat Varma
 
HCL Nomad Web – Best Practices and Managing Multiuser Environments
HCL Nomad Web – Best Practices and Managing Multiuser EnvironmentsHCL Nomad Web – Best Practices and Managing Multiuser Environments
HCL Nomad Web – Best Practices and Managing Multiuser Environments
panagenda
 
Transcript: #StandardsGoals for 2025: Standards & certification roundup - Tec...
Transcript: #StandardsGoals for 2025: Standards & certification roundup - Tec...Transcript: #StandardsGoals for 2025: Standards & certification roundup - Tec...
Transcript: #StandardsGoals for 2025: Standards & certification roundup - Tec...
BookNet Canada
 
Into The Box Conference Keynote Day 1 (ITB2025)
Into The Box Conference Keynote Day 1 (ITB2025)Into The Box Conference Keynote Day 1 (ITB2025)
Into The Box Conference Keynote Day 1 (ITB2025)
Ortus Solutions, Corp
 
How Can I use the AI Hype in my Business Context?
How Can I use the AI Hype in my Business Context?How Can I use the AI Hype in my Business Context?
How Can I use the AI Hype in my Business Context?
Daniel Lehner
 
AI Changes Everything – Talk at Cardiff Metropolitan University, 29th April 2...
AI Changes Everything – Talk at Cardiff Metropolitan University, 29th April 2...AI Changes Everything – Talk at Cardiff Metropolitan University, 29th April 2...
AI Changes Everything – Talk at Cardiff Metropolitan University, 29th April 2...
Alan Dix
 
Cyber Awareness overview for 2025 month of security
Cyber Awareness overview for 2025 month of securityCyber Awareness overview for 2025 month of security
Cyber Awareness overview for 2025 month of security
riccardosl1
 
Heap, Types of Heap, Insertion and Deletion
Heap, Types of Heap, Insertion and DeletionHeap, Types of Heap, Insertion and Deletion
Heap, Types of Heap, Insertion and Deletion
Jaydeep Kale
 
Complete Guide to Advanced Logistics Management Software in Riyadh.pdf
Complete Guide to Advanced Logistics Management Software in Riyadh.pdfComplete Guide to Advanced Logistics Management Software in Riyadh.pdf
Complete Guide to Advanced Logistics Management Software in Riyadh.pdf
Software Company
 
TrsLabs - Fintech Product & Business Consulting
TrsLabs - Fintech Product & Business ConsultingTrsLabs - Fintech Product & Business Consulting
TrsLabs - Fintech Product & Business Consulting
Trs Labs
 
What is Model Context Protocol(MCP) - The new technology for communication bw...
What is Model Context Protocol(MCP) - The new technology for communication bw...What is Model Context Protocol(MCP) - The new technology for communication bw...
What is Model Context Protocol(MCP) - The new technology for communication bw...
Vishnu Singh Chundawat
 
IEDM 2024 Tutorial2_Advances in CMOS Technologies and Future Directions for C...
IEDM 2024 Tutorial2_Advances in CMOS Technologies and Future Directions for C...IEDM 2024 Tutorial2_Advances in CMOS Technologies and Future Directions for C...
IEDM 2024 Tutorial2_Advances in CMOS Technologies and Future Directions for C...
organizerofv
 
Linux Support for SMARC: How Toradex Empowers Embedded Developers
Linux Support for SMARC: How Toradex Empowers Embedded DevelopersLinux Support for SMARC: How Toradex Empowers Embedded Developers
Linux Support for SMARC: How Toradex Empowers Embedded Developers
Toradex
 
Cybersecurity Identity and Access Solutions using Azure AD
Cybersecurity Identity and Access Solutions using Azure ADCybersecurity Identity and Access Solutions using Azure AD
Cybersecurity Identity and Access Solutions using Azure AD
VICTOR MAESTRE RAMIREZ
 
Mobile App Development Company in Saudi Arabia
Mobile App Development Company in Saudi ArabiaMobile App Development Company in Saudi Arabia
Mobile App Development Company in Saudi Arabia
Steve Jonas
 
tecnologias de las primeras civilizaciones.pdf
tecnologias de las primeras civilizaciones.pdftecnologias de las primeras civilizaciones.pdf
tecnologias de las primeras civilizaciones.pdf
fjgm517
 
Drupalcamp Finland – Measuring Front-end Energy Consumption
Drupalcamp Finland – Measuring Front-end Energy ConsumptionDrupalcamp Finland – Measuring Front-end Energy Consumption
Drupalcamp Finland – Measuring Front-end Energy Consumption
Exove
 
Greenhouse_Monitoring_Presentation.pptx.
Greenhouse_Monitoring_Presentation.pptx.Greenhouse_Monitoring_Presentation.pptx.
Greenhouse_Monitoring_Presentation.pptx.
hpbmnnxrvb
 
DevOpsDays Atlanta 2025 - Building 10x Development Organizations.pptx
DevOpsDays Atlanta 2025 - Building 10x Development Organizations.pptxDevOpsDays Atlanta 2025 - Building 10x Development Organizations.pptx
DevOpsDays Atlanta 2025 - Building 10x Development Organizations.pptx
Justin Reock
 
HCL Nomad Web – Best Practices und Verwaltung von Multiuser-Umgebungen
HCL Nomad Web – Best Practices und Verwaltung von Multiuser-UmgebungenHCL Nomad Web – Best Practices und Verwaltung von Multiuser-Umgebungen
HCL Nomad Web – Best Practices und Verwaltung von Multiuser-Umgebungen
panagenda
 

Apache Spark - Basics of RDD & RDD Operations | Big Data Hadoop Spark Tutorial | CloudxLab

  • 1. Basics of RDD - More Operations
  • 2. Basics of RDD More Transformations sample(withReplacement, fraction, [seed]) Sample an RDD, with or without replacement.
  • 3. Basics of RDD val seq = sc.parallelize(1 to 100, 5) seq.sample(false, 0.1).collect(); [8, 19, 34, 37, 43, 51, 70, 83] More Transformations sample(withReplacement, fraction, [seed]) Sample an RDD, with or without replacement.
  • 4. Basics of RDD More Transformations sample(withReplacement, fraction, [seed]) Sample an RDD, with or without replacement. val seq = sc.parallelize(1 to 100, 5) seq.sample(false, 0.1).collect(); [8, 19, 34, 37, 43, 51, 70, 83] seq.sample(true, 0.1).collect(); [14, 26, 40, 47, 55, 67, 69, 69] Please note that the result will be different on every run.
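The optional [seed] argument is what makes a sampling run repeatable. A plain-Scala sketch of Bernoulli sampling (an analogue of sample(false, fraction), not Spark's actual implementation) shows why the same seed always yields the same sample:

```scala
import scala.util.Random

// Keep each element with probability `fraction`, like sample(false, fraction).
// The seed fixes the random stream, so the result is reproducible.
def sampleLike(xs: Seq[Int], fraction: Double, seed: Long): Seq[Int] = {
  val rng = new Random(seed)
  xs.filter(_ => rng.nextDouble() < fraction)
}

val a = sampleLike(1 to 100, 0.1, seed = 42L)
val b = sampleLike(1 to 100, 0.1, seed = 42L)
println(a == b)  // true: same seed, same sample
```

Without a seed, Spark picks one for you, which is why the slide's results differ on every run.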
  • 5. Basics of RDD Common Transformations (continued..) mapPartitions(f, preservesPartitioning=False) Return a new RDD by applying a function to each partition of this RDD.
  • 6. Basics of RDD val rdd = sc.parallelize(1 to 50, 3) Common Transformations (continued..) mapPartitions(f, preservesPartitioning=False) Return a new RDD by applying a function to each partition of this RDD.
  • 7. Basics of RDD val rdd = sc.parallelize(1 to 50, 3) def f(l:Iterator[Int]):Iterator[Int] = { var sum = 0 while(l.hasNext){ sum = sum + l.next } return List(sum).iterator } Common Transformations (continued..) mapPartitions(f, preservesPartitioning=False) Return a new RDD by applying a function to each partition of this RDD.
  • 8. Basics of RDD val rdd = sc.parallelize(1 to 50, 3) def f(l:Iterator[Int]):Iterator[Int] = { var sum = 0 while(l.hasNext){ sum = sum + l.next } return List(sum).iterator } rdd.mapPartitions(f).collect() Array(136, 425, 714) // the three partitions hold 16, 17 and 17 elements Common Transformations (continued..) mapPartitions(f, preservesPartitioning=False) Return a new RDD by applying a function to each partition of this RDD.
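The point of mapPartitions is that f is called once per partition and receives that partition's iterator. A plain-Scala sketch (no Spark needed, assuming the 16/17/17 split that parallelize(1 to 50, 3) produces) reproduces the slide's result:

```scala
// One call per "partition": the whole iterator goes in, one sum comes out.
def f(l: Iterator[Int]): Iterator[Int] = Iterator(l.sum)

// Model the three partitions of 1..50 as Spark's range partitioning splits them.
val partitions = Seq(1 to 16, 17 to 33, 34 to 50).map(_.iterator)
val result = partitions.flatMap(f)
println(result)  // List(136, 425, 714)
```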
  • 9. Basics of RDD Common Transformations (continued..) sortBy(func, ascending=True, numPartitions=None) Sorts this RDD by the given func
  • 10. Basics of RDD func: A function used to compute the sort key for each element. Common Transformations (continued..) sortBy(func, ascending=True, numPartitions=None) Sorts this RDD by the given func
  • 11. Basics of RDD Common Transformations (continued..) sortBy(func, ascending=True, numPartitions=None) Sorts this RDD by the given func func: A function used to compute the sort key for each element. ascending: A flag to indicate whether the sorting is ascending or descending.
  • 12. Basics of RDD Common Transformations (continued..) sortBy(func, ascending=True, numPartitions=None) Sorts this RDD by the given func func: A function used to compute the sort key for each element. ascending: A flag to indicate whether the sorting is ascending or descending. numPartitions: Number of partitions to create.
  • 13. Basics of RDD ⋙ var tmp = List(('a', 1), ('b', 2), ('1', 3), ('d', 4), ('2', 5)) ⋙ var rdd = sc.parallelize(tmp) Common Transformations (continued..) sortBy(keyfunc, ascending=True, numPartitions=None) Sorts this RDD by the given keyfunc
  • 14. Basics of RDD ⋙ var tmp = List(('a', 1), ('b', 2), ('1', 3), ('d', 4), ('2', 5)) ⋙ var rdd = sc.parallelize(tmp) ⋙ rdd.sortBy(x => x._1).collect() [('1', 3), ('2', 5), ('a', 1), ('b', 2), ('d', 4)] Common Transformations (continued..) sortBy(keyfunc, ascending=True, numPartitions=None) Sorts this RDD by the given keyfunc
  • 15. Basics of RDD ⋙ var tmp = List(('a', 1), ('b', 2), ('1', 3), ('d', 4), ('2', 5)) ⋙ var rdd = sc.parallelize(tmp) ⋙ rdd.sortBy(x => x._2).collect() [('a', 1), ('b', 2), ('1', 3), ('d', 4), ('2', 5)] Common Transformations (continued..) sortBy(keyfunc, ascending=True, numPartitions=None) Sorts this RDD by the given keyfunc
  • 16. Basics of RDD Common Transformations (continued..) sortBy(keyfunc, ascending=True, numPartitions=None) Sorts this RDD by the given keyfunc var rdd = sc.parallelize(Array(10, 2, 3,21, 4, 5)) var sortedrdd = rdd.sortBy(x => x) sortedrdd.collect()
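Slide 16 leaves the output implicit. Scala collections happen to share the sortBy name, so a local analogue fills it in, including the descending case that corresponds to passing ascending = false to the RDD version:

```scala
val nums = Array(10, 2, 3, 21, 4, 5)

// Ascending sort on the identity key, like rdd.sortBy(x => x).
val asc = nums.sortBy(x => x).toList
println(asc)   // List(2, 3, 4, 5, 10, 21)

// Descending order, the local analogue of rdd.sortBy(x => x, false).
val desc = nums.sortBy(x => x).reverse.toList
println(desc)  // List(21, 10, 5, 4, 3, 2)
```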
  • 17. Basics of RDD Common Transformations (continued..) Pseudo set operations Though an RDD is not really a set, the set operations provide set-like utility functions
  • 18. Basics of RDD distinct() + Give the set property to your rdd + Expensive as shuffling is required Set operations (Pseudo)
  • 19. Basics of RDD union() + Simply appends one RDD to another + Is not the same as the mathematical union + It may have duplicates Set operations (Pseudo)
  • 20. Basics of RDD subtract() + Returns values in first RDD and not second + Requires Shuffling like intersection() Set operations (Pseudo)
  • 21. Basics of RDD intersection() + Finds common values in RDDs + Also removes duplicates + Requires shuffling Set operations (Pseudo)
  • 22. Basics of RDD cartesian() + Returns all possible pairs of (a,b) + a is in source RDD and b is in other RDD Set operations (Pseudo)
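The five pseudo-set operations above can be sketched with plain Scala collections. These are local analogues only: the RDD versions behave the same way element-wise, but distribute the data and (except for union and cartesian's pairing) require a shuffle.

```scala
val a = Seq(1, 2, 2, 3, 4)
val b = Seq(3, 4, 5)

val union        = a ++ b                      // like a.union(b): appends, keeps duplicates
val distinctA    = a.distinct                  // like a.distinct: de-duplicates
val subtract     = a.filterNot(b.contains)     // like a.subtract(b): values in a, not in b
val intersection = a.intersect(b)              // like a.intersection(b): common values only
val cartesian    = for (x <- a; y <- b) yield (x, y)  // like a.cartesian(b): all pairs

println(union)         // List(1, 2, 2, 3, 4, 3, 4, 5)
println(subtract)      // List(1, 2, 2)
println(intersection)  // List(3, 4)
println(cartesian.size)  // 15 = 5 * 3
```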
  • 23. Basics of RDD fold(initial value)(func) + Very similar to reduce + Provides a little extra control over the initialisation + Lets us specify an initial value More Actions - fold()
  • 24. Basics of RDD More Actions - fold() fold(initial value)(func) Aggregates the elements of each partition, and then the results of all partitions, using a given associative and commutative function and a neutral "zero value". [diagram: two partitions, holding 1 7 2 and 4 7 6]
  • 25. Basics of RDD More Actions - fold() fold(initial value)(func) Aggregates the elements of each partition, and then the results of all partitions, using a given associative and commutative function and a neutral "zero value". [diagram: the initial value is supplied to each partition]
  • 26. Basics of RDD More Actions - fold() fold(initial value)(func) Aggregates the elements of each partition, and then the results of all partitions, using a given associative and commutative function and a neutral "zero value". [diagram: each partition starts folding its elements from the initial value]
  • 27. Basics of RDD More Actions - fold() fold(initial value)(func) Aggregates the elements of each partition, and then the results of all partitions, using a given associative and commutative function and a neutral "zero value". [diagram: the partitions fold their elements into Result1 and Result2]
  • 28. Basics of RDD More Actions - fold() fold(initial value)(func) Aggregates the elements of each partition, and then the results of all partitions, using a given associative and commutative function and a neutral "zero value". [diagram: the driver combines Result1 and Result2, again starting from the initial value]
  • 29. Basics of RDD var myrdd = sc.parallelize(1 to 10, 2) More Actions - fold() fold(initial value)(func) Example: Concatenating to _
  • 30. Basics of RDD var myrdd = sc.parallelize(1 to 10, 2) var myrdd1 = myrdd.map(_.toString) More Actions - fold() fold(initial value)(func) Example: Concatenating to _
  • 31. Basics of RDD var myrdd = sc.parallelize(1 to 10, 2) var myrdd1 = myrdd.map(_.toString) def concat(s:String, n:String):String = s + n More Actions - fold() fold(initial value)(func) Example: Concatenating to _
  • 32. Basics of RDD More Actions - fold() var myrdd = sc.parallelize(1 to 10, 2) var myrdd1 = myrdd.map(_.toString) def concat(s:String, n:String):String = s + n var s = "_" myrdd1.fold(s)(concat) res1: String = __12345_678910 fold(initial value)(func) Example: Concatenating to _
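The example above can be emulated with plain Scala to see exactly where each "_" in the result comes from: the zero value is applied once per partition (parallelize(1 to 10, 2) holds 1..5 and 6..10) and once more on the driver.

```scala
def concat(s: String, n: String): String = s + n

val z = "_"
// The two partitions of parallelize(1 to 10, 2), as strings.
val partitions = Seq(1 to 5, 6 to 10).map(_.map(_.toString))

// Step 1: each partition folds its elements, starting from z.
val perPartition = partitions.map(p => p.foldLeft(z)(concat))
println(perPartition)  // List(_12345, _678910)

// Step 2: the driver folds the per-partition results, again starting from z.
val result = perPartition.foldLeft(z)(concat)
println(result)  // __12345_678910  (z appears 3 times: 2 partitions + driver)
```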
  • 33. Basics of RDD More Actions - aggregate() aggregate(initial value)(seqOp, combOp) 1. First, the values in each partition are merged with the initial value using seqOp 2. Then the per-partition results are combined using combOp 3. Especially useful when the output has a different data type than the input [diagram: seqOp runs on each partition (1,2,3 | 4,5 | 6,7); combOp merges the outputs]
  • 34. Basics of RDD More Actions - aggregate() aggregate(initial value)(seqOp, combOp) 1. First, the values in each partition are merged with the initial value using seqOp 2. Then the per-partition results are combined using combOp 3. Especially useful when the output has a different data type than the input
  • 35. Basics of RDD More Actions - aggregate()
  • 36. Basics of RDD var rdd = sc.parallelize(1 to 100) More Actions - aggregate() aggregate(initial value)(seqOp, combOp) 1. First, the values in each partition are merged with the initial value using seqOp 2. Then the per-partition results are combined using combOp 3. Especially useful when the output has a different data type than the input
  • 37. Basics of RDD var rdd = sc.parallelize(1 to 100) var init = (0, 0) // sum, count More Actions - aggregate() aggregate(initial value)(seqOp, combOp) 1. First, the values in each partition are merged with the initial value using seqOp 2. Then the per-partition results are combined using combOp 3. Especially useful when the output has a different data type than the input
  • 38. Basics of RDD var rdd = sc.parallelize(1 to 100) var init = (0, 0) // sum, count def seq(t:(Int, Int), i:Int): (Int, Int) = (t._1 + i, t._2 + 1) More Actions - aggregate() aggregate(initial value)(seqOp, combOp) 1. First, the values in each partition are merged with the initial value using seqOp 2. Then the per-partition results are combined using combOp 3. Especially useful when the output has a different data type than the input
  • 39. Basics of RDD More Actions - aggregate() aggregate(initial value)(seqOp, combOp) 1. First, the values in each partition are merged with the initial value using seqOp 2. Then the per-partition results are combined using combOp 3. Especially useful when the output has a different data type than the input var rdd = sc.parallelize(1 to 100) var init = (0, 0) // sum, count def seq(t:(Int, Int), i:Int): (Int, Int) = (t._1 + i, t._2 + 1) def comb(t1:(Int, Int), t2:(Int, Int)): (Int, Int) = (t1._1 + t2._1, t1._2 + t2._2) var d = rdd.aggregate(init)(seq, comb) res6: (Int, Int) = (5050,100)
  • 40. Basics of RDD More Actions - aggregate() var rdd = sc.parallelize(1 to 100) var init = (0, 0) // sum, count def seq(t:(Int, Int), i:Int): (Int, Int) = (t._1 + i, t._2 + 1) def comb(t1:(Int, Int), t2:(Int, Int)): (Int, Int) = (t1._1 + t2._1, t1._2 + t2._2) var d = rdd.aggregate(init)(seq, comb) aggregate(initial value)(seqOp, combOp) 1. First, the values in each partition are merged with the initial value using seqOp 2. Then the per-partition results are combined using combOp 3. Especially useful when the output has a different data type than the input res6: (Int, Int) = (5050,100)
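The aggregate() example above can also be emulated locally. Note the type change that makes aggregate necessary: the RDD holds Ints, but the result is a (sum, count) pair, so seqOp's two arguments have different types. The 2-partition split below is an illustrative assumption (the slide's parallelize call does not fix a partition count); the final pair is the same for any split.

```scala
// seqOp: fold one partition's Ints into a running (sum, count) pair.
def seq(t: (Int, Int), i: Int): (Int, Int) = (t._1 + i, t._2 + 1)
// combOp: merge two (sum, count) pairs on the driver.
def comb(t1: (Int, Int), t2: (Int, Int)): (Int, Int) = (t1._1 + t2._1, t1._2 + t2._2)

val init = (0, 0)  // (sum, count)
val partitions = Seq(1 to 50, 51 to 100)  // assumed 2-way split of 1..100
val perPartition = partitions.map(p => p.foldLeft(init)(seq))
val d = perPartition.foldLeft(init)(comb)
println(d)                     // (5050,100)
println(d._1.toDouble / d._2)  // 50.5, the average
```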
  • 41. Basics of RDD Number of times each element occurs in the RDD. More Actions: countByValue() 1 2 3 3 5 5 5 var rdd = sc.parallelize(List(1, 2, 3, 3, 5, 5, 5)) var dict = rdd.countByValue() dict Map(1 -> 1, 5 -> 3, 2 -> 1, 3 -> 2)
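countByValue's result can be reproduced with plain Scala by grouping equal values and counting each group; the RDD version does the same per partition and then merges the counts on the driver.

```scala
val xs = List(1, 2, 3, 3, 5, 5, 5)

// Group equal values, then replace each group by its size.
val counts = xs.groupBy(identity).map { case (k, v) => k -> v.size }
println(counts)  // a Map with 1 -> 1, 2 -> 1, 3 -> 2, 5 -> 3 (order unspecified)
```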
  • 42. Basics of RDD Sorts and gets the maximum n values. More Actions: top(n) 4 4 8 1 2 3 10 9 var a=sc.parallelize(List(4,4,8,1,2, 3, 10, 9)) a.top(6) Array(10, 9, 8, 4, 4, 3)
  • 43. Basics of RDD sc.parallelize(List(10, 1, 2, 9, 3, 4, 5, 6, 7)).takeOrdered(6) var l = List((10, "SG"), (1, "AS"), (2, "AB"), (9, "AA"), (3, "SS"), (4, "RG"), (5, "AU"), (6, "DD"), (7, "ZZ")) var r = sc.parallelize(l) r.takeOrdered(6)(Ordering[Int].reverse.on(x => x._1)) (10,SG), (9,AA), (7,ZZ), (6,DD), (5,AU), (4,RG) r.takeOrdered(6)(Ordering[String].reverse.on(x => x._2)) (7,ZZ), (3,SS), (10,SG), (4,RG), (6,DD), (5,AU) r.takeOrdered(6)(Ordering[String].on(x => x._2)) (9,AA), (2,AB), (1,AS), (5,AU), (6,DD), (4,RG) Get the N elements from an RDD ordered in ascending order or as specified by the optional key function. More Actions: takeOrdered()
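The Ordering combinators used in the takeOrdered example work on plain Scala collections too, and takeOrdered(n)(ord) is equivalent to sorting by ord and taking the first n, which makes the slide's outputs easy to check locally:

```scala
val l = List((10, "SG"), (1, "AS"), (2, "AB"), (9, "AA"), (3, "SS"),
             (4, "RG"), (5, "AU"), (6, "DD"), (7, "ZZ"))

// Descending by the Int key: Ordering[Int].reverse, keyed on the first element.
val byKeyDesc = l.sorted(Ordering[Int].reverse.on[(Int, String)](_._1)).take(6)
println(byKeyDesc)  // List((10,SG), (9,AA), (7,ZZ), (6,DD), (5,AU), (4,RG))

// Ascending by the String value, keyed on the second element.
val byNameAsc = l.sorted(Ordering[String].on[(Int, String)](_._2)).take(6)
println(byNameAsc)  // List((9,AA), (2,AB), (1,AS), (5,AU), (6,DD), (4,RG))
```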
  • 44. Basics of RDD Applies a function to all elements of this RDD. More Actions: foreach() >>> def f(x:Int)= println(s"Save $x to DB") >>> sc.parallelize(1 to 5).foreach(f) Save 2 to DB Save 1 to DB Save 3 to DB Save 4 to DB Save 5 to DB (the order varies across runs)
  • 45. Basics of RDD Differences from map() More Actions: foreach() 1. Use foreach if you don't expect any result, for example when saving to a database. 2. foreach is an action; map is a transformation
  • 46. Basics of RDD Applies a function to each partition of this RDD. More Actions: foreachPartition(f)
  • 47. Basics of RDD def partitionSum(itr: Iterator[Int]) = println("The sum of the partition is " + itr.sum.toString) Applies a function to each partition of this RDD. More Actions: foreachPartition(f)
  • 48. Basics of RDD Applies a function to each partition of this RDD. More Actions: foreachPartition(f) def partitionSum(itr: Iterator[Int]) = println("The sum of the partition is " + itr.sum.toString) sc.parallelize(1 to 40, 4).foreachPartition(partitionSum) The sum of the partition is 155 The sum of the partition is 55 The sum of the partition is 355 The sum of the partition is 255
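The usual reason to prefer foreachPartition over foreach is per-partition setup and teardown, e.g. opening one database connection per partition instead of one per element. A plain-Scala sketch of the pattern, with a hypothetical Connection class standing in for a real database client:

```scala
// Hypothetical stand-in for a real DB connection; only counts writes.
class Connection {
  var writes = 0
  def insert(x: Int): Unit = writes += 1
  def close(): Unit = ()
}

var opened = 0
def savePartition(itr: Iterator[Int]): Unit = {
  val conn = new Connection  // one connection per partition...
  opened += 1
  itr.foreach(conn.insert)   // ...reused for every element in that partition
  conn.close()
}

// Emulate foreachPartition over 4 partitions of 1..40.
(1 to 40).grouped(10).foreach(p => savePartition(p.iterator))
println(opened)  // 4 connections, versus 40 with an element-wise foreach
```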