Key-Value RDD
Transformations on Pair RDDs
keys()
Returns an RDD with the keys of each tuple.
>>> var m = sc.parallelize(List((1, 2), (3, 4))).keys
>>> m.collect()
Array[Int] = Array(1, 3)
Key-Value RDD
Transformations on Pair RDDs
values()
Returns an RDD with the values of each tuple.
>>> var m = sc.parallelize(List((1, 2), (3, 4))).values
>>> m.collect()
Array(2, 4)
Key-Value RDD
Transformations on Pair RDDs
groupByKey()
Group values with the same key.
var rdd = sc.parallelize(List((1, 2), (3, 4), (3, 6)))
var rdd1 = rdd.groupByKey()
var vals = rdd1.collect()
for (i <- vals) {
  for (k <- i.productIterator) {
    println("\t" + k)
  }
}
Key-Value RDD
Questions - Set Operations
What will be the result of the following?
var rdd = sc.parallelize(Array(("a", 1), ("b", 1), ("a", 1)));
rdd.groupByKey().mapValues(_.size).collect()
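For reference, a sketch of the expected result of the snippet above (pair order may vary with partitioning):
rdd.groupByKey().mapValues(_.size).collect()
// groupByKey gathers Iterable(1, 1) under "a" and Iterable(1) under "b",
// so mapValues(_.size) returns Array((a,2), (b,1))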
Key-Value RDD
Transformations on Pair RDDs
combineByKey(createCombiner, mergeValue, mergeCombiners, numPartitions=None)
Combine values with the same key using a different result type.
Turns RDD[(K, V)] into a result of type RDD[(K, C)]
createCombiner, which turns a V into a C (e.g., creates a one-element list)
mergeValue, to merge a V into a C (e.g., adds it to the end of a list)
mergeCombiners, to combine two C’s into a single one.
var myrdd = sc.parallelize(List(1,2,3,4,5)).map(("x", _))
def cc(x:Int):String = x.toString
def mv(x:String, y:Int):String = {x + "," + y}
def mc(x:String, y:String):String = {x + ", " + y}
myrdd.combineByKey(cc, mv, mc).collect()
Array((x,1, 2, 3, 4,5))
Key-Value RDD
Example: combineByKey
var myrdd = sc.parallelize(List(1,2,3), 2).map(("x", _))
def cc(x:Int):String = x.toString
def mv(x:String, y:Int):String = {x + "," + y.toString}
def mc(x:String, y:String):String = {x + ", " + y}
myrdd.combineByKey(cc, mv, mc).collect()(0)._2
String = 1,2, 3
Step by step, with the two partitions holding (x,1), (x,2) and (x,3):
cc (createCombiner) runs on the first value of a key seen in each partition: cc(1) = "1" and cc(3) = "3"
mv (mergeValue) folds every further value of that key in the same partition into its combiner: mv("1", 2) = "1,2"
mc (mergeCombiners) merges the per-partition combiners into the final value: mc("1,2", "3") = "1,2, 3"
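Beyond this string-concatenation walkthrough, a common real-world use of combineByKey is a per-key average. The sketch below is not from the original slides; the data and the names createComb, mergeVal and mergeComb are illustrative assumptions:
// Hypothetical (name, score) pairs
var scores = sc.parallelize(List(("ram", 10), ("ram", 30), ("shyam", 25)), 2)
// createCombiner: start a (sum, count) pair from the first score of a key in a partition
def createComb(v: Int) = (v, 1)
// mergeValue: fold another score of the same key into the partition-local (sum, count)
def mergeVal(c: (Int, Int), v: Int) = (c._1 + v, c._2 + 1)
// mergeCombiners: add the partial (sum, count) pairs coming from different partitions
def mergeComb(c1: (Int, Int), c2: (Int, Int)) = (c1._1 + c2._1, c1._2 + c2._2)
var avg = scores.combineByKey(createComb, mergeVal, mergeComb).mapValues(p => p._1.toDouble / p._2)
avg.collect()
// Expected (order may vary): Array((ram,20.0), (shyam,25.0))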
Key-Value RDD
Questions - Set Operations
What will be the result of the following?
def cc (v): return ("[" , v , "]");
def mv (c, v): return c[0:-1] + (v, "]")
def mc(c1,c2): return c1[0:-1] + c2[1:]
mc(mv(cc(1), 2), cc(3))
Key-Value RDD
Questions - Set Operations
('[', 1, 2, 3, ']')
What will be the result of the following?
def cc (v): return ("[" , v , "]");
def mv (c, v): return c[0:-1] + (v, "]")
def mc(c1,c2): return c1[0:-1] + c2[1:]
mc(mv(cc(1), 2), cc(3))
Key-Value RDD
Questions - Set Operations
What will be the result of the following?
def cc (v): return ("[" , v , "]");
def mv (c, v): return c[0:-1] + (v, "]")
def mc(c1, c2): return c1[0:-1] + c2[1:]
rdd = sc.parallelize([("a", 1), ("b", 2), ("a", 3)])
rdd.combineByKey(cc,mv, mc).collect()
Key-Value RDD
Questions - Set Operations
[('a', ('[', 1, 3, ']')), ('b', ('[', 2, ']'))]
What will be the result of the following?
def cc (v): return ("[" , v , "]");
def mv (c, v): return c[0:-1] + (v, "]")
def mc(c1, c2): return c1[0:-1] + c2[1:]
rdd = sc.parallelize([("a", 1), ("b", 2), ("a", 3)])
rdd.combineByKey(cc,mv, mc).collect()
Key-Value RDD
Transformations on Pair RDDs
sortByKey(ascending=true, numPartitions=current partitions)
Sorts this RDD, which is assumed to consist of (key, value) pairs.
Key-Value RDD
>>> var tmp = List(('a', 1), ('b', 2), ('1', 3), ('d', 4), ('2', 5))
>>> sc.parallelize(tmp).sortByKey().collect()
Array((1,3), (2,5), (a,1), (b,2), (d,4))
Transformations on Pair RDDs
sortByKey(ascending=true, numPartitions=current partitions)
Sorts this RDD, which is assumed to consist of (key, value) pairs.
Key-Value RDD
>>> var tmp = List(('a', 1), ('b', 2), ('1', 3), ('d', 4), ('2', 5))
>>> sc.parallelize(tmp).sortByKey(true, 1).collect()
Array((1,3), (2,5), (a,1), (b,2), (d,4))
Transformations on Pair RDDs
sortByKey(ascending=true, numPartitions=current partitions)
Sorts this RDD, which is assumed to consist of (key, value) pairs.
Key-Value RDD
>>> var tmp = List(('a', 1), ('b', 2), ('1', 3), ('d', 4), ('2', 5))
>>> sc.parallelize(tmp).sortByKey(ascending=false,
numPartitions=2).collect()
Array((d,4), (b,2), (a,1), (2,5), (1,3))
Transformations on Pair RDDs
sortByKey(ascending=true, numPartitions=current partitions)
Sorts this RDD, which is assumed to consist of (key, value) pairs.
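To sort by value rather than by key, a common trick (not from the slides) is to swap each pair, sort, and swap back:
var byValue = sc.parallelize(tmp).map(p => (p._2, p._1)).sortByKey(false).map(p => (p._2, p._1))
byValue.collect()
// Array((2,5), (d,4), (1,3), (b,2), (a,1)) -- descending by the original value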
Key-Value RDD
Transformations on Pair RDDs
subtractByKey(other, numPartitions=None)
Return each (key, value) pair in self that has no pair with matching key in other.
>>> var x = sc.parallelize(List(("a", 1), ("b", 4), ("b", 5), ("a", 2)))
>>> var y = sc.parallelize(List(("a", 3), ("c", None)))
>>> x.subtractByKey(y).collect()
Array((b,4), (b,5))
Key-Value RDD
Transformations on Pair RDDs
join(other, numPartitions=None)
Return an RDD containing all pairs of elements with matching keys in self and
other.
Each pair of elements will be returned as a (k, (v1, v2)) tuple, where (k, v1) is in
self and (k, v2) is in other.
Key-Value RDD
Transformations on Pair RDDs
join(other, numPartitions=None)
Return an RDD containing all pairs of elements with matching keys in self and
other.
Each pair of elements will be returned as a (k, (v1, v2)) tuple, where (k, v1) is in
self and (k, v2) is in other.
>>> var x = sc.parallelize(List(("a", 1), ("b", 4), ("c", 5)))
>>> var y = sc.parallelize(List(("a", 2), ("a", 3), ("d", 7)))
>>> x.join(y).collect()
Array((a,(1,2)), (a,(1,3)))
Key-Value RDD
Transformations on Pair RDDs
leftOuterJoin(other, numPartitions=None)
Perform a left outer join of self and other.
For each element (k, v) in self, the resulting RDD will either contain all pairs (k, (v,
w)) for w in other, or the pair (k, (v, None)) if no elements in other have key k.
Hash-partitions the resulting RDD into the given number of partitions.
>>> var x = sc.parallelize(List(("a", 1), ("b", 4)))
>>> var y = sc.parallelize(List(("a", 2)))
>>> x.leftOuterJoin(y).collect()
Array((a,(1,Some(2))), (b,(4,None)))
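Note that in the Scala API the value coming from the other RDD is wrapped in an Option (Some / None), unlike the plain values shown in the PySpark-style outputs on the next slides. A small sketch of unwrapping it, using 0 as an assumed default:
x.leftOuterJoin(y).mapValues { case (v, opt) => (v, opt.getOrElse(0)) }.collect()
// Array((a,(1,2)), (b,(4,0)))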
Key-Value RDD
x = sc.parallelize(
[(1, "sandeep"), ("2", "sravani")])
y = sc.parallelize(
[(1, "ryan"), (3, "giri")])
x.leftOuterJoin(y).collect()
Questions - Set Operations
What will be the result of the following?
LEFT OUTER JOIN
Key-Value RDD
Questions - Set Operations
[(1, ('sandeep', 'ryan')), ('2', ('sravani', None))]
What will be the result of the following?
x = sc.parallelize(
[(1, "sandeep"), ("2", "sravani")])
y = sc.parallelize(
[(1, "ryan"), (3, "giri")])
x.leftOuterJoin(y).collect()
LEFT OUTER JOIN
Key-Value RDD
Transformations on Pair RDDs
rightOuterJoin(other, numPartitions=None)
Perform a right outer join of self and other.
For each element (k, w) in other, the resulting RDD will either contain all pairs
(k, (v, w)) for v in this, or the pair (k, (None, w)) if no elements in self have key
k.
Hash-partitions the resulting RDD into the given number of partitions.
>>> x = sc.parallelize([("a", 1), ("b", 4)])
>>> y = sc.parallelize([("a", 2)])
>>> y.rightOuterJoin(x).collect()
[('a', (2, 1)), ('b', (None, 4))]
Key-Value RDD
x = sc.parallelize(
[(1, "sandeep"), ("2", "sravani")])
y = sc.parallelize(
[(1, "ryan"), (3, "giri")])
x.rightOuterJoin(y).collect()
Questions - Set Operations
What will be the result of the following?
RIGHT OUTER JOIN
Key-Value RDD
Questions - Set Operations
[(1, ('sandeep', 'ryan')), (3, (None, 'giri'))]
What will be the result of the following?
x = sc.parallelize(
[(1, "sandeep"), ("2", "sravani")])
y = sc.parallelize(
[(1, "ryan"), (3, "giri")])
x.rightOuterJoin(y).collect()
RIGHT OUTER JOIN
Key-Value RDD
Transformations on Pair RDDs
cogroup(other, numPartitions=None)
For each key k in self or other, return a resulting RDD that contains a tuple with the
list of values for that key in self as well as other.
>>> var x = sc.parallelize(List(("a", 1), ("b", 4)))
>>> var y = sc.parallelize(List(("a", 2), ("a", 3)))
>>> var cg = x.cogroup(y)
>>> var cgl = cg.collect()
Array((a,(CompactBuffer(1),CompactBuffer(2, 3))),
(b,(CompactBuffer(4),CompactBuffer())))
This is basically the same as:
((a, ([1], [2,3])), (b, ([4], [])))
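cogroup is the primitive the join family is built on. As a rough sketch (not from the slides), an inner join over the same x and y can be derived from it:
x.cogroup(y).flatMapValues { case (vs, ws) => for (v <- vs; w <- ws) yield (v, w) }.collect()
// Array((a,(1,2)), (a,(1,3))) -- the same pairs x.join(y) returns; b drops out because its second buffer is empty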
Key-Value RDD
Actions Available on Pair RDDs
countByKey()
Count the number of elements for each key, and return the result to the master as a
dictionary.
>>> var rdd = sc.parallelize(List(("a", 1), ("b", 1), ("a", 1), ('a', 10)))
>>> rdd.countByKey()
Map(a -> 2, a -> 1, b -> 1)
(Here 'a', a Char, is a different key from the String "a", which is why a appears twice in the result.)
Key-Value RDD
Actions Available on Pair RDDs
lookup(key)
Return the list of values in the RDD for key. This operation is done efficiently if the
RDD has a known partitioner by only searching the partition that the key maps to.
var lr = sc.parallelize(1 to 1000).map(x => (x, x) )
lr.lookup(42)
Job 24 finished: lookup at <console>:28, took 0.037469 s
WrappedArray(42)
var sorted = lr.sortByKey()
sorted.lookup(42) // fast
Job 21 finished: lookup at <console>:28, took 0.008917 s
ArrayBuffer(42)
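The same speed-up applies to any RDD with a known partitioner. A minimal sketch, assuming the standard HashPartitioner with an arbitrary 10 partitions:
import org.apache.spark.HashPartitioner
var partitioned = lr.partitionBy(new HashPartitioner(10)).cache()
partitioned.lookup(42) // only the one partition that key 42 hashes to is scanned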
Thank you!
Basics of RDD