
Spark Core


Spark Core is the foundation of the overall project. It provides distributed task dispatching,
scheduling, and basic I/O functionalities, exposed through an application programming
interface (for Java, Python, Scala, .NET[16] and R) centered on the RDD abstraction (the Java
API is available for other JVM languages, but is also usable for some other non-JVM languages
that can connect to the JVM, such as Julia[17]). This interface mirrors a functional/higher-
order model of programming: a "driver" program invokes parallel operations such as map,
filter or reduce on an RDD by passing a function to Spark, which then schedules the
function's execution in parallel on the cluster.[2] These operations, and additional ones such
as joins, take RDDs as input and produce new RDDs. RDDs are immutable and their
operations are lazy; fault-tolerance is achieved by keeping track of the "lineage" of each RDD
(the sequence of operations that produced it) so that it can be reconstructed in the case of
data loss. RDDs can contain any type of Python, .NET, Java, or Scala objects.
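As a minimal illustrative sketch of this laziness and lineage tracking (assuming an existing SparkContext named sc; the data are made up), the transformations below only record how each RDD is derived, and nothing is computed until an action such as count is called:

val nums = sc.parallelize(1 to 1000000)     // Distribute a local range as an RDD.
val evens = nums.filter(_ % 2 == 0)         // Lazy: only the lineage step is recorded.
val squares = evens.map(n => n.toLong * n)  // Still lazy: another step is appended to the lineage.
val total = squares.count()                 // Action: triggers execution of the whole lineage on the cluster.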

Besides the RDD-oriented functional style of programming, Spark provides two restricted
forms of shared variables: broadcast variables reference read-only data that needs to be
available on all nodes, while accumulators can be used to program reductions in an
imperative style.[2]
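A minimal sketch of both forms, assuming an existing SparkContext named sc; the word list and stop-word set are invented for illustration:

val words = sc.parallelize(Seq("the", "spark", "driver", "the", "cluster"))
val stopWords = sc.broadcast(Set("the", "a", "of"))  // Broadcast variable: read-only data shipped once to each node.
val skipped = sc.longAccumulator("skipped words")    // Accumulator: tasks only add to it; the driver reads the total.

val kept = words.filter { w =>
  val keep = !stopWords.value.contains(w)
  if (!keep) skipped.add(1)                          // Imperative-style reduction via the accumulator.
  keep
}
println(s"kept: ${kept.count()}, skipped: ${skipped.value}") // The action triggers execution and fills the accumulator.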

A typical example of RDD-centric functional programming is the following Scala program that
computes the frequencies of all words occurring in a set of text files and prints the most
common ones. Each map, flatMap (a variant of map) and reduceByKey takes an anonymous
function that performs a simple operation on a single data item (or a pair of items), and
applies its argument to transform an RDD into a new RDD.

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setAppName("wiki_test")    // Create a Spark configuration object.
val sc = new SparkContext(conf)                       // Create a Spark context.
val data = sc.textFile("/path/to/somedir")            // Read the files under "somedir" into an RDD of lines.
val tokens = data.flatMap(_.split(" "))               // Split each line into tokens (words).
val wordFreq = tokens.map((_, 1)).reduceByKey(_ + _)  // Pair each token with a count of one, then sum the counts per word type.
wordFreq.sortBy(s => -s._2).map(x => (x._2, x._1)).top(10) // Swap word and count to sort by count, then take the top 10 words.
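Since top returns an ordinary local array on the driver, the most common words could then be printed as follows; this is a small illustrative addition rather than part of the original example:

val topWords = wordFreq.sortBy(s => -s._2).map(x => (x._2, x._1)).top(10) // Same expression as above, kept as a local array.
topWords.foreach { case (count, word) => println(s"$word: $count") }      // Print each word with its count.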
