Data Analytics with Spark

Peter Vanroose
ABIS Training & Consulting

GSE NL Nat.Conf. "Digital Transformation"
16 November 2017, Almere - Van Der Valk
Outline :
• Data analytics - history
• Spark and its predecessors
• Spark and Scala - some examples
• Spark libraries: SQL, streaming, MLlib, GraphX
1. Data analytics

Wikipedia:
- process of inspecting, cleansing, transforming, modeling data
  => discover info, suggest conclusions, support decision-making
- Related terms: business intelligence (BI); data mining; statistics

Business Intelligence (BI):
- relies heavily on aggregation; focus on business information

Data mining:
- modeling & knowledge discovery for predictive purposes
RDBMS (e.g. Db2) 2.1

• On-Line Analytical Processing (OLAP)
  - aggregation (SUM, COUNT, AVG) + grouping sets

• can easily answer BI questions, like:
  - turnover, revenue => overview per year, month, region, product
  - TOP-10 analysis (10 best customers, 10 most promising new markets,
    10 least profitable products, ...)
    => requires "total sorting" (= n log n) + showing just the first part
    could use pre-sorted data (indexes) => not always possible!
Statistical software (e.g. SPSS, R) 2.2

• graphical possibilities (better than Excel)
  - scatter plot (correlation), heat map, ...
  - histogram (frequency distribution), bar chart (ranking), pie chart, ...
  - time series (line chart)
Typical "machine learning" applications 2.3

• Examples:
  - spam filters
  - virus scanners
  - break-in detection
Enormous amounts of data 3.1

• the 3 Vs => need for a new framework?
  - volume (TB / PB / ZB / YB)
  - velocity (real-time analysis)
  - variety (unstructured & semi-structured data)

• "Big Data" => Hadoop
  - assumes a cluster of commodity hardware (sharding - scale out)
  - fail-safe because of redundancy

• but ... fewer data-consistency guarantees
  - because of the CAP theorem (Brewer, 2000):
    can only have 2 out of 3: consistency, availability, partition tolerance
  - BASE instead of ACID

• Hadoop's analytical framework: MapReduce
  => "access path" responsibility lies with the programmer
Hadoop 3.2

• Apache project (http://hadoop.apache.org/)
HDFS 3.4

• Hadoop Distributed File System

• storage abstraction layer
  - a single HDFS "file" is actually a set of fragments / partitions
  - residing on different cluster nodes
  - with duplicates (replication factor; default: 3)

• end user sees a "normal" hierarchical file system
  hdfs:/user/peter/myfile.txt
  - command-line interface (Linux style) & API
    · put & get files between client & cluster
    · move/rename, remove, append to
    · head & tail
    · no update !
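
For illustration (a minimal sketch; the file and directory names are just examples), a few of these commands as they would be typed on a client machine with the hadoop CLI installed:

[Linux]$ hadoop fs -put mytext.txt /user/peter/mytext.txt            # client -> cluster
[Linux]$ hadoop fs -ls /user/peter                                   # list a directory
[Linux]$ hadoop fs -cat /user/peter/mytext.txt                       # show file contents
[Linux]$ hadoop fs -tail /user/peter/mytext.txt                      # last kilobyte of the file
[Linux]$ hadoop fs -mv /user/peter/mytext.txt /user/peter/old.txt    # rename / move
[Linux]$ hadoop fs -get /user/peter/old.txt copy.txt                 # cluster -> client
[Linux]$ hadoop fs -rm /user/peter/old.txt                           # remove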
Yarn 3.5
Pig & Hive 3.6

• Pig (an Apache project - http://pig.apache.org/)

  - High-level language interface, compiles into Hadoop MapReduce
  - Easily readable formulation for standard design patterns
  - Data is represented as "objects", "variables"
  - Example:

    logs  = LOAD 'mytext.txt' USING PigStorage(' ');        /* space-delimited input */
    data  = FOREACH logs GENERATE $0 AS ip, $6 AS webpage;  /* fields 0 and 6 */
    valid = FILTER data BY ip MATCHES '^10(\\.\\d+){3}$';   /* a valid IP address */
    STORE valid INTO 'weblog.out';

• Hive (also an Apache project)

  - SQL-like interface
  - like Pig, translates "standard" questions into an optimal MapReduce implementation
  - Example:

    CREATE TABLE weblog (ip STRING, ..., webpage STRING)
      ROW FORMAT DELIMITED FIELDS TERMINATED BY ' ' ;
    SELECT webpage, COUNT(*)
    FROM weblog WHERE ip LIKE '10.%'
    GROUP BY webpage;
Spark 3.7

• has learned from Big Data history (esp. Hadoop, Hive) & from
  R, Python, Jupyter Notebook, Zeppelin, Mahout, Storm, Avro, ...

• tries to combine the best elements of all its predecessors

• top-down approach instead of bottom-up:

  - good, simple user interface; prevents making "stupid mistakes":
    · fast prototyping: command interface (interactive)
    · same programming language for the final algorithm
      (e.g. to run multiple times, or in a continuous setup)
    · a data-flow pipeline via immutable objects & their methods ==> functional programming

  - simple integration with existing frameworks:
    · data sources & sinks: HDFS, local filesystem, URLs, data streams
    · Hadoop framework (which runs on Java and hence on the JVM)
    · Yarn or a similar resource negotiator / workload balancer
    · simple RDBMS interface; connections to Cassandra, MongoDB, ...

• better than its predecessors: e.g. in-memory where possible
• Spark from scratch:

  - no need for a cluster
    · develop & test on a stand-alone system (local or cloud)
    · your Spark prototype programs will easily deploy on a cluster

  - download & install the software on a Linux system
    · or download a preconfigured virtual image (VMware / VirtualBox)
      e.g. CDH from https://www.cloudera.com/downloads/
      or HDP from https://hortonworks.com/downloads/

  - a typical Spark installation also contains
    · Hadoop (with HDFS, MapReduce, Yarn) or Mesos
Spark command-line 4.1

[Linux]$ spark-shell
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel).
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.6.0
      /_/

Using Scala version 2.10.5 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_67)
Type in expressions to have them evaluated.
Type :help for more information.
Spark context available as sc (master = local[*], app id = local-1510673299900).
SQL context available as sqlContext.

scala>
Spark design 4.2

• A unified computing engine + a set of libraries (& APIs)
Spark history 4.3

• 2009-2012: Berkeley research project (AMPLab)
A motivating example 4.4

Suppose we have an HDFS file mytext.txt, containing some text.
Count the word frequencies in the file, and write the answer to HDFS file count.out:

[Linux]$ wget -O mytext.txt https://nl.lipsum.com/feed/html?amount=150
[Linux]$ hadoop fs -put mytext.txt
[Linux]$ spark-shell
scala> val textFile = sc.textFile("hdfs:/user/peter/mytext.txt")
textFile: org.apache.spark.rdd.RDD[String] = hdfs:/user/peter/mytext.txt MapPartitionsRDD[1]
scala> val words = textFile.flatMap( line => line.split(" ") )
words: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[2]
scala> val words_as_key_val = words.map( word => (word, 1) )   // or just: map( (_, 1) )
words_as_key_val: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[3]
scala> val words_with_counts = words_as_key_val.reduceByKey( (v1,v2) => v1 + v2 )
words_with_counts: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[4]
scala> words_with_counts.saveAsTextFile("hdfs:/user/peter/count.out")

[Linux]$ hadoop fs -ls count.out
-rw-r--r--  1 peter users    0 2017-11-16 15:23 count.out/_SUCCESS
-rw-r--r--  1 peter users 6395 2017-11-16 15:23 count.out/part-00000
-rw-r--r--  1 peter users 6262 2017-11-16 15:23 count.out/part-00001
[Linux]$ hadoop fs -cat count.out/*
(interdum,42)
(mi.,22)
(erat,60)
(fames,13)
(urna,48)
(nunc,,16)
<etc...>

[Linux]$ spark-shell    # do the same, using a single Spark (Scala) instruction:
scala> sc.textFile("hdfs:/user/peter/mytext.txt").flatMap(_.split(" ")).map( (_, 1) ).
     |   reduceByKey(_ + _).saveAsTextFile("hdfs:/user/peter/count2")
Transformations & actions 4.5

• Both can be applied to RDDs
  (or actually: RDDs have "methods" of both types)

• Transformations convert an RDD into a new RDD
  - since RDDs are immutable, they never change once created
  - the new RDD is not "instantiated":
    a transformation is just a dependency between two RDDs
  - multiple transformations can be applied to one RDD => DAG

• Only when an action is applied is the full dependency chain activated
  (including all intermediate transformations)
  - examples: write to a physical data stream; show on screen
  - the result of an action is not an RDD (but a local variable)

• On activation:
  - transformations can be combined into a single MapReduce step
  - notorious example: sorting followed by top-n filtering (see the sketch below)
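
As an illustration of that last point (a minimal sketch, not from the original slides; it reuses the words_with_counts RDD from the word-count example):

scala> // sortBy is a transformation: nothing is computed yet, only the DAG grows
scala> val sorted = words_with_counts.sortBy( pair => pair._2, ascending = false )
scala> // take is an action: it triggers the whole chain, and only 10 results are needed
scala> val top10 = sorted.take(10)
scala> // equivalent shortcut, keeping the top-10 selection on the cluster side:
scala> words_with_counts.top(10)(Ordering.by(_._2))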
Spark core 4.6

• Provides basic support for RDDs & basic transformations & actions

  - the "Spark context" (sc) is the user's "handle" to the cluster

  - RDD: immutable key-value list; stored on the cluster
    (on HDFS, or in a NoSQL database, or cached in memory, ...)

  - examples of transformations:

    read from file:       a = sc.textFile("source name or URL")

    create from local:    p = sc.parallelize(Array(2,3,5,7,11))
                          l = sc.range(1,1001)

    shorten (filter) the RDD list, e.g. based on a text search criterion:
                          b = a.filter( x => x.contains("search-term") )
    (note the "=>" notation (lambda expression): the filter argument is a function)

    "vertical" transformation:
    e.g. split into words, take the 5th element, take the larger of two, ...:
                          c = b . map( x => x.split(" ") )       // treat rows separately
                          d = c . map( x => if (x(0) > x(1)) x(0) else x(1) )
                          e = b . flatMap( x => x.split(" ") )   // "flat list"
                          g = e . map(x => (x,1)) . reduceByKey( (v1,v2) => v1+v2 )
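
For completeness, a few of the corresponding actions (a minimal sketch, not from the original slides; it assumes the RDDs p and g defined above, and the output path is only an example):

                          g.count()                    // number of distinct words, as a local Long
                          g.take(5)                    // first 5 (word, count) pairs, as a local Array
                          g.collect()                  // the whole result locally -- only for small RDDs!
                          p.reduce( (x,y) => x + y )   // 2+3+5+7+11 = 28, a plain Int
                          g.saveAsTextFile("hdfs:/user/peter/counts.out")   // write back to HDFS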
Spark SQL, DataFrames 4.7

• DataFrame:

  - name and concept come from R

  - is a sort of RDD (distributed data value):
    * is like an RDBMS table: with rows & columns
    * columns have names; default names: _1, _2, etc.
    * in contrast to an RDD, storage is column-wise

  - an RDD can be converted to a DataFrame with the method toDF()

  - more prominent since Spark version 2.x

• Spark SQL

Example:
[Linux]$ spark-shell
scala> val courses = sc.parallelize(Array(
     |   (1067,"Db2 for z/OS fundamentals",3,475.00),
     |   (  87,"SQL workshop",2,450.00),
     |   (1686,"Big data in practice using Spark",2,500.00),
     |   (  25,"SAS programming fundamentals",3,450.00) ) )
courses: org.apache.spark.rdd.RDD[(Int,String,Int,Double)] = ParallelCollectionRDD[1]
scala> val coursetable = courses.toDF("cid","ctitle","cdur","cdprice")
coursetable: org.apache.spark.sql.DataFrame = [cid:int, ctitle:string, cdur:int, cdprice:double]
scala> coursetable.show()
+----+--------------------+----+-------+
| cid|              ctitle|cdur|cdprice|
+----+--------------------+----+-------+
|1067|Db2 for z/OS fund...|   3|  475.0|
|  87|        SQL workshop|   2|  450.0|
|1686|Big data in pract...|   2|  500.0|
|  25|SAS programming f...|   3|  450.0|
+----+--------------------+----+-------+
scala> val cheap = coursetable .where("cdprice < 500") .filter(col("ctitle").like("%Db2%"))
// Only from here on, we start using the Spark SQL library:
scala> coursetable.registerTempTable("courses")
scala> val tot = sqlContext.sql("SELECT sum(cdur*cdprice) AS total FROM courses WHERE cdprice < 500")
tot: org.apache.spark.sql.DataFrame = [total: double]
scala> tot.collect()
res2: Array[org.apache.spark.sql.Row] = Array([3675.0])
Spark APIs 4.8

• Production applications should run in (e.g.) the JVM
  and access the cluster through e.g. Yarn, Mesos, or stand-alone

• The production version (a compiled Scala program)
  should not differ too much from the fast-prototyping version
  (created in the interactive spark-shell)

• Example:

  import org.apache.spark.SparkContext
  import org.apache.spark.SparkConf

  object MyProg {
    def main(args: Array[String]) {
      val conf = new SparkConf().setAppName("MyProg").setMaster("local[4]")
      val context = new SparkContext(conf)
      val textFile = context.textFile(args(0))
      val words = textFile.flatMap( line => line.split(" ") )
      val words_as_key_val = words.map( word => (word, 1) )
      val words_with_counts = words_as_key_val.reduceByKey( (v1,v2) => v1 + v2 )
      words_with_counts.saveAsTextFile(args(1))
      context.stop()
    }
  }
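
As a usage sketch (not from the original slides; it assumes the program has been packaged into a jar, e.g. with sbt, and the jar name and file paths are only examples), such a compiled application would typically be launched with spark-submit:

  [Linux]$ sbt package
  [Linux]$ spark-submit --class MyProg --master local[4] \
               target/scala-2.10/myprog_2.10-1.0.jar \
               hdfs:/user/peter/mytext.txt hdfs:/user/peter/count3.out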
Similar programming interfaces exist for:

• Java 8

  - the program will look very similar to the Scala version ...
  - running in the JVM is 100% identical to running a Scala program

• Python

  - is an interpreted language => no compiling necessary
  - interactive or non-interactive Python script:

    from pyspark import SparkContext, SparkConf
    conf = SparkConf().setAppName("MyProg").setMaster("local[4]")
    sc = SparkContext(conf=conf)
    textFile = sc.textFile("hdfs:/user/peter/mytext.txt")
    <etc...>

• R

  - is an interpreted language => no compiling necessary
  - interactive or non-interactive R script:

    install.packages("sparkR", dep=TRUE)   # needed only once
    library(sparkR)                        # optionally "import" it
    sc <- sparkR.init()
    sqlContext <- sparkRSQL.init(sc)
    <etc...>
Spark Streaming 4.9

• For data "in motion": live data streams
  (not stored in e.g. HDFS)

  - examples: Twitter feeds, audio/video streams
    typically through sockets or TCP/IP ports

  - supported sources include Kafka, Flume, Twitter, Kinesis, ...

  - data is not stored any longer than needed for processing

  - data is "cut up" into batches of a given size & given overlap

• DStream (discretized stream) object: is a sequence of RDDs

• Example:

  import org.apache.spark.streaming._
  val ssc = new StreamingContext(sc, Seconds(1))          // batch interval: 1 second
  val lines = ssc.socketTextStream("localhost", 50000)    // port 50000 on localhost
  val words = lines.flatMap(_.split(" "))                 // a DStream object
  val words_with_counts = words.map((_, 1)).reduceByKey(_ + _)
  words_with_counts.print()
  // the above will run once a second:
  ssc.start() ; ssc.awaitTermination()
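
To try this out locally (an illustrative suggestion, not part of the original slides), one can feed text lines into the port the example listens on, e.g. with the netcat utility, and watch the per-second word counts appear in the spark-shell:

  [Linux]$ nc -lk 50000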
MLlib 4.10

• collection of Machine Learning algorithms:
  · basic statistics
  · classification & regression (model fitting)
  · unsupervised learning
  · clustering
  · pattern mining
  · and much more!

Example:

  // start from a DataFrame with columns "label" and "features" (required names)
  val mydata = sqlContext.read.format("libsvm").load("mydata.csv")
  mydata: org.apache.spark.sql.DataFrame = [label: double, features: vector]
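
A possible continuation (a minimal sketch, not from the original slides; it assumes mydata holds a binary-classification data set) fits a logistic-regression model with the spark.ml API and applies it back to the same data:

  import org.apache.spark.ml.classification.LogisticRegression

  // configure the estimator (the hyper-parameter values are only illustrative)
  val lr = new LogisticRegression().setMaxIter(10).setRegParam(0.01)

  // fit the model on the "label" / "features" DataFrame loaded above
  val model = lr.fit(mydata)

  // add a "prediction" column and inspect the first rows
  model.transform(mydata).select("label", "prediction").show(5)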
GraphX 4.11

• Spark library; contains functions for processing graphs:

  - examples:
    · web pages & their hyperlinks (href)
    · social graphs

  - a Graph needs 2 RDDs for its representation: Vertices & Edges

  - both Vertices & Edges have "attributes" (data type e.g. String)

  import org.apache.spark.graphx._
  import org.apache.spark.rdd.RDD

  val my_vertices : RDD[(VertexId, String)] = sc.textFile(...).map(...)
  val my_edges : RDD[Edge[String]] = sc.textFile(...).map(...)
  val my_graph = Graph(my_vertices, my_edges)
  // apply the famous Google PageRank iterative algorithm:
  val ranks = my_graph.pageRank(0.0001).vertices
  // Join the ranks with the usernames
  val users = sc.textFile("users.txt").map { line => val fields = line.split(",")
                                              (fields(0).toLong, fields(1)) }
  val ranksByUser = users.join(ranks).map { case (id, (name, rank)) => (name, rank) }
  // Print the result
  println(ranksByUser.collect().mkString("\n"))
TRAINING & CONSULTING

Thank you!
Peter Vanroose
ABIS Training & Consulting
[email protected]