Data Analytics with Spark

Peter Vanroose
ABIS Training & Consulting

GSE NL Nat.Conf. "Digital Transformation"
16 November 2017, Almere - Van Der Valk
Outline :
• Data analytics - history
• Spark and its predecessors
• Spark and Scala - some examples
• Spark libraries: SQL, streaming, MLlib, GraphX
1. Data analytics

Wikipedia:
- process of inspecting, cleansing, transforming, modeling data
  => discover info, suggest conclusions, support decision-making
- Related terms: business intelligence (BI); data mining; statistics

Business Intelligence (BI):
- relies heavily on aggregation; focus on business information

Data mining:
- modeling & knowledge discovery for predictive purposes
RDBMS (e.g. Db2) 2.1

• On-Line Analytical Processing (OLAP)
  - aggregation (SUM, COUNT, AVG) + grouping sets

• can easily answer BI questions, like:
  - turnover, revenue => overview per year, month, region, product
  - TOP-10 analysis (10 best customers, 10 most promising new markets,
    10 least profitable products, ...)
    => requires "total sorting" (= n log n) + showing just the first part
    could use pre-sorted data (indexes) => not always possible!
Statistical software (e.g. SPSS, R) 2.2

• graphical possibilities (better than Excel)
  - scatter plot (correlation), heat map, ...
  - histogram (frequency distribution), bar chart (ranking), pie chart, ...
  - time series (line chart)
Typical "machine learning" applications 2.3

• Examples:
  - spam filters
  - virus scanners
  - break-in detection
Enormous amounts of data 3.1

• the 3 Vs => need for a new framework?
  - volume (TB / PB / ZB / YB)
  - velocity (real-time analysis)
  - variety (unstructured & semi-structured data)

• "Big Data" => Hadoop
  - assumes a cluster of commodity hardware (sharding - scale out)
  - fail-safe because of redundancy

• but ... fewer data-consistency guarantees
  - because of the CAP theorem (Brewer, 2000):
    can only have 2 out of 3: consistency, availability, partition tolerance
  - BASE instead of ACID

• Hadoop's analytical framework: MapReduce
  => "access path" responsibility lies with the programmer
Hadoop 3.2

• Apache project (http://hadoop.apache.org/)
HDFS 3.4

• Hadoop Distributed File System

• storage abstraction layer
  - a single HDFS "file" is actually a set of fragments / partitions
  - residing on different cluster nodes
  - with duplicates (replication factor; default: 3)

• end user sees a "normal" hierarchical file system
  hdfs:/user/peter/myfile.txt
  - command-line interface (Linux style) & API
    · put & get files between client & cluster
    · move/rename, remove, append to
    · head & tail
    · no update !
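
For illustration (a minimal sketch; the file and directory names are just examples), a few of these commands as they would be typed on a client machine with the hadoop CLI installed:

[Linux]$ hadoop fs -put mytext.txt /user/peter/mytext.txt            # client -> cluster
[Linux]$ hadoop fs -ls /user/peter                                   # list a directory
[Linux]$ hadoop fs -cat /user/peter/mytext.txt                       # show file contents
[Linux]$ hadoop fs -tail /user/peter/mytext.txt                      # last kilobyte of the file
[Linux]$ hadoop fs -mv /user/peter/mytext.txt /user/peter/old.txt    # rename / move
[Linux]$ hadoop fs -get /user/peter/old.txt copy.txt                 # cluster -> client
[Linux]$ hadoop fs -rm /user/peter/old.txt                           # remove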
Yarn 3.5
Pig & Hive 3.6

• Pig (an Apache project - http://pig.apache.org/)

  - High-level language interface, compiles into Hadoop MapReduce
  - Easily readable formulation for standard design patterns
  - Data is represented as "objects", "variables"
  - Example:

    logs  = LOAD 'mytext.txt' USING PigStorage(' ');        /* space-delimited input */
    data  = FOREACH logs GENERATE $0 AS ip, $6 AS webpage;  /* fields 0 and 6 */
    valid = FILTER data BY ip MATCHES '^10(\\.\\d+){3}$';   /* a valid IP address */
    STORE valid INTO 'weblog.out';

• Hive (also an Apache project)

  - SQL-like interface
  - like Pig, translates "standard" questions into an optimal MapReduce implementation
  - Example:

    CREATE TABLE weblog (ip STRING, ..., webpage STRING)
      ROW FORMAT DELIMITED FIELDS TERMINATED BY ' ' ;
    SELECT webpage, COUNT(*)
    FROM weblog WHERE ip LIKE '10.%'
    GROUP BY webpage;
Spark 3.7

• has learned from Big Data history (esp. Hadoop, Hive) & from
  R, Python, Jupyter Notebook, Zeppelin, Mahout, Storm, Avro, ...

• tries to combine the best elements of all its predecessors

• top-down approach instead of bottom-up:

  - good, simple user interface; prevents making "stupid mistakes":
    · fast prototyping: command interface (interactive)
    · same programming language for the final algorithm
      (e.g. to run multiple times, or in a continuous setup)
    · a data-flow pipeline via immutable objects & their methods ==> functional programming

  - simple integration with existing frameworks:
    · data sources & sinks: HDFS, local filesystem, URLs, data streams
    · Hadoop framework (which runs on Java and hence on the JVM)
    · Yarn or a similar resource negotiator / workload balancer
    · simple RDBMS interface; connections to Cassandra, MongoDB, ...

• better than its predecessors: e.g. in-memory where possible
• Spark from scratch:

  - no need for a cluster
    · develop & test on a stand-alone system (local or cloud)
    · your Spark prototype programs will easily deploy on a cluster

  - download & install the software on a Linux system
    · or download a preconfigured virtual image (VMware / VirtualBox)
      e.g. CDH from https://www.cloudera.com/downloads/
      or HDP from https://hortonworks.com/downloads/

  - a typical Spark installation also contains
    · Hadoop (with HDFS, MapReduce, Yarn) or Mesos
Spark command-line 4.1

[Linux]$ spark-shell
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel).
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.6.0
      /_/

Using Scala version 2.10.5 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_67)
Type in expressions to have them evaluated.
Type :help for more information.
Spark context available as sc (master = local[*], app id = local-1510673299900).
SQL context available as sqlContext.

scala>
Spark design 4.2

• A unified computing engine + a set of libraries (& APIs)
Spark history 4.3

• 2009-2012: Berkeley research project (AMPLab)
A motivating example 4.4

Suppose we have an HDFS file mytext.txt, containing some text.
Count the word frequencies in the file, and write the answer to HDFS file count.out:

[Linux]$ wget -O mytext.txt https://nl.lipsum.com/feed/html?amount=150
[Linux]$ hadoop fs -put mytext.txt
[Linux]$ spark-shell
scala> val textFile = sc.textFile("hdfs:/user/peter/mytext.txt")
textFile: org.apache.spark.rdd.RDD[String] = hdfs:/user/peter/mytext.txt MapPartitionsRDD[1]
scala> val words = textFile.flatMap( line => line.split(" ") )
words: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[2]
scala> val words_as_key_val = words.map( word => (word, 1) )   // or just: map( (_, 1) )
words_as_key_val: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[3]
scala> val words_with_counts = words_as_key_val.reduceByKey( (v1,v2) => v1 + v2 )
words_with_counts: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[4]
scala> words_with_counts.saveAsTextFile("hdfs:/user/peter/count.out")

[Linux]$ hadoop fs -ls count.out
-rw-r--r--  1 peter users    0 2017-11-16 15:23 count.out/_SUCCESS
-rw-r--r--  1 peter users 6395 2017-11-16 15:23 count.out/part-00000
-rw-r--r--  1 peter users 6262 2017-11-16 15:23 count.out/part-00001
[Linux]$ hadoop fs -cat count.out/*
(interdum,42)
(mi.,22)
(erat,60)
(fames,13)
(urna,48)
(nunc,,16)
<etc...>

[Linux]$ spark-shell    # do the same, using a single Spark (Scala) instruction:
scala> sc.textFile("hdfs:/user/peter/mytext.txt").flatMap(_.split(" ")).map( (_, 1) ).
     |   reduceByKey(_ + _).saveAsTextFile("hdfs:/user/peter/count2")
Transformations & actions 4.5

• Both can be applied to RDDs
  (or actually: RDDs have "methods" of both types)

• Transformations convert an RDD into a new RDD
  - since RDDs are immutable, they never change once created
  - the new RDD is not "instantiated":
    a transformation is just a dependency between two RDDs
  - multiple transformations can be applied to one RDD => DAG

• Only when an action is applied is the full dependency chain activated
  (including all intermediate transformations)
  - examples: write to a physical data stream; show on screen
  - the result of an action is not an RDD (but a local variable)

• On activation:
  - transformations can be combined into a single MapReduce step
  - notorious example: sorting followed by top-n filtering (see the sketch below)
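
As an illustration of that last point (a minimal sketch, not from the original slides; it reuses the words_with_counts RDD from the word-count example):

scala> // sortBy is a transformation: nothing is computed yet, only the DAG grows
scala> val sorted = words_with_counts.sortBy( pair => pair._2, ascending = false )
scala> // take is an action: it triggers the whole chain, and only 10 results are needed
scala> val top10 = sorted.take(10)
scala> // equivalent shortcut, keeping the top-10 selection on the cluster side:
scala> words_with_counts.top(10)(Ordering.by(_._2))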
Spark core 4.6

• Provides basic support for RDDs & basic transformations & actions

  - the "Spark context" (sc) is the user's "handle" to the cluster

  - RDD: immutable key-value list; stored on the cluster
    (on HDFS, or in a NoSQL database, or cached in memory, ...)

  - examples of transformations:

    read from file:       a = sc.textFile("source name or URL")

    create from local:    p = sc.parallelize(Array(2,3,5,7,11))
                          l = sc.range(1,1001)

    shorten (filter) the RDD list, e.g. based on a text search criterion:
                          b = a.filter( x => x.contains("search-term") )
    (note the "=>" notation (lambda expression): the filter argument is a function)

    "vertical" transformation:
    e.g. split into words, take the 5th element, take the larger of two, ...:
                          c = b . map( x => x.split(" ") )       // treat rows separately
                          d = c . map( x => if (x(0) > x(1)) x(0) else x(1) )
                          e = b . flatMap( x => x.split(" ") )   // "flat list"
                          g = e . map(x => (x,1)) . reduceByKey( (v1,v2) => v1+v2 )
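
For completeness, a few of the corresponding actions (a minimal sketch, not from the original slides; it assumes the RDDs p and g defined above, and the output path is only an example):

                          g.count()                    // number of distinct words, as a local Long
                          g.take(5)                    // first 5 (word, count) pairs, as a local Array
                          g.collect()                  // the whole result locally -- only for small RDDs!
                          p.reduce( (x,y) => x + y )   // 2+3+5+7+11 = 28, a plain Int
                          g.saveAsTextFile("hdfs:/user/peter/counts.out")   // write back to HDFS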
Spark SQL, DataFrames 4.7

• DataFrame:

  - name and concept come from R

  - is a sort of RDD (distributed data value):
    * is like an RDBMS table: with rows & columns
    * columns have names; default names: _1, _2, etc.
    * in contrast to an RDD, storage is column-wise

  - an RDD can be converted to a DataFrame with the method toDF()

  - more prominent since Spark version 2.x

• Spark SQL

Example:
[Linux]$ spark-shell
scala> val courses = sc.parallelize(Array(
     |   (1067,"Db2 for z/OS fundamentals",3,475.00),
     |   (  87,"SQL workshop",2,450.00),
     |   (1686,"Big data in practice using Spark",2,500.00),
     |   (  25,"SAS programming fundamentals",3,450.00) ) )
courses: org.apache.spark.rdd.RDD[(Int,String,Int,Double)] = ParallelCollectionRDD[1]
scala> val coursetable = courses.toDF("cid","ctitle","cdur","cdprice")
coursetable: org.apache.spark.sql.DataFrame = [cid:int, ctitle:string, cdur:int, cdprice:double]
scala> coursetable.show()
+----+--------------------+----+-------+
| cid|              ctitle|cdur|cdprice|
+----+--------------------+----+-------+
|1067|Db2 for z/OS fund...|   3|  475.0|
|  87|        SQL workshop|   2|  450.0|
|1686|Big data in pract...|   2|  500.0|
|  25|SAS programming f...|   3|  450.0|
+----+--------------------+----+-------+
scala> val cheap = coursetable .where("cdprice < 500") .filter(col("ctitle").like("%Db2%"))
// Only from here on, we start using the Spark SQL library:
scala> coursetable.registerTempTable("courses")
scala> val tot = sqlContext.sql("SELECT sum(cdur*cdprice) AS total FROM courses WHERE cdprice < 500")
tot: org.apache.spark.sql.DataFrame = [total: double]
scala> tot.collect()
res2: Array[org.apache.spark.sql.Row] = Array([3675.0])
Spark APIs 4.8

• Production applications should run in (e.g.) the JVM
  and access the cluster through e.g. Yarn, Mesos, or stand-alone

• The production version (a compiled Scala program)
  should not differ too much from the fast-prototyping version
  (created in the interactive spark-shell)

• Example:

  import org.apache.spark.SparkContext
  import org.apache.spark.SparkConf

  object MyProg {
    def main(args: Array[String]) {
      val conf = new SparkConf().setAppName("MyProg").setMaster("local[4]")
      val context = new SparkContext(conf)
      val textFile = context.textFile(args(0))
      val words = textFile.flatMap( line => line.split(" ") )
      val words_as_key_val = words.map( word => (word, 1) )
      val words_with_counts = words_as_key_val.reduceByKey( (v1,v2) => v1 + v2 )
      words_with_counts.saveAsTextFile(args(1))
      context.stop()
    }
  }
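
As a usage sketch (not from the original slides; it assumes the program has been packaged into a jar, e.g. with sbt, and the jar name and file paths are only examples), such a compiled application would typically be launched with spark-submit:

  [Linux]$ sbt package
  [Linux]$ spark-submit --class MyProg --master local[4] \
               target/scala-2.10/myprog_2.10-1.0.jar \
               hdfs:/user/peter/mytext.txt hdfs:/user/peter/count3.out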
Similar programming interfaces exist for:

• Java 8

  - the program will look very similar to the Scala version ...
  - running in the JVM is 100% identical to running a Scala program

• Python

  - is an interpreted language => no compiling necessary
  - interactive or non-interactive Python script:

    from pyspark import SparkContext, SparkConf
    conf = SparkConf().setAppName("MyProg").setMaster("local[4]")
    sc = SparkContext(conf=conf)
    textFile = sc.textFile("hdfs:/user/peter/mytext.txt")
    <etc...>

• R

  - is an interpreted language => no compiling necessary
  - interactive or non-interactive R script:

    install.packages("sparkR", dep=TRUE)   # needed only once
    library(sparkR)                        # optionally "import" it
    sc <- sparkR.init()
    sqlContext <- sparkRSQL.init(sc)
    <etc...>
Spark Streaming 4.9

• For data "in motion": live data streams
  (not stored in e.g. HDFS)

  - examples: Twitter feeds, audio/video streams
    typically through sockets or TCP/IP ports

  - supported sources include Kafka, Flume, Twitter, Kinesis, ...

  - data is not stored any longer than needed for processing

  - data is "cut up" into batches of a given size & given overlap

• DStream (discretized stream) object: is a sequence of RDDs

• Example:

  import org.apache.spark.streaming._
  val ssc = new StreamingContext(sc, Seconds(1))          // batch interval: 1 second
  val lines = ssc.socketTextStream("localhost", 50000)    // port 50000 on localhost
  val words = lines.flatMap(_.split(" "))                 // a DStream object
  val words_with_counts = words.map((_, 1)).reduceByKey(_ + _)
  words_with_counts.print()
  // the above will run once a second:
  ssc.start() ; ssc.awaitTermination()
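
To try this out locally (an illustrative suggestion, not part of the original slides), one can feed text lines into the port the example listens on, e.g. with the netcat utility, and watch the per-second word counts appear in the spark-shell:

  [Linux]$ nc -lk 50000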
MLlib 4.10

• collection of Machine Learning algorithms:
  · basic statistics
  · classification & regression (model fitting)
  · unsupervised learning
  · clustering
  · pattern mining
  · and much more!

Example:

  // start from a DataFrame with columns "label" and "features" (required names)
  val mydata = sqlContext.read.format("libsvm").load("mydata.csv")
  mydata: org.apache.spark.sql.DataFrame = [label: double, features: vector]
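
A possible continuation (a minimal sketch, not from the original slides; it assumes mydata holds a binary-classification data set) fits a logistic-regression model with the spark.ml API and applies it back to the same data:

  import org.apache.spark.ml.classification.LogisticRegression

  // configure the estimator (the hyper-parameter values are only illustrative)
  val lr = new LogisticRegression().setMaxIter(10).setRegParam(0.01)

  // fit the model on the "label" / "features" DataFrame loaded above
  val model = lr.fit(mydata)

  // add a "prediction" column and inspect the first rows
  model.transform(mydata).select("label", "prediction").show(5)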
GraphX 4.11

• Spark library; contains functions for processing graphs:

  - examples:
    · web pages & their hyperlinks (href)
    · social graphs

  - a Graph needs 2 RDDs for its representation: Vertices & Edges

  - both Vertices & Edges have "attributes" (data type e.g. String)

  import org.apache.spark.graphx._
  import org.apache.spark.rdd.RDD

  val my_vertices : RDD[(VertexId, String)] = sc.textFile(...).map(...)
  val my_edges : RDD[Edge[String]] = sc.textFile(...).map(...)
  val my_graph = Graph(my_vertices, my_edges)
  // apply the famous Google PageRank iterative algorithm:
  val ranks = my_graph.pageRank(0.0001).vertices
  // Join the ranks with the usernames
  val users = sc.textFile("users.txt").map { line => val fields = line.split(",")
                                              (fields(0).toLong, fields(1)) }
  val ranksByUser = users.join(ranks).map { case (id, (name, rank)) => (name, rank) }
  // Print the result
  println(ranksByUser.collect().mkString("\n"))
TRAINING & CONSULTING

Thank you!
Peter Vanroose
ABIS Training & Consulting
[email protected]