Java One 2017: Open Source Big Data in the Cloud: Hadoop, M/R, Hive, Spark and Kafka

Open Source Big Data in OPC
Edelweiss Kammermann
Frank Munz
Java One 2017

About Me
• Computer Engineer, BI and Data Integration Specialist
• Over 20 years of consulting and project management experience in Oracle technology
• Co-founder and Vice President of the Uruguayan Oracle User Group (UYOUG)
• Director of Community of LAOUC
• Head of BI Team CMS at ITConvergence
• Writer and frequent speaker at international conferences:
  – Collaborate, OTN Tour LA, UKOUG Tech & Apps, OOW, Rittman Mead BI Forum
• Oracle ACE Director

Uruguay

Dr. Frank Munz
• Founded munz & more in 2007
• 17 years Oracle Middleware, Cloud, and Distributed Computing
• Consulting and High-End Training
• Wrote two Oracle WLS books and one Cloud book

What is Big Data?
• Volume: the sheer amount of data
• Variety: the wide range of data formats and schemas, including unstructured and semi-structured data
• Velocity: the speed at which data is created or consumed
• Oracle adds a fourth V to this definition:
  – Value: data has intrinsic value, but it must be discovered

What is Oracle Big Data Cloud Compute Edition?
• Big Data platform that integrates the Oracle Big Data solution with open source tools
• Fully elastic
• Integrated with other PaaS services such as Database Cloud Service, MySQL Cloud Service, and Event Hub Cloud Service
• Access, data, and network security
• REST access to all the functionality

Big Data Cloud Service – Compute Edition (BDCS-CE)

BDCS-CE Notebook: Interactive Analysis
• Apache Zeppelin Notebook (version 0.7) to work with data interactively

What is Hadoop?
• An open source software platform for distributed storage and processing
• Manages huge volumes of unstructured data
• Parallel processing of large data sets
• Highly scalable
• Fault-tolerant
• Two main components:
  – HDFS: Hadoop Distributed File System, for storing information
  – MapReduce: programming framework that processes information

Hadoop Components: HDFS
• Stores the data on the cluster
• NameNode: block registry
• DataNode: holds the blocks themselves
• HDFS cartoon by Mvarshney
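
To make the NameNode / DataNode split concrete, here is a minimal sketch in plain Python. It is not HDFS code: the function name, the node names, the sizes, and the round-robin placement are all assumptions chosen for illustration (real HDFS placement is rack-aware). It only shows the idea that a file is cut into blocks and that the registry maps each block to the nodes holding a replica.

```python
def place_blocks(file_size, block_size, datanodes, replication=3):
    """Return a NameNode-style registry: block id -> DataNodes holding a replica."""
    if replication > len(datanodes):
        raise ValueError("replication factor exceeds number of DataNodes")
    n_blocks = -(-file_size // block_size)  # ceiling division
    registry = {}
    for block_id in range(n_blocks):
        # Simplified round-robin placement (assumption, not the HDFS policy):
        # replica r of block b goes to node (b + r) mod n.
        registry[block_id] = [
            datanodes[(block_id + r) % len(datanodes)] for r in range(replication)
        ]
    return registry


# Hypothetical 300 MB file, 128 MB blocks, four DataNodes
registry = place_blocks(300, 128, ["dn1", "dn2", "dn3", "dn4"], replication=3)
for block_id, nodes in registry.items():
    print(block_id, nodes)
```

Because every block lives on several nodes, losing one DataNode loses no data; the registry still lists the surviving replicas.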

Hadoop Components: MapReduce
• Retrieves data from HDFS
• A MapReduce program is composed of:
  – Map() method: performs filtering and sorting of the <key, value> inputs
  – Reduce() method: summarizes the <key, value> pairs provided by the Mappers
• Code can be written in many languages (Perl, Python, Java, etc.)
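
The Map / Reduce split can be sketched in a few lines of plain Python. This is a single-process illustration under assumptions, not Hadoop code: the function names and input lines are made up, and the `shuffle` step stands in for the grouping that the framework performs between the two phases.

```python
from collections import defaultdict


def map_phase(line):
    """Map(): emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group values by key, as the framework does between Map and Reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce(): summarize all values collected for one key."""
    return key, sum(values)


lines = ["big data in the cloud", "open source big data"]  # hypothetical input
mapped = [pair for line in lines for pair in map_phase(line)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
print(counts)
```

On a cluster, each mapper would process its own HDFS blocks and each reducer its own subset of keys; the logic per record stays the same.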

MapReduce Example

Code Example

#2
Hive

What is Hive?
• Open source data warehouse software on top of Apache Hadoop
• Analyze and query data stored in HDFS
• Structure the data into tables
• Tools for simple ETL
• SQL-like queries (HiveQL)
• Procedural language with HPL/SQL
• Metadata storage in an RDBMS

Hadoop & Hive Demo

Revisited: Map Reduce I/O
munz & more #23
Source: Hadoop Application Architecture Book

Spark
• Can be orders of magnitude faster than M/R
• Higher-level API in Scala, Java, or Python
• Runs standalone, on Hadoop, or on Mesos
• Principle: run an operation on all data
-> "Spark is the new MapReduce"
• See also: Apache Storm, etc.
• Uses RDDs, DataFrames, or Datasets
https://stackoverflow.com/questions/31508083/difference-between-dataframe-in-spark-2-0-i-e-datasetrow-and-rdd-in-spark
https://www.usenix.org/system/files/conference/nsdi12/nsdi12-final138.pdf

RDDs
Resilient Distributed Datasets
(DataFrames, by contrast, are collections of data grouped into named columns.)
Where do they come from?
• Read in: from HDFS, local FS, S3, HBase (supports text, JSON, Apache Parquet, sequence files)
• Parallelize: an existing collection
• Transform: another RDD
-> RDDs are immutable

Lazy Evaluation
Transformations (nothing is executed):
map(), flatMap(), reduceByKey(), groupByKey()
Actions (trigger execution):
collect(), count(), first(), takeOrdered(), saveAsTextFile(), …
http://spark.apache.org/docs/2.1.1/programming-guide.html
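
The difference between transformations and actions can be seen in a toy model. The sketch below is plain Python and not the Spark API: `LazyDataset` is a hypothetical class that only records each transformation and runs the recorded pipeline when an action (`collect`) is called. The `square` helper logs its calls so you can observe when the work actually happens.

```python
class LazyDataset:
    """Toy stand-in for an RDD: records transformations, runs them on an action."""

    def __init__(self, source, steps=()):
        self._source = source
        self._steps = steps  # tuple of functions: iterator -> iterator

    def map(self, func):
        # Transformation: returns a NEW dataset and executes nothing
        return LazyDataset(self._source, self._steps + (lambda it: map(func, it),))

    def filter(self, pred):
        # Transformation: also only recorded
        return LazyDataset(self._source, self._steps + (lambda it: filter(pred, it),))

    def collect(self):
        # Action: only now is the recorded pipeline executed
        it = iter(self._source)
        for step in self._steps:
            it = step(it)
        return list(it)


calls = []


def square(x):
    calls.append(x)  # side effect used to observe when work happens
    return x * x


ds = LazyDataset([1, 2, 3, 4]).map(square).filter(lambda v: v > 4)
print(calls)         # still empty: no transformation has run
print(ds.collect())  # the action triggers execution
print(calls)         # now every element has been processed
```

Each transformation returns a new object and leaves the original untouched, which mirrors the immutability of RDDs.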

Transformations
• map(func): Return a new distributed dataset formed by passing each element of the source through a function func.
• flatMap(func): Similar to map, but each input item can be mapped to 0 or more output items (so func should return a Seq rather than a single item).
• reduceByKey(func, [numTasks]): When called on a dataset of (K, V) pairs, returns a dataset of (K, V) pairs where the values for each key are aggregated using the given reduce function func, which must be of type (V,V) => V.
• groupByKey([numTasks]): When called on a dataset of (K, V) pairs, returns a dataset of (K, Iterable<V>) pairs.
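
To see what the four transformations compute, here is a single-machine model in plain Python. The `rdd_*` function names and the order records are hypothetical and are not part of Spark; the sketch only mirrors the input and output shapes described above, without any distribution or laziness.

```python
from collections import defaultdict
from functools import reduce


def rdd_map(data, func):
    """One output element per input element."""
    return [func(x) for x in data]


def rdd_flat_map(data, func):
    """Zero or more output elements per input element, flattened."""
    return [y for x in data for y in func(x)]


def rdd_group_by_key(pairs):
    """(K, V) pairs -> {K: [V, ...]}."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)


def rdd_reduce_by_key(pairs, func):
    """(K, V) pairs -> {K: V}, combining values with func of type (V, V) -> V."""
    return {k: reduce(func, vs) for k, vs in rdd_group_by_key(pairs).items()}


records = ["munich:5,montevideo:3", "montevideo:4"]  # hypothetical order lines
entries = rdd_flat_map(records, lambda rec: rec.split(","))
pairs = rdd_map(entries, lambda e: (e.split(":")[0], int(e.split(":")[1])))
print(rdd_group_by_key(pairs))
print(rdd_reduce_by_key(pairs, lambda a, b: a + b))
```

In Spark, reduceByKey is usually preferred over groupByKey followed by a manual sum, because it can combine values on each node before data is shuffled.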

Apache Zeppelin Notebook

Word Count and Histogram
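# t is assumed to be an RDD of text lines, e.g. loaded with sc.textFile(...)
res = (
    t.flatMap(lambda line: line.split(" "))  # split each line into words
     .map(lambda word: (word, 1))            # pair every word with a count of 1
     .reduceByKey(lambda a, b: a + b)        # sum the counts per word
)
res.takeOrdered(5, key=lambda x: -x[1])      # action: the five most frequent words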

Zeppelin Notebooks

Big Data Compute Service CE

Kafka
Partitioned, replicated commit log
Immutable log: messages with offsets 0 1 2 3 4 … n
[Diagram: one Producer writes to the log; Consumer A and Consumer B read from it]
https://www.quora.com/Kafka-writes-every-message-to-broker-disk-Still-performance-wise-it-is-better-than-some-of-the-in-memory-message-storing-message-queues-Why-is-that
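
The commit-log idea can be modeled in a few lines of plain Python. This is a sketch under assumptions, not Kafka client code: `PartitionLog` and `Consumer` are hypothetical classes for a single partition with no replication. It shows that messages are only ever appended, that each message is addressed by its offset, and that every consumer keeps its own position.

```python
class PartitionLog:
    """Toy single-partition log: append-only, messages addressed by offset."""

    def __init__(self):
        self._messages = []

    def append(self, message):
        self._messages.append(message)
        return len(self._messages) - 1  # offset of the new message

    def read(self, offset, max_messages=10):
        return self._messages[offset:offset + max_messages]


class Consumer:
    """Each consumer tracks its own offset; reading never removes messages."""

    def __init__(self, log):
        self._log = log
        self.offset = 0

    def poll(self, max_messages=10):
        batch = self._log.read(self.offset, max_messages)
        self.offset += len(batch)
        return batch


log = PartitionLog()
for event in ["order-1", "order-2", "order-3"]:  # hypothetical messages
    log.append(event)

consumer_a, consumer_b = Consumer(log), Consumer(log)
print(consumer_a.poll(max_messages=2))  # A reads the first two messages
print(consumer_b.poll())                # B independently reads all three
print(consumer_a.poll())                # A continues where it left off
```

Since consuming does not delete anything, a new consumer can start at offset 0 at any time and replay the full history.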

[Diagram: Broker1, Broker2, and Broker3 hold the partitions Topic A (1), Topic A (2), and Topic A (3) as leaders, which the Producer writes to, plus the replicas Repl A (1), Repl A (2), and Repl A (3) as followers. Three ZooKeeper nodes keep state and provide HA.]

Example for Stream / Table Duality
https://www.confluent.io/blog/publishing-apache-kafka-new-york-times/
- 1 topic
- 1 partition
- Contains every article published since 1851
- Multiple producers / consumers
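
Stream / table duality means that a table is what you get when you replay a stream of changes, and a stream is the change history of a table. A minimal sketch in plain Python, with hypothetical article keys and states, and with a None value used as a delete marker (an assumption modeled on how null values are treated in compacted Kafka topics):

```python
def table_from_stream(events):
    """Replay a change log in offset order; the latest value per key wins."""
    table = {}
    for key, value in events:
        if value is None:
            table.pop(key, None)  # a None value acts as a delete marker
        else:
            table[key] = value
    return table


# Hypothetical change stream of (key, value) events, oldest first
stream = [
    ("article-1", "draft"),
    ("article-2", "published"),
    ("article-1", "published"),
    ("article-2", None),
]
print(table_from_stream(stream))      # current state of the table
print(table_from_stream(stream[:2]))  # state as of an earlier offset
```

Replaying only a prefix of the log reconstructs the table as it was at that point, which is why keeping the full log in a single ordered partition is useful.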

Kafka Clients
SDKs (high / low level Kafka API)
- OOTB: Java, Scala
- Confluent: Python, C, C++
Connect (configuration only; integrate external systems)
- Confluent: HDFS sink, JDBC source, S3 sink, Elasticsearch sink
- Plugin .jar file
- JDBC: Change data capture (CDC)
Streams (data in motion; stream / table duality)
- Real-time data ingestion
- Microservices
- KSQL: SQL streaming engine for streaming ETL, anomaly detection, monitoring
- .jar file runs anywhere
REST (lightweight)
- Language agnostic
- Easy for mobile apps
- Easy to tunnel through FW etc.

Oracle Event Hub Cloud Service
• PaaS: Managed Kafka 0.10.2
• Two deployment modes
– Basic (Broker and ZK on 1 node)
– Recommended (distributed)
• REST Proxy
– Separate server(s) running REST Proxy

Event Hub Service

Ports
You must open ports to allow access for external clients:
• Kafka Broker (port from the OPC connect string)
• ZooKeeper (port 2181)

Scaling
horizontal / vertical (up)

Event Hub REST Interface
https://129.151.91.31:1080/restproxy/topics/a12345orderTopic
Service = Topic
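
A produce call against a topic URL like the one above could look roughly as follows. This is a hedged sketch, not the documented Event Hub API: it assumes the proxy accepts the Confluent REST Proxy v2 JSON format (the `records` envelope and the `application/vnd.kafka.json.v2+json` content type), the payload is made up, and `build_produce_request` is a hypothetical helper. Authentication is left out, and the request is only built, not sent.

```python
import json
from urllib.request import Request


def build_produce_request(base_url, topic, values):
    """Build (not send) a POST that would publish JSON values to one topic."""
    body = json.dumps({"records": [{"value": v} for v in values]}).encode("utf-8")
    return Request(
        url=f"{base_url}/topics/{topic}",
        data=body,
        method="POST",
        # Assumption: Confluent REST Proxy v2 JSON content type
        headers={"Content-Type": "application/vnd.kafka.json.v2+json"},
    )


req = build_produce_request(
    "https://129.151.91.31:1080/restproxy",
    "a12345orderTopic",
    [{"orderId": 1, "item": "book"}],  # hypothetical payload
)
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))
```

To actually send it you would pass `req` to `urllib.request.urlopen` with the credentials and TLS settings your service instance requires.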

Interesting to Know
• Event Hub topics are prefixed with the ID domain
• Topics with the ID domain prefix can also be created with the Kafka CLI
• Topics without the ID domain prefix are not shown in the OPC console

TL;DR #bigData #openSource #OPC
• Open source: entry point to the Oracle Big Data world
• Low(er) setup times
• Check for resource usage & limits in Big Data OPC
• BDCS-CE: managed Hadoop, Hive, Spark + Event Hub: Kafka
• Attend a hands-on workshop!
• Next level: Oracle Big Data tools
@EdelweissK
@FrankMunz

www.linkedin.com/in/frankmunz/
www.munzandmore.com/blog
facebook.com/cloudcomputingbook
facebook.com/weblogicbook
@frankmunz
youtube.com/weblogicbook
-> more than 50 webcasts
Don't be shy :)

email: ekammermann@itconvergence.com
Twitter: @EdelweissK

3 Membership Tiers
• Oracle ACE Director
• Oracle ACE
• Oracle ACE Associate
bit.ly/OracleACEProgram
500+ Technical Experts
Helping Peers Globally
Connect:
Nominate yourself or someone you know: acenomination.oracle.com
@oracleace
Facebook.com/oracleaces
oracle-ace_ww@oracle.com

Sign up for Free Trial
http://cloud.oracle.com
