Open Source Big Data in OPC
Edelweiss Kammermann
Frank Munz
Java One 2017
munz & more #2
© IT Convergence 2016. All rights reserved.
About Me
• Computer Engineer, BI and Data Integration Specialist
• Over 20 years of consulting and project management experience in Oracle technology
• Co-founder and Vice President of Uruguayan Oracle User Group (UYOUG)
• Director of Community of LAOUC
• Head of BI Team CMS at ITConvergence
• Writer and frequent speaker at international conferences:
– Collaborate, OTN Tour LA, UKOUG Tech & Apps, OOW, Rittman Mead BI Forum
• Oracle ACE Director
Uruguay
Dr. Frank Munz
• Founded munz & more in 2007
• 17 years Oracle Middleware, Cloud, and Distributed Computing
• Consulting and High-End Training
• Wrote two Oracle WLS and one Cloud book
Java One 2017: Open Source Big Data in the Cloud: Hadoop, M/R, Hive, Spark and Kafka
#1
Hadoop
What is Big Data?
• Volume: The large amount of data
• Variety: The wide range of data formats and schemas, including unstructured and semi-structured data
• Velocity: The speed at which data is created or consumed
• Oracle added another V to this definition:
– Value: Data has intrinsic value, but it must be discovered.
What is Oracle Big Data Cloud Compute Edition?
• Big Data platform that integrates Oracle's Big Data solution with open source tools
• Fully elastic
• Integrated with other PaaS services such as Database Cloud Service, MySQL Cloud Service, and Event Hub Cloud Service
• Access, data, and network security
• REST access to all the functionality
Big Data Cloud Service – Compute Edition (BDCS-CE)
BDCS-CE Notebook: Interactive Analysis
• Apache Zeppelin Notebook (version 0.7) to interactively work with data
What is Hadoop?
• An open source software platform for distributed storage and processing
• Manages huge volumes of unstructured data
• Parallel processing of large data sets
• Highly scalable
• Fault-tolerant
• Two main components:
– HDFS: Hadoop Distributed File System, for storing information
– MapReduce: programming framework that processes information
Hadoop Components: HDFS
• Stores the data on the cluster
• NameNode: block registry
• DataNode: the block containers themselves
• HDFS cartoon by Mvarshney
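To make the NameNode / DataNode split concrete, here is a minimal in-memory sketch in Python. It is not the HDFS API: the class names, the tiny block size, and the round-robin placement are assumptions made purely for illustration, and the block replication that real HDFS performs is left out.

```python
# Toy model of the HDFS split between block registry and block storage.
# All names are hypothetical; this is not the HDFS client API.

BLOCK_SIZE = 8  # bytes; unrealistically small so the example stays readable


class DataNode:
    """Holds the block contents themselves."""

    def __init__(self, name):
        self.name = name
        self.blocks = {}  # block id -> bytes


class NameNode:
    """Block registry: knows which blocks make up a file and where they live."""

    def __init__(self, datanodes):
        self.datanodes = datanodes
        self.registry = {}  # file name -> list of (block id, DataNode)

    def write(self, filename, data):
        entries = []
        for start in range(0, len(data), BLOCK_SIZE):
            index = start // BLOCK_SIZE
            block_id = f"{filename}#{index}"
            # Assumed placement policy: simple round-robin over the DataNodes.
            node = self.datanodes[index % len(self.datanodes)]
            node.blocks[block_id] = data[start:start + BLOCK_SIZE]
            entries.append((block_id, node))
        self.registry[filename] = entries

    def read(self, filename):
        # Look up the block list, then fetch each block from its DataNode.
        return b"".join(node.blocks[bid] for bid, node in self.registry[filename])


nodes = [DataNode("dn1"), DataNode("dn2"), DataNode("dn3")]
namenode = NameNode(nodes)
namenode.write("log.txt", b"big data in the cloud")
restored = namenode.read("log.txt")
```

The registry only stores block ids and locations; the bytes live on the DataNodes, which is the division of labour the slide describes.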
Hadoop Components: MapReduce
• Retrieves data from HDFS
• A MapReduce program is composed of:
– Map() method: performs filtering and sorting of the <key, value> inputs
– Reduce() method: summarizes the <key, value> pairs provided by the mappers
• Code can be written in many languages (Perl, Python, Java, etc.)
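The Map() / Reduce() division can be sketched in plain Python with a word count. This is a single-process illustration, not Hadoop code: the function names and sample lines are made up, and the `shuffle` step stands in for the grouping and sorting that the framework performs between the two phases.

```python
from collections import defaultdict


def map_phase(line):
    """Map(): emit a <word, 1> pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group values by key; sorting the keys mimics the framework's sort step."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(sorted(groups.items()))


def reduce_phase(key, values):
    """Reduce(): summarize all values that share one key."""
    return key, sum(values)


lines = ["Hadoop stores data", "Spark and Hadoop process data"]
mapped = [pair for line in lines for pair in map_phase(line)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
```

In a real cluster the mappers and reducers run on different nodes and the shuffle moves data across the network; here everything happens in one list.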
MapReduce Example
Code Example
Code Example
#2
Hive
What is Hive?
• Open source data warehouse software on top of Apache Hadoop
• Analyzes and queries data stored in HDFS
• Structures the data into tables
• Tools for simple ETL
• SQL-like queries (HiveQL)
• Procedural language with HPL/SQL
• Metadata storage in an RDBMS
Hadoop & Hive Demo
#3
Spark
Revisited: Map Reduce I/O
Source: Hadoop Application Architecture book
Spark
• Orders of magnitude faster than M/R
• Higher-level Scala, Java, or Python API
• Runs standalone, in Hadoop, or on Mesos
• Principle: Run an operation on all data
-> "Spark is the new MapReduce"
• See also: Apache Storm, etc.
• Uses RDDs, DataFrames, or Datasets
https://stackoverflow.com/questions/31508083/difference-between-dataframe-in-spark-2-0-i-e-datasetrow-and-rdd-in-spark
https://www.usenix.org/system/files/conference/nsdi12/nsdi12-final138.pdf
RDDs
Resilient Distributed Datasets
Where do they come from?
Collection of data grouped into named columns.
Supports text, JSON, Apache Parquet, sequence.
Read in: HDFS, local FS, S3, HBase
Parallelize: existing collection
Transform: other RDD
-> RDDs are immutable
Lazy Evaluation
Transformations (nothing is executed):
map(), flatMap(), reduceByKey(), groupByKey()
Actions (execution):
collect(), count(), first(), takeOrdered(), saveAsTextFile(), …
http://spark.apache.org/docs/2.1.1/programming-guide.html
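The difference between transformations and actions can be shown without Spark. The sketch below is a toy stand-in, not the RDD implementation: `LazyDataset`, `flat_map`, and `shout` are hypothetical names. Transformations only record a step and return a new dataset; the recorded steps run when an action such as `collect()` is called.

```python
class LazyDataset:
    """Records transformations and runs them only when an action is called."""

    def __init__(self, source, steps=()):
        self._source = source
        self._steps = steps  # tuple of functions: iterable -> iterable

    # Transformations: return a new dataset, execute nothing.
    def map(self, func):
        step = lambda items: (func(x) for x in items)
        return LazyDataset(self._source, self._steps + (step,))

    def flat_map(self, func):
        step = lambda items: (y for x in items for y in func(x))
        return LazyDataset(self._source, self._steps + (step,))

    # Actions: walk the recorded steps and produce a result.
    def collect(self):
        data = iter(self._source)
        for step in self._steps:
            data = step(data)
        return list(data)

    def count(self):
        return len(self.collect())


calls = []


def shout(word):
    calls.append(word)  # side effect lets us observe when work happens
    return word.upper()


words = LazyDataset(["big data", "in the cloud"]).flat_map(str.split).map(shout)
calls_before_action = len(calls)  # still 0: nothing has run yet
result = words.collect()          # the action triggers execution
```

Because every transformation returns a new object and leaves the original untouched, the sketch also mirrors the immutability of RDDs.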
Transformations
map(func): Return a new distributed dataset formed by passing each element of the source through a function func.
flatMap(func): Similar to map, but each input item can be mapped to 0 or more output items (so func should return a Seq rather than a single item).
reduceByKey(func, [numTasks]): When called on a dataset of (K, V) pairs, returns a dataset of (K, V) pairs where the values for each key are aggregated using the given reduce function func, which must be of type (V,V) => V.
groupByKey([numTasks]): When called on a dataset of (K, V) pairs, returns a dataset of (K, Iterable<V>) pairs.
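The semantics of these four transformations can be checked on ordinary Python lists. The helpers below are local stand-ins written for this illustration, not Spark calls. Keys are sorted here only to make the output deterministic; that ordering is an assumption of the sketch, not something the transformations promise.

```python
from functools import reduce
from itertools import groupby
from operator import itemgetter


def flat_map(func, items):
    """Each input item may yield zero or more output items."""
    return [out for item in items for out in func(item)]


def group_by_key(pairs):
    """(K, V) pairs -> (K, [V, ...]) pairs."""
    ordered = sorted(pairs, key=itemgetter(0))
    return [(k, [v for _, v in grp]) for k, grp in groupby(ordered, key=itemgetter(0))]


def reduce_by_key(func, pairs):
    """(K, V) pairs -> (K, V) pairs, folding the values of each key with func."""
    return [(k, reduce(func, values)) for k, values in group_by_key(pairs)]


pairs = [("a", 1), ("b", 5), ("a", 3)]
squares = list(map(lambda x: x * x, [1, 2, 3]))   # map: one output per input
chars = flat_map(list, ["ab", "", "c"])           # flatMap: 0..n outputs per input
grouped = group_by_key(pairs)
summed = reduce_by_key(lambda x, y: x + y, pairs)
```

Note how the empty string contributes nothing to `chars`, and how `reduce_by_key` returns one value per key where `group_by_key` returns the full list.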
Spark Demo
Apache Zeppelin Notebook
Word Count and Histogram
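# t is an RDD of text lines (created earlier in the notebook)
res = (t.flatMap(lambda line: line.split(" "))    # split each line into words
        .map(lambda word: (word, 1))              # emit (word, 1) pairs
        .reduceByKey(lambda a, b: a + b))         # sum the counts per word
# the five most frequent words: order by descending count
res.takeOrdered(5, key=lambda x: -x[1])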
Zeppelin Notebooks
Big Data Compute Service CE
#4
Kafka
Kafka
Partitioned, replicated commit log
Immutable log: messages with offset (0 1 2 3 4 … n)
Producer, Consumer A, Consumer B
https://www.quora.com/Kafka-writes-every-message-to-broker-disk-Still-performance-wise-it-is-better-than-some-of-the-in-memory-message-storing-message-queues-Why-is-that
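The idea of an immutable log with offsets can be sketched in a few lines of Python. This is a single-process toy, not the Kafka client API: `CommitLog`, `Consumer`, and the order messages are hypothetical. It shows that messages are only appended, and that each consumer keeps its own offset, so Consumer A and Consumer B read the same log independently.

```python
class CommitLog:
    """Append-only list of messages; a message's index is its offset."""

    def __init__(self):
        self._messages = []

    def append(self, message):
        self._messages.append(message)
        return len(self._messages) - 1  # offset assigned to the new message

    def read(self, offset, max_messages=10):
        return self._messages[offset:offset + max_messages]


class Consumer:
    """Tracks its own position; reading does not remove anything from the log."""

    def __init__(self, log):
        self._log = log
        self.offset = 0

    def poll(self, max_messages=10):
        batch = self._log.read(self.offset, max_messages)
        self.offset += len(batch)
        return batch


log = CommitLog()
for order in ["order-1", "order-2", "order-3"]:
    log.append(order)

consumer_a = Consumer(log)
consumer_b = Consumer(log)
first_a = consumer_a.poll(max_messages=2)  # A reads two messages
all_b = consumer_b.poll()                  # B independently reads all three
rest_a = consumer_a.poll()                 # A continues from its own offset
```

Partitioning and replication, the other two words in the slide title, are not modelled here.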
[Diagram labels: Broker1, Broker2, Broker3; Topic A (1), (2), (3): partition / leader; Repl A (1), (2), (3): replication / follower; Producer; three ZooKeeper nodes: state / HA]
https://www.confluent.io/blog/publishing-apache-kafka-new-york-times/
- 1 topic
- 1 partition
- Contains every article published since 1851
- Multiple producers / consumers
Example for Stream / Table Duality
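Stream / table duality means that replaying a log of changes yields a table of current state. The sketch below illustrates this with made-up article events; the ids, headlines, and the `materialize` helper are assumptions for the example and do not come from the linked post.

```python
def materialize(stream):
    """Replay a log of (key, value) events into a table of latest values."""
    table = {}
    for key, value in stream:
        table[key] = value  # a later event for the same key replaces the earlier one
    return table


# Hypothetical events: (article id, headline), oldest first.
article_log = [
    ("a1", "Draft headline"),
    ("a2", "Second article"),
    ("a1", "Corrected headline"),
]

table = materialize(article_log)               # current state
table_at_offset_2 = materialize(article_log[:2])  # state before the correction
```

Because the log keeps every event, the table can be rebuilt for any point in time by replaying a prefix of the log.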
Kafka Clients
SDKs (high / low level Kafka API)
- OOTB: Java, Scala
- Confluent: Python, C, C++
Connect (configuration only; integrate external systems)
- Confluent: HDFS sink, JDBC source, S3 sink, Elasticsearch sink
- Plugin .jar file
- JDBC: Change data capture (CDC)
Streams (data in motion; stream / table duality)
- Real-time data ingestion
- Microservices
- KSQL: SQL streaming engine for streaming ETL, anomaly detection, monitoring
- .jar file runs anywhere
REST (lightweight)
- Language agnostic
- Easy for mobile apps
- Easy to tunnel through FW etc.
Oracle Event Hub Cloud Service
• PaaS: Managed Kafka 0.10.2
• Two deployment modes
– Basic (Broker and ZK on 1 node)
– Recommended (distributed)
• REST Proxy
– Separate server(s) running REST Proxy
Event Hub
Event Hub Service
Ports
You must open ports to allow access for external clients
• Kafka Broker (from OPC connect string)
• ZooKeeper with port 2181
Scaling
horizontal / vertical (up)
Event Hub REST Interface
https://129.151.91.31:1080/restproxy/topics/a12345orderTopic
Service = Topic
Interesting to Know
• Event Hub topics are prefixed with the ID domain
• Topics with the ID domain prefix can be created with the Kafka CLI
• Topics without the ID domain prefix are not shown in the OPC console
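Reading the example URL on the REST interface slide together with the ID-domain prefix rule, a topic address could be assembled as sketched below. This rests on assumptions: that `a12345` in the slide's URL is the ID domain, that the prefix is plain concatenation with no separator, and that `topic_url` is a hypothetical helper rather than part of any Oracle SDK. The code only builds a string and sends no request.

```python
from urllib.parse import quote


def topic_url(base_url, id_domain, topic):
    """Build the REST proxy URL for a topic, prefixing the ID domain."""
    # Assumption: the full topic name is the ID domain followed directly by the topic.
    full_name = f"{id_domain}{topic}"
    return f"{base_url.rstrip('/')}/restproxy/topics/{quote(full_name, safe='')}"


url = topic_url("https://129.151.91.31:1080", "a12345", "orderTopic")
```

Check the connect string and topic names shown in your own OPC console before relying on this pattern.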
#5
Conclusion
TL;DR #bigData #openSource #OPC
OpenSource: entry point to Oracle Big
Data world / Low(er) setup times /
Check for resource usage & limits in
Big Data OPC / BDCS-CE: managed
Hadoop, Hive, Spark + Event hub:
Kafka / Attend a hands-on workshop! /
Next level: Oracle Big Data tools
@EdelweissK
@FrankMunz
www.linkedin.com/in/frankmunz/
www.munzandmore.com/blog
facebook.com/cloudcomputingbook
facebook.com/weblogicbook
@frankmunz
youtube.com/weblogicbook
-> more than 50 web casts
Don’t be shy :)
email: ekammermann@itconvergence.com
Twitter: @EdelweissK
3 Membership Tiers
• Oracle ACE Director
• Oracle ACE
• Oracle ACE Associate
bit.ly/OracleACEProgram
500+ Technical Experts Helping Peers Globally
Connect:
Nominate yourself or someone you know: acenomination.oracle.com
@oracleace
Facebook.com/oracleaces
oracle-ace_ww@oracle.com
Sign up for Free Trial
http://cloud.oracle.com
