Modern Lambda architecture in Big Data
Piotr Hejwowski
Hello world :)
■ Who am I ?
■ Java developer working at Codete
■ Keen on Big Data and modern backend approaches
■ Luckily, I can develop this passion at Codete
■ https://github.com/Hejwo
■ piotr.hejwowski@codete.com
■ Disclaimer 1 - we will use Polish, but with a lot of English, business-specific terms.
■ Disclaimer 2 - The discipline is large, so we are going to cover only the bigger picture
■ Disclaimer 3 - Live coding ? Next time
■ Disclaimer 4 - From zero to hero style
Recap & Intro
■ Recap - at the end of the last GDG meetup we talked about Machine Learning
■ We talked about the difference between Data Science and Big Data, which are often confused
Recap
Data science
■ Data science, also known as data-driven science, is an interdisciplinary field about scientific methods and processes to extract knowledge or insights from data in various forms, either structured or unstructured
■ Data science focuses on the availability and cleaning of gathered data, math, statistics, business understanding and extracting valuable information
Big data
■ Modern methods of gathering and processing big volumes of data
■ More info in next 40 mins ;)
What’s Big Data ?
■ The amount of our data is getting larger and larger
■ The Internet of Things plays an important role in this -> sensors, sensors are everywhere !
■ At some point EVEN business guys discovered that there’s great value behind unstructured data
■ ETLs on a massive scale
■ Recommendation systems based on FB likes
■ Analysing user traffic on e-shops and optimizing content
■ Raw data from car sensors
■ Optimizing traffic like in Lublin :)
■ The POTENTIAL and AMOUNT of data that we need are HUGE
■ Fun fact - having raw data means that we don’t need to know up front what we’re looking for, and that’s great !!!
■ Discovering new relations in our data
But… When Big Data ?
How to process Big Data ?
Moore’s law is dying [*]
“Moore's law is the observation that the number of transistors in a
dense integrated circuit doubles approximately every two years”
■ Right now every further increase in transistor density is getting more and more expensive.
■ New processors are getting more and more expensive.
■ Until now we could rely on Moore's law: if our infrastructure was not doing well, after two years we could have a faster one for approximately the same cost.
■ But… we still have many cores. But… sometimes distributing work across many cores is still not enough.
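A quick back-of-the-envelope check of the doubling rule quoted above. This is only arithmetic on the stated two-year period; the function name `growth_factor` is made up for this illustration and the numbers are not measurements of real hardware.

```python
def growth_factor(years: float, doubling_period_years: float = 2.0) -> float:
    """Multiplier implied by doubling once every `doubling_period_years`."""
    return 2.0 ** (years / doubling_period_years)


if __name__ == "__main__":
    # Print the implied multiplier for a few horizons.
    for years in (2, 4, 10):
        print(f"after {years} years: x{growth_factor(years):.0f}")
```

The point of the slide is the reverse reading: once the doubling period stretches, waiting two years no longer buys the same speed-up for the same money.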
How to process Big Data ? - Scale up vs. Scale out
Scale up
■ Costly components
■ Complex application/system logic, often multithreaded
■ Poor fault tolerance
■ The machine gets as hot as Mordor.
Scale out
■ Cheaper machines
■ Simpler application and system logic
■ Thanks to orchestration tools such as Mesos and Kubernetes it’s not THAT hard to maintain.
■ Fault tolerance - if half of our machines explode we can still do something
■ Needs data centers :(
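A minimal sketch of the scale-out idea, assuming work is spread over machines by hashing a key. The node names, the `assign` helper and the CRC32 hash are illustrative assumptions; orchestrators such as Mesos or Kubernetes do far more than this.

```python
import zlib
from collections import defaultdict


def assign(keys, nodes):
    """Map every key to one of the live nodes using a stable hash."""
    placement = defaultdict(list)
    for key in keys:
        index = zlib.crc32(key.encode("utf-8")) % len(nodes)
        placement[nodes[index]].append(key)
    return dict(placement)


keys = [f"sensor-{i}" for i in range(1000)]
nodes = ["node-a", "node-b", "node-c", "node-d"]

before = assign(keys, nodes)

# Half of the machines "explode": the two survivors take over every key.
after = assign(keys, nodes[:2])
```

No key is lost when nodes disappear, which is the fault-tolerance point. Plain modulo hashing moves many keys between nodes on every change, so a real system would also have to move or re-replicate the data behind them.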
Meet Apache Spark - Big Data processing engine !
■ Created at the University of California, Berkeley
■ In the beginning it was a proof of concept for the Mesos cluster manager
■ Much faster than its father - Hadoop (MapReduce)
■ By default it operates in memory.
■ No frequent disk writes means more speed
■ Rich and simple caching mechanism
■ There are a ton of other Big Data processing engines - Hadoop, Storm, Flink, Splunk
■ We're going to focus on Spark due to limited time
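To make the "operates in memory" and caching bullets concrete, here is a toy sketch in plain Python. `LazyDataset` is a made-up class for illustration only and is NOT Spark's API; it just shows why keeping an intermediate result in memory avoids recomputing it for every action.

```python
class LazyDataset:
    """Toy lazily evaluated dataset (illustrative stand-in, not Spark)."""

    def __init__(self, compute):
        self._compute = compute   # function that produces the full list
        self._cached = None       # in-memory copy, filled on first use
        self._use_cache = False
        self.evaluations = 0      # how many times the data was recomputed

    def map(self, fn):
        # Nothing runs here; the work happens when collect() is called.
        return LazyDataset(lambda: [fn(x) for x in self.collect()])

    def cache(self):
        self._use_cache = True
        return self

    def collect(self):
        if self._use_cache and self._cached is not None:
            return self._cached
        self.evaluations += 1
        result = self._compute()
        if self._use_cache:
            self._cached = result
        return result


# Without caching, every action recomputes the source.
source = LazyDataset(lambda: list(range(5)))
squares = source.map(lambda x: x * x)
squares.collect()
squares.collect()

# With caching, the source is computed once and then served from memory.
cached_source = LazyDataset(lambda: list(range(5))).cache()
doubled = cached_source.map(lambda x: 2 * x)
doubled.collect()
doubled.collect()
```

After running this, `source.evaluations` is 2 while `cached_source.evaluations` is 1.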
Is Big Data processing THE only direction ?
Spark is faster than Hadoop, but still… it’s heavy machinery
Is Big Data THE only direction ?
Reactive Manifesto
■ Responsive - What happens when Wi-Fi is down ? Users want FAST responses !
■ Elastic - Large systems tend to have frequent, massive loads
■ Resilient - The system must stay available, and any kind of response is better than no response.
■ Message Driven - isolation and non-blocking behaviour are achieved via async communication. Thanks to that we have clear boundaries, isolation and transparency.
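The "Message Driven" point can be sketched with Python's standard `asyncio` library: two components that never call each other directly and only exchange messages through a queue. The function names and the upper-casing "handler" are assumptions made for the example.

```python
import asyncio


async def producer(queue, events):
    for event in events:
        await queue.put(event)    # hand the message over, no direct call
    await queue.put(None)         # sentinel: no more messages


async def consumer(queue, handled):
    while True:
        event = await queue.get() # waits without blocking the event loop
        if event is None:
            break
        handled.append(event.upper())


async def main(events):
    queue = asyncio.Queue(maxsize=2)  # a bounded queue gives back-pressure
    handled = []
    await asyncio.gather(producer(queue, events), consumer(queue, handled))
    return handled


def run_demo(events):
    return asyncio.run(main(events))
```

The queue is the boundary: the producer does not know who consumes, and a slow consumer only slows the producer down instead of blocking it inside a call.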
How to achieve these two goals ? Let’s go lambda !
Meet our system's heart - Apache Kafka
■ Lightning-fast messaging system
■ Heart of a Big Data system
■ Distributed
■ Built by LinkedIn
■ Written in Scala and Java
■ Producers and Consumers concept
■ Automatic recovery, broker detection
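The producers-and-consumers concept can be sketched as an append-only log in which every consumer group tracks its own read position. `TopicLog` and its methods are hypothetical names for this toy and are not the Kafka client API; a real topic is also partitioned and replicated across brokers.

```python
class TopicLog:
    """Toy append-only log with per-group offsets (not the Kafka API)."""

    def __init__(self):
        self._messages = []
        self._offsets = {}        # consumer group -> next offset to read

    def produce(self, message):
        self._messages.append(message)
        return len(self._messages) - 1   # offset of the new message

    def poll(self, group, max_messages=10):
        start = self._offsets.get(group, 0)
        batch = self._messages[start:start + max_messages]
        self._offsets[group] = start + len(batch)
        return batch

    def seek(self, group, offset):
        # Rewinding is cheap because messages are not deleted on read.
        self._offsets[group] = offset


log = TopicLog()
for payload in ("a", "b", "c"):
    log.produce(payload)

first_read = log.poll("billing")      # billing sees everything once
second_read = log.poll("billing")     # nothing new for billing
other_group = log.poll("analytics")   # an independent group starts from 0
log.seek("billing", 1)
replayed = log.poll("billing")        # billing re-reads from offset 1
```

Because reading only moves an offset, several groups can consume the same data independently and any of them can replay it.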
We’ve got two parts of the puzzle !
Spark Streaming - when batch is not enough
■ Spark Streaming makes it easy to build scalable fault-tolerant streaming applications.
■ By running on Spark, Spark Streaming lets you reuse the same code for batch processing, join streams against historical data, or run ad-hoc queries on stream state.
■ Works as rapid micro-batching
■ Before Spark Streaming, building complex pipelines that encompass streaming, batch, or even machine learning capabilities with open source software meant dealing with multiple frameworks
■ Streaming ETL – Data is continuously cleaned and aggregated before being pushed into data stores. No more SAP.
■ Triggers – Anomalous behavior is detected in real-time and further downstream actions are triggered accordingly.
■ Data enrichment – Live data is enriched with more information by joining it with a static dataset, allowing for a more complete real-time analysis.
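The three use cases above (streaming ETL, triggers, data enrichment) can be sketched on top of micro-batches in plain Python. This is not Spark Streaming code: the batches here are cut by count to keep the example deterministic, whereas Spark Streaming cuts them by a time interval. The sensor names, the lookup table and the threshold are made-up assumptions.

```python
from itertools import islice


def micro_batches(stream, batch_size):
    """Yield consecutive lists of at most `batch_size` events."""
    iterator = iter(stream)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


# Static dataset used for enrichment (hypothetical values).
SENSOR_LOCATIONS = {"s1": "Lublin", "s2": "Warsaw"}


def process(batch, threshold=100.0):
    # Streaming ETL: drop readings without a value.
    cleaned = [e for e in batch if e.get("value") is not None]
    # Data enrichment: join each event with the static lookup table.
    enriched = [
        {**e, "city": SENSOR_LOCATIONS.get(e["sensor"], "unknown")}
        for e in cleaned
    ]
    # Trigger: flag anomalous readings for downstream actions.
    alerts = [e for e in enriched if e["value"] > threshold]
    return enriched, alerts


events = [
    {"sensor": "s1", "value": 42.0},
    {"sensor": "s2", "value": None},
    {"sensor": "s3", "value": 120.5},
    {"sensor": "s1", "value": 99.0},
    {"sensor": "s2", "value": 150.0},
]

results = [process(batch) for batch in micro_batches(events, batch_size=2)]
```

Note that `process` works on any list of events, so the same function could be run over a full historical dataset as a batch job.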
We're done with the puzzle !
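A minimal sketch of how the pieces are commonly combined in a Lambda architecture, assuming the usual layout: a batch view recomputed periodically from all stored data, a real-time view fed from the stream, and a query that merges both. The page-view events and function names are illustrative assumptions.

```python
from collections import Counter


def count_pages(events):
    """The same counting logic serves both the batch and the streaming path."""
    return Counter(event["page"] for event in events)


def query(page, batch_view, realtime_view):
    """Answer = precomputed batch view + what arrived since the last batch run."""
    return batch_view.get(page, 0) + realtime_view.get(page, 0)


historical = [{"page": "/home"}, {"page": "/cart"}, {"page": "/home"}]
recent = [{"page": "/home"}]

batch_view = count_pages(historical)   # recomputed periodically from all data
realtime_view = count_pages(recent)    # kept up to date from the stream
```

Here `query("/home", batch_view, realtime_view)` returns 3: two views from the batch layer plus one that has only been seen by the streaming path so far.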
Now… let’s store it ! NoSQL store it !
Why NoSQL ?
■ Large datasets
■ Easy to scale out
■ Less schema validation on write means faster writes
■ Schemaless databases can be of great value in Big Data, since we sometimes don’t know what we need and we want to keep our data raw ("dirty").
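A small sketch of what "schemaless" buys us, assuming JSON documents kept in a plain list as a stand-in for a document store. The field names are invented; the idea is that nothing is validated on write and the schema is applied only when reading.

```python
import json

raw_documents = [
    '{"sensor": "s1", "temp_c": 21.5}',
    '{"sensor": "s2", "temp_c": 19.0, "humidity": 0.4}',
    '{"sensor": "s3", "speed_kmh": 52}',
]

# Write path: store whatever arrives, no schema check.
store = [json.loads(doc) for doc in raw_documents]


def average(documents, field):
    """Schema on read: use only the documents that carry `field`."""
    values = [doc[field] for doc in documents if field in doc]
    return sum(values) / len(values) if values else None
```

Documents with different shapes live side by side, and a question nobody planned for (say, average humidity) can still be asked later without a migration.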
Why it’s modern ?
■ Fast, reliable
■ More like - write once, run everywhere thanks to Spark, Spark Shell, Zeppelin
■ Less code (Hadoop’s MapReduce ? It’s an essay)
■ Compared to the older approach - less chaos thanks to Kafka.
Cons
■ More like micro-batching than real time
■ A lot of stuff is still evolving (Spark, Kafka) and doesn’t come with professional customer support
■ Things tend to get complicated when Kafka messages within a single topic evolve
■ DevOps needed, strong developers needed
■ The distributed world is a complicated world
■ Thousands of frameworks and ideas every year
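The point about messages evolving within a single topic can be illustrated with a tolerant reader: old and new message shapes arrive interleaved, and the consumer has to normalise them. The `version` and `currency` fields and the default value are hypothetical; formats such as Apache Avro define schema-resolution rules for exactly this problem.

```python
import json

# Hypothetical field that was added in version 2 of the message.
DEFAULTS = {"currency": "PLN"}
LATEST_VERSION = 2


def read_order(raw: str) -> dict:
    """Accept version 1 and 2 messages and return the latest known shape."""
    message = json.loads(raw)
    version = message.get("version", 1)   # version 1 had no version field
    if version > LATEST_VERSION:
        raise ValueError(f"unsupported schema version {version}")
    order = {**DEFAULTS, **message}       # fill in fields old messages lack
    order["version"] = LATEST_VERSION
    return order


old = read_order('{"id": 7, "amount": 10}')
new = read_order('{"version": 2, "id": 8, "amount": 5, "currency": "EUR"}')
```

Every new field needs a default and every consumer needs this kind of branch, which is why uncoordinated changes to a topic get complicated quickly.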
What next ?
Apache Spark resources :
■ http://spark.apache.org/
■ https://hortonworks.com/tutorials/
■ https://codete.com/blog/
Apache Kafka resources :
■ http://kafka.apache.org/
■ http://blog.cloudera.com/blog/2014/09/apache-kafka-for-beginners/
NoSQL resources :
■ http://openmymind.net/2011/8/15/How-You-Should-Go-About-Learning-NoSQL/
Sources
■ Internet in a minute : http://www.visualcapitalist.com/what-happens-internet-minute-2016/
■ Big Data and V4’s : https://www.linkedin.com/pulse/20140306073407-64875646-big-data-the-5-vs-everyone-must-know
■ Moore’s law : https://en.wikipedia.org/wiki/Moore%27s_law
■ Apache Spark : http://horicky.blogspot.com/2013/12/spark-low-latency-massively-parallel.html
■ Apache Kafka : https://softwareengineeringdaily.com/2015/08/06/kafka-with-guozhang-wang/
■ Spark Streaming : http://ingest.tips/2015/06/24/real-time-analytics-with-kafka-and-spark-streaming/
■ NoSQL : https://www.getfilecloud.com/blog/2014/08/leading-nosql-databases-to-consider/
Thank you ? No. Thank YOU
Spark ? Interesting alternative to ETL Hell
■ SAP, SAS, Elixir
■ ETL has nice visual building blocks, but this means....
■ … Click, Click, Click, Click… (RSI danger !)
■ Building blocks mean that plain-text code is hidden in stages. Hard to debug, hard to unit test.
■ Waste of resources. ETL jobs are fired at night and need peak performance then. For the rest of the day the resources sit unused.
■ Data gets out of sync, so the ETL pipeline gets out of sync.
■ In the Big Data world we have Apache Avro and a schema registry
■ Big Data can handle more
■ Legacy code
■ $$$ It’s for FREE $$$
■ We can throw Machine Learning into it and do interesting things. Not only batches.
■ Lacks professional support
■ Big Data is not that mature
■ Let’s see what will happen here
Apache Kafka vs. RabbitMQ
Kafka :
■ + Fire hose of events (100k+/sec)
■ + Ability to re-read messages (good for CQRS)
■ + Scale out
■ + Confluent -> Kafka Connect, Kafka Streams, Schema Registry
■ - You have to be fine with supporting it on your own
■ - No AMQP and no complex routing
RabbitMQ :
■ + Messages can be routed to consumers in complex ways
■ + Mature - You like yelling at support guys rather than fixing it yourself ? This is the place for you !
■ + Scale out
■ - Lower throughput (20k+ messages/sec)
■ - Messages are deleted after consumers ack them
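To show what "complex routing" means on the RabbitMQ side, here is a sketch of AMQP-style topic matching, where `*` stands for exactly one dot-separated word and `#` for zero or more words. This is plain Python illustrating the matching rule, not RabbitMQ client code; the queue names and binding patterns are invented.

```python
def matches(pattern: str, routing_key: str) -> bool:
    """AMQP-style topic match: '*' is exactly one word, '#' is zero or more."""

    def match(p, k):
        if not p:
            return not k
        head, rest = p[0], p[1:]
        if head == "#":
            # '#' may swallow zero or more words of the key.
            return any(match(rest, k[i:]) for i in range(len(k) + 1))
        if not k:
            return False
        if head == "*" or head == k[0]:
            return match(rest, k[1:])
        return False

    return match(pattern.split("."), routing_key.split("."))


# Hypothetical bindings: queue name -> binding pattern.
BINDINGS = {
    "invoices": "orders.*.created",
    "audit": "orders.#",
    "eu-reports": "*.eu.*",
}


def route(routing_key, bindings=BINDINGS):
    """Return the queues that would receive a message with this routing key."""
    return sorted(q for q, pattern in bindings.items() if matches(pattern, routing_key))
```

With these bindings, a message published as `orders.eu.created` lands in all three queues, while `orders.us.cancelled` reaches only `audit`. In Kafka the producer picks a topic and consumers subscribe to whole topics, so this kind of per-message fan-out has to be built by hand.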
Google Developer Group Lublin 8 - Modern Lambda architecture in Big Data
Ad

More Related Content

What's hot (19)

Developing high frequency indicators using real time tick data on apache supe...
Developing high frequency indicators using real time tick data on apache supe...Developing high frequency indicators using real time tick data on apache supe...
Developing high frequency indicators using real time tick data on apache supe...
Zekeriya Besiroglu
 
Big Data Anti-Patterns: Lessons From the Front LIne
Big Data Anti-Patterns: Lessons From the Front LIneBig Data Anti-Patterns: Lessons From the Front LIne
Big Data Anti-Patterns: Lessons From the Front LIne
Douglas Moore
 
Not Just Another Overview of Apache Hadoop
Not Just Another Overview of Apache HadoopNot Just Another Overview of Apache Hadoop
Not Just Another Overview of Apache Hadoop
Adaryl "Bob" Wakefield, MBA
 
Testing Big Data in AWS - Sept 2021
Testing Big Data in AWS - Sept 2021Testing Big Data in AWS - Sept 2021
Testing Big Data in AWS - Sept 2021
Michael98364
 
Technology behind-real-time-log-analytics
Technology behind-real-time-log-analytics Technology behind-real-time-log-analytics
Technology behind-real-time-log-analytics
Data Science Thailand
 
Big Data Day LA 2015 - Machine Learning on Largish Data by Szilard Pafka of E...
Big Data Day LA 2015 - Machine Learning on Largish Data by Szilard Pafka of E...Big Data Day LA 2015 - Machine Learning on Largish Data by Szilard Pafka of E...
Big Data Day LA 2015 - Machine Learning on Largish Data by Szilard Pafka of E...
Data Con LA
 
Big Data Processing Using Hadoop Infrastructure
Big Data Processing Using Hadoop InfrastructureBig Data Processing Using Hadoop Infrastructure
Big Data Processing Using Hadoop Infrastructure
Dmitry Buzdin
 
Bigdata : Big picture
Bigdata : Big pictureBigdata : Big picture
Bigdata : Big picture
Zekeriya Besiroglu
 
Impala turbocharge your big data access
Impala   turbocharge your big data accessImpala   turbocharge your big data access
Impala turbocharge your big data access
Ophir Cohen
 
Apache Spark Briefing
Apache Spark BriefingApache Spark Briefing
Apache Spark Briefing
Thomas W. Dinsmore
 
Atlanta hadoop users group july 2013
Atlanta hadoop users group july 2013Atlanta hadoop users group july 2013
Atlanta hadoop users group july 2013
Christopher Curtin
 
Big data pipelines
Big data pipelinesBig data pipelines
Big data pipelines
Vivek Aanand Ganesan
 
Datascience lab 2017 odessa kappa architecture 2.0
Datascience lab 2017 odessa   kappa architecture 2.0Datascience lab 2017 odessa   kappa architecture 2.0
Datascience lab 2017 odessa kappa architecture 2.0
Juantomás García Molina
 
Microservices, containers, and machine learning
Microservices, containers, and machine learningMicroservices, containers, and machine learning
Microservices, containers, and machine learning
Paco Nathan
 
Big Data is changing abruptly, and where it is likely heading
Big Data is changing abruptly, and where it is likely headingBig Data is changing abruptly, and where it is likely heading
Big Data is changing abruptly, and where it is likely heading
Paco Nathan
 
introduction to Neo4j (Tabriz Software Open Talks)
introduction to Neo4j (Tabriz Software Open Talks)introduction to Neo4j (Tabriz Software Open Talks)
introduction to Neo4j (Tabriz Software Open Talks)
Farzin Bagheri
 
Big Data with Apache Hadoop
Big Data with Apache HadoopBig Data with Apache Hadoop
Big Data with Apache Hadoop
InfoFarm
 
Amazon aws big data demystified | Introduction to streaming and messaging flu...
Amazon aws big data demystified | Introduction to streaming and messaging flu...Amazon aws big data demystified | Introduction to streaming and messaging flu...
Amazon aws big data demystified | Introduction to streaming and messaging flu...
Omid Vahdaty
 
Big Data in the Cloud - Montreal April 2015
Big Data in the Cloud - Montreal April 2015Big Data in the Cloud - Montreal April 2015
Big Data in the Cloud - Montreal April 2015
Cindy Gross
 
Developing high frequency indicators using real time tick data on apache supe...
Developing high frequency indicators using real time tick data on apache supe...Developing high frequency indicators using real time tick data on apache supe...
Developing high frequency indicators using real time tick data on apache supe...
Zekeriya Besiroglu
 
Big Data Anti-Patterns: Lessons From the Front LIne
Big Data Anti-Patterns: Lessons From the Front LIneBig Data Anti-Patterns: Lessons From the Front LIne
Big Data Anti-Patterns: Lessons From the Front LIne
Douglas Moore
 
Testing Big Data in AWS - Sept 2021
Testing Big Data in AWS - Sept 2021Testing Big Data in AWS - Sept 2021
Testing Big Data in AWS - Sept 2021
Michael98364
 
Technology behind-real-time-log-analytics
Technology behind-real-time-log-analytics Technology behind-real-time-log-analytics
Technology behind-real-time-log-analytics
Data Science Thailand
 
Big Data Day LA 2015 - Machine Learning on Largish Data by Szilard Pafka of E...
Big Data Day LA 2015 - Machine Learning on Largish Data by Szilard Pafka of E...Big Data Day LA 2015 - Machine Learning on Largish Data by Szilard Pafka of E...
Big Data Day LA 2015 - Machine Learning on Largish Data by Szilard Pafka of E...
Data Con LA
 
Big Data Processing Using Hadoop Infrastructure
Big Data Processing Using Hadoop InfrastructureBig Data Processing Using Hadoop Infrastructure
Big Data Processing Using Hadoop Infrastructure
Dmitry Buzdin
 
Impala turbocharge your big data access
Impala   turbocharge your big data accessImpala   turbocharge your big data access
Impala turbocharge your big data access
Ophir Cohen
 
Atlanta hadoop users group july 2013
Atlanta hadoop users group july 2013Atlanta hadoop users group july 2013
Atlanta hadoop users group july 2013
Christopher Curtin
 
Datascience lab 2017 odessa kappa architecture 2.0
Datascience lab 2017 odessa   kappa architecture 2.0Datascience lab 2017 odessa   kappa architecture 2.0
Datascience lab 2017 odessa kappa architecture 2.0
Juantomás García Molina
 
Microservices, containers, and machine learning
Microservices, containers, and machine learningMicroservices, containers, and machine learning
Microservices, containers, and machine learning
Paco Nathan
 
Big Data is changing abruptly, and where it is likely heading
Big Data is changing abruptly, and where it is likely headingBig Data is changing abruptly, and where it is likely heading
Big Data is changing abruptly, and where it is likely heading
Paco Nathan
 
introduction to Neo4j (Tabriz Software Open Talks)
introduction to Neo4j (Tabriz Software Open Talks)introduction to Neo4j (Tabriz Software Open Talks)
introduction to Neo4j (Tabriz Software Open Talks)
Farzin Bagheri
 
Big Data with Apache Hadoop
Big Data with Apache HadoopBig Data with Apache Hadoop
Big Data with Apache Hadoop
InfoFarm
 
Amazon aws big data demystified | Introduction to streaming and messaging flu...
Amazon aws big data demystified | Introduction to streaming and messaging flu...Amazon aws big data demystified | Introduction to streaming and messaging flu...
Amazon aws big data demystified | Introduction to streaming and messaging flu...
Omid Vahdaty
 
Big Data in the Cloud - Montreal April 2015
Big Data in the Cloud - Montreal April 2015Big Data in the Cloud - Montreal April 2015
Big Data in the Cloud - Montreal April 2015
Cindy Gross
 

Similar to Google Developer Group Lublin 8 - Modern Lambda architecture in Big Data (20)

SnappyData Toronto Meetup Nov 2017
SnappyData Toronto Meetup Nov 2017SnappyData Toronto Meetup Nov 2017
SnappyData Toronto Meetup Nov 2017
SnappyData
 
Big data nyu
Big data nyuBig data nyu
Big data nyu
Edward Capriolo
 
Stream, Stream, Stream: Different Streaming Methods with Apache Spark and Kafka
Stream, Stream, Stream: Different Streaming Methods with Apache Spark and KafkaStream, Stream, Stream: Different Streaming Methods with Apache Spark and Kafka
Stream, Stream, Stream: Different Streaming Methods with Apache Spark and Kafka
Databricks
 
NoSQL for great good [hanoi.rb talk]
NoSQL for great good [hanoi.rb talk]NoSQL for great good [hanoi.rb talk]
NoSQL for great good [hanoi.rb talk]
Huy Do
 
Big Data! Great! Now What? #SymfonyCon 2014
Big Data! Great! Now What? #SymfonyCon 2014Big Data! Great! Now What? #SymfonyCon 2014
Big Data! Great! Now What? #SymfonyCon 2014
Ricard Clau
 
Architecting Big Data Ingest & Manipulation
Architecting Big Data Ingest & ManipulationArchitecting Big Data Ingest & Manipulation
Architecting Big Data Ingest & Manipulation
George Long
 
Spark
SparkSpark
Spark
Nitish Upreti
 
MongoDB World 2019: MongoDB Cluster Design: From Redundancy to GDPR
MongoDB World 2019: MongoDB Cluster Design: From Redundancy to GDPRMongoDB World 2019: MongoDB Cluster Design: From Redundancy to GDPR
MongoDB World 2019: MongoDB Cluster Design: From Redundancy to GDPR
MongoDB
 
Stream, Stream, Stream: Different Streaming Methods with Spark and Kafka
Stream, Stream, Stream: Different Streaming Methods with Spark and KafkaStream, Stream, Stream: Different Streaming Methods with Spark and Kafka
Stream, Stream, Stream: Different Streaming Methods with Spark and Kafka
DataWorks Summit
 
Introduction to NetGuardians' Big Data Software Stack
Introduction to NetGuardians' Big Data Software StackIntroduction to NetGuardians' Big Data Software Stack
Introduction to NetGuardians' Big Data Software Stack
Jérôme Kehrli
 
Resilience: the key requirement of a [big] [data] architecture - StampedeCon...
Resilience: the key requirement of a [big] [data] architecture  - StampedeCon...Resilience: the key requirement of a [big] [data] architecture  - StampedeCon...
Resilience: the key requirement of a [big] [data] architecture - StampedeCon...
StampedeCon
 
Elastic Data Analytics Platform @Datadog
Elastic Data Analytics Platform @DatadogElastic Data Analytics Platform @Datadog
Elastic Data Analytics Platform @Datadog
C4Media
 
Tech Spark Presentation
Tech Spark PresentationTech Spark Presentation
Tech Spark Presentation
Stephen Borg
 
Pyspark presentationsfspfsjfspfjsfpsjfspfjsfpsjfsfsf
Pyspark presentationsfspfsjfspfjsfpsjfspfjsfpsjfsfsfPyspark presentationsfspfsjfspfjsfpsjfspfjsfpsjfsfsf
Pyspark presentationsfspfsjfspfjsfpsjfspfjsfpsjfsfsf
sasuke20y4sh
 
SQL or NoSQL, that is the question!
SQL or NoSQL, that is the question!SQL or NoSQL, that is the question!
SQL or NoSQL, that is the question!
Andraz Tori
 
Data Science at Scale: Using Apache Spark for Data Science at Bitly
Data Science at Scale: Using Apache Spark for Data Science at BitlyData Science at Scale: Using Apache Spark for Data Science at Bitly
Data Science at Scale: Using Apache Spark for Data Science at Bitly
Sarah Guido
 
Architecting Your First Big Data Implementation
Architecting Your First Big Data ImplementationArchitecting Your First Big Data Implementation
Architecting Your First Big Data Implementation
Adaryl "Bob" Wakefield, MBA
 
Hardware Provisioning
Hardware ProvisioningHardware Provisioning
Hardware Provisioning
MongoDB
 
Boston Spark User Group - Spark's Role at MediaCrossing - July 15, 2014
Boston Spark User Group - Spark's Role at MediaCrossing - July 15, 2014Boston Spark User Group - Spark's Role at MediaCrossing - July 15, 2014
Boston Spark User Group - Spark's Role at MediaCrossing - July 15, 2014
gmalouf678
 
The Hitchhiker's Guide to Machine Learning with Python & Apache Spark
The Hitchhiker's Guide to Machine Learning with Python & Apache SparkThe Hitchhiker's Guide to Machine Learning with Python & Apache Spark
The Hitchhiker's Guide to Machine Learning with Python & Apache Spark
Krishna Sankar
 
SnappyData Toronto Meetup Nov 2017
SnappyData Toronto Meetup Nov 2017SnappyData Toronto Meetup Nov 2017
SnappyData Toronto Meetup Nov 2017
SnappyData
 
Stream, Stream, Stream: Different Streaming Methods with Apache Spark and Kafka
Stream, Stream, Stream: Different Streaming Methods with Apache Spark and KafkaStream, Stream, Stream: Different Streaming Methods with Apache Spark and Kafka
Stream, Stream, Stream: Different Streaming Methods with Apache Spark and Kafka
Databricks
 
NoSQL for great good [hanoi.rb talk]
NoSQL for great good [hanoi.rb talk]NoSQL for great good [hanoi.rb talk]
NoSQL for great good [hanoi.rb talk]
Huy Do
 
Big Data! Great! Now What? #SymfonyCon 2014
Big Data! Great! Now What? #SymfonyCon 2014Big Data! Great! Now What? #SymfonyCon 2014
Big Data! Great! Now What? #SymfonyCon 2014
Ricard Clau
 
Architecting Big Data Ingest & Manipulation
Architecting Big Data Ingest & ManipulationArchitecting Big Data Ingest & Manipulation
Architecting Big Data Ingest & Manipulation
George Long
 
MongoDB World 2019: MongoDB Cluster Design: From Redundancy to GDPR
MongoDB World 2019: MongoDB Cluster Design: From Redundancy to GDPRMongoDB World 2019: MongoDB Cluster Design: From Redundancy to GDPR
MongoDB World 2019: MongoDB Cluster Design: From Redundancy to GDPR
MongoDB
 
Stream, Stream, Stream: Different Streaming Methods with Spark and Kafka
Stream, Stream, Stream: Different Streaming Methods with Spark and KafkaStream, Stream, Stream: Different Streaming Methods with Spark and Kafka
Stream, Stream, Stream: Different Streaming Methods with Spark and Kafka
DataWorks Summit
 
Introduction to NetGuardians' Big Data Software Stack
Introduction to NetGuardians' Big Data Software StackIntroduction to NetGuardians' Big Data Software Stack
Introduction to NetGuardians' Big Data Software Stack
Jérôme Kehrli
 
Resilience: the key requirement of a [big] [data] architecture - StampedeCon...
Resilience: the key requirement of a [big] [data] architecture  - StampedeCon...Resilience: the key requirement of a [big] [data] architecture  - StampedeCon...
Resilience: the key requirement of a [big] [data] architecture - StampedeCon...
StampedeCon
 
Elastic Data Analytics Platform @Datadog
Elastic Data Analytics Platform @DatadogElastic Data Analytics Platform @Datadog
Elastic Data Analytics Platform @Datadog
C4Media
 
Tech Spark Presentation
Tech Spark PresentationTech Spark Presentation
Tech Spark Presentation
Stephen Borg
 
Pyspark presentationsfspfsjfspfjsfpsjfspfjsfpsjfsfsf
Pyspark presentationsfspfsjfspfjsfpsjfspfjsfpsjfsfsfPyspark presentationsfspfsjfspfjsfpsjfspfjsfpsjfsfsf
Pyspark presentationsfspfsjfspfjsfpsjfspfjsfpsjfsfsf
sasuke20y4sh
 
SQL or NoSQL, that is the question!
SQL or NoSQL, that is the question!SQL or NoSQL, that is the question!
SQL or NoSQL, that is the question!
Andraz Tori
 
Data Science at Scale: Using Apache Spark for Data Science at Bitly
Data Science at Scale: Using Apache Spark for Data Science at BitlyData Science at Scale: Using Apache Spark for Data Science at Bitly
Data Science at Scale: Using Apache Spark for Data Science at Bitly
Sarah Guido
 
Hardware Provisioning
Hardware ProvisioningHardware Provisioning
Hardware Provisioning
MongoDB
 
Boston Spark User Group - Spark's Role at MediaCrossing - July 15, 2014
Boston Spark User Group - Spark's Role at MediaCrossing - July 15, 2014Boston Spark User Group - Spark's Role at MediaCrossing - July 15, 2014
Boston Spark User Group - Spark's Role at MediaCrossing - July 15, 2014
gmalouf678
 
The Hitchhiker's Guide to Machine Learning with Python & Apache Spark
The Hitchhiker's Guide to Machine Learning with Python & Apache SparkThe Hitchhiker's Guide to Machine Learning with Python & Apache Spark
The Hitchhiker's Guide to Machine Learning with Python & Apache Spark
Krishna Sankar
 
Ad

Recently uploaded (20)

Perencanaan Pengendalian-Proyek-Konstruksi-MS-PROJECT.pptx
Perencanaan Pengendalian-Proyek-Konstruksi-MS-PROJECT.pptxPerencanaan Pengendalian-Proyek-Konstruksi-MS-PROJECT.pptx
Perencanaan Pengendalian-Proyek-Konstruksi-MS-PROJECT.pptx
PareaRusan
 
CTS EXCEPTIONSPrediction of Aluminium wire rod physical properties through AI...
CTS EXCEPTIONSPrediction of Aluminium wire rod physical properties through AI...CTS EXCEPTIONSPrediction of Aluminium wire rod physical properties through AI...
CTS EXCEPTIONSPrediction of Aluminium wire rod physical properties through AI...
ThanushsaranS
 
AI Competitor Analysis: How to Monitor and Outperform Your Competitors
AI Competitor Analysis: How to Monitor and Outperform Your CompetitorsAI Competitor Analysis: How to Monitor and Outperform Your Competitors
AI Competitor Analysis: How to Monitor and Outperform Your Competitors
Contify
 
Ch3MCT24.pptx measure of central tendency
Ch3MCT24.pptx measure of central tendencyCh3MCT24.pptx measure of central tendency
Ch3MCT24.pptx measure of central tendency
ayeleasefa2
 
Just-In-Timeasdfffffffghhhhhhhhhhj Systems.ppt
Just-In-Timeasdfffffffghhhhhhhhhhj Systems.pptJust-In-Timeasdfffffffghhhhhhhhhhj Systems.ppt
Just-In-Timeasdfffffffghhhhhhhhhhj Systems.ppt
ssuser5f8f49
 
Calories_Prediction_using_Linear_Regression.pptx
Calories_Prediction_using_Linear_Regression.pptxCalories_Prediction_using_Linear_Regression.pptx
Calories_Prediction_using_Linear_Regression.pptx
TijiLMAHESHWARI
 
How to join illuminati Agent in uganda call+256776963507/0741506136
How to join illuminati Agent in uganda call+256776963507/0741506136How to join illuminati Agent in uganda call+256776963507/0741506136
How to join illuminati Agent in uganda call+256776963507/0741506136
illuminati Agent uganda call+256776963507/0741506136
 
Simple_AI_Explanation_English somplr.pptx
Simple_AI_Explanation_English somplr.pptxSimple_AI_Explanation_English somplr.pptx
Simple_AI_Explanation_English somplr.pptx
ssuser2aa19f
 
FPET_Implementation_2_MA to 360 Engage Direct.pptx
FPET_Implementation_2_MA to 360 Engage Direct.pptxFPET_Implementation_2_MA to 360 Engage Direct.pptx
FPET_Implementation_2_MA to 360 Engage Direct.pptx
ssuser4ef83d
 
chapter 4 Variability statistical research .pptx
chapter 4 Variability statistical research .pptxchapter 4 Variability statistical research .pptx
chapter 4 Variability statistical research .pptx
justinebandajbn
 
Molecular methods diagnostic and monitoring of infection - Repaired.pptx
Molecular methods diagnostic and monitoring of infection  -  Repaired.pptxMolecular methods diagnostic and monitoring of infection  -  Repaired.pptx
Molecular methods diagnostic and monitoring of infection - Repaired.pptx
7tzn7x5kky
 
md-presentHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHation.pptx
md-presentHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHation.pptxmd-presentHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHation.pptx
md-presentHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHation.pptx
fatimalazaar2004
 
How iCode cybertech Helped Me Recover My Lost Funds
How iCode cybertech Helped Me Recover My Lost FundsHow iCode cybertech Helped Me Recover My Lost Funds
How iCode cybertech Helped Me Recover My Lost Funds
ireneschmid345
 
EDU533 DEMO.pptxccccvbnjjkoo jhgggggbbbb
EDU533 DEMO.pptxccccvbnjjkoo jhgggggbbbbEDU533 DEMO.pptxccccvbnjjkoo jhgggggbbbb
EDU533 DEMO.pptxccccvbnjjkoo jhgggggbbbb
JessaMaeEvangelista2
 
Minions Want to eat presentacion muy linda
Minions Want to eat presentacion muy lindaMinions Want to eat presentacion muy linda
Minions Want to eat presentacion muy linda
CarlaAndradesSoler1
 
computer organization and assembly language.docx
computer organization and assembly language.docxcomputer organization and assembly language.docx
computer organization and assembly language.docx
alisoftwareengineer1
 
Developing Security Orchestration, Automation, and Response Applications
Developing Security Orchestration, Automation, and Response ApplicationsDeveloping Security Orchestration, Automation, and Response Applications
Developing Security Orchestration, Automation, and Response Applications
VICTOR MAESTRE RAMIREZ
 
Flip flop presenation-Presented By Mubahir khan.pptx
Flip flop presenation-Presented By Mubahir khan.pptxFlip flop presenation-Presented By Mubahir khan.pptx
Flip flop presenation-Presented By Mubahir khan.pptx
mubashirkhan45461
 
03 Daniel 2-notes.ppt seminario escatologia
03 Daniel 2-notes.ppt seminario escatologia03 Daniel 2-notes.ppt seminario escatologia
03 Daniel 2-notes.ppt seminario escatologia
Alexander Romero Arosquipa
 
Template_A3nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn
Template_A3nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnTemplate_A3nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn
Template_A3nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn
cegiver630
 
Perencanaan Pengendalian-Proyek-Konstruksi-MS-PROJECT.pptx
Perencanaan Pengendalian-Proyek-Konstruksi-MS-PROJECT.pptxPerencanaan Pengendalian-Proyek-Konstruksi-MS-PROJECT.pptx
Perencanaan Pengendalian-Proyek-Konstruksi-MS-PROJECT.pptx
PareaRusan
 
CTS EXCEPTIONSPrediction of Aluminium wire rod physical properties through AI...
CTS EXCEPTIONSPrediction of Aluminium wire rod physical properties through AI...CTS EXCEPTIONSPrediction of Aluminium wire rod physical properties through AI...
CTS EXCEPTIONSPrediction of Aluminium wire rod physical properties through AI...
ThanushsaranS
 
AI Competitor Analysis: How to Monitor and Outperform Your Competitors
AI Competitor Analysis: How to Monitor and Outperform Your CompetitorsAI Competitor Analysis: How to Monitor and Outperform Your Competitors

Google Developer Group Lublin 8 - Modern Lambda architecture in Big Data

  • 1. Modern Lambda architecture in Big Data Piotr Hejwowski
  • 2. Hello world :) ■ Who am I ? ■ Java developer working at Codete ■ Keen on Big Data and modern backend approaches ■ Luckily can develop this passion at Codete ■ https://github.com/Hejwo ■ piotr.hejwowski@codete.com ■ Disclaimer 1 - we will use Polish, but with a lot of English, business-specific terms. ■ Disclaimer 2 - The discipline is large, so we are going to cover only the bigger picture ■ Disclaimer 3 - Live coding ? Next time ■ Disclaimer 4 - From zero to hero style
  • 3. Recap & Intro ■ Recap - at the end of the last GDG we were talking about Machine Learning ■ We talked about the difference between Data Science and Big Data, which are often confused Recap Data science ■ Data science, also known as data-driven science, is an interdisciplinary field about scientific methods and processes to extract knowledge or insights from data in various forms, either structured or unstructured ■ Data science focuses on cleaning gathered data, math, statistics, business understanding and extracting valuable information Big data ■ Modern methods of gathering and processing big volumes of data ■ More info in the next 40 mins ;)
  • 6. What’s Big Data ? ■ The amount of our data is getting larger and larger ■ The Internet of Things plays an important role in it -> sensors, sensors are everywhere ! ■ At some point EVEN business guys discovered that there’s great value behind unstructured data ■ ETLs on a massive scale ■ Recommendation systems based on FB likes ■ Analysing user traffic on e-shops and optimizing content ■ Raw data from car sensors ■ Optimizing traffic like in Lublin :) ■ The POTENTIAL and AMOUNT of data that we need is HUGE ■ Fun fact - having raw data means that we don’t know what we’re looking for and that’s great !!! ■ Discovering new relations in our data
  • 8. How to process Big Data ? Moore’s law is dying [*] “Moore's law is the observation that the number of transistors in a dense integrated circuit doubles approximately every two years” ■ Right now every step of progress in transistor technology is getting more and more expensive. ■ New processors are getting more and more expensive. ■ Until now we could rely on Moore's law: if our infrastructure was not doing well, after two years we could buy faster hardware at approximately the same cost. ■ But… we still have many cores. But… sometimes distributing work across many cores is still not enough.
  • 9. How to process Big Data ? - Scale up vs. Scale out Scale up ■ Costly components ■ Complex application/system logic, often multithreaded ■ Poor fault-tolerance ■ Machine is getting hot as Mordor. Scale out ■ Cheaper machines ■ Easier application and system logic ■ Thanks to orchestration tools such as Mesos and Kubernetes it’s not THAT hard to maintain. ■ Fault-tolerance - if half of our machines explode we can still do something ■ Needs data centers :(
  • 10. How to process Big Data ? - Scale up vs. Scale out
  • 11. Meet Apache Spark - Big Data processing engine !
  • 12. Meet Apache Spark - Big Data processing engine ! ■ Created at UC Berkeley ■ At the beginning it was a Proof of Concept for Mesos cluster management ■ Much faster than its father - Hadoop ■ By default it operates in memory. ■ No frequent disk writes means more speed ■ Rich and simple caching mechanism ■ There are a ton of other Big Data processing engines - Hadoop, Storm, Flink, Splunk ■ We're gonna focus on Spark due to time
  • 13. Meet Apache Spark - Big Data processing engine !
  • 14. Is Big Data processing THE only direction ? Spark is faster than Hadoop, but still… it’s heavy machinery
  • 15. Is Big Data THE only direction ? Reactive Manifesto ■ Responsive - What happens when Wifi is down ? Users want FAST responses ! ■ Elastic - Large systems tend to have frequent, massive loads ■ Resilient - The system must stay available and any kind of response is better than no response. ■ Message Driven - isolation and non-blocking behaviour are achieved via async communication. Thanks to that we have clear boundaries, isolation, transparency.
  • 16. How to achieve these two goals ? Let’s go lambda ! ? ?
  • 17. Meet our system’s heart - Apache Kafka ■ Lightning-fast messaging system ■ Heart of a Big Data system ■ Distributed ■ Built by LinkedIn ■ Written in Scala ■ Producers and Consumers concept ■ Auto recovery, Brokers detection
  • 18. Meet our system’s heart - Apache Kafka
  • 19. Meet our system’s heart - Apache Kafka
  • 20. We’ve got two parts of the puzzle ! ?
  • 21. Spark Streaming - when batch is not enough
  • 22. Spark Streaming - when batch is not enough ■ Spark Streaming makes it easy to build scalable fault-tolerant streaming applications. ■ By running on Spark, Spark Streaming lets you reuse the same code for batch processing, join streams against historical data, or run ad-hoc queries on stream state. ■ Used as rapid micro-batching ■ Before Spark Streaming, building complex pipelines that encompass streaming, batch, or even machine learning capabilities with open source software meant dealing with multiple frameworks ■ Streaming ETL – Data is continuously cleaned and aggregated before being pushed into data stores. No more SAP. ■ Triggers – Anomalous behavior is detected in real-time and further downstream actions are triggered accordingly. ■ Data enrichment – Live data is enriched with more information by joining it with a static dataset allowing for a more complete real-time analysis.
  • 23. We’re done with the puzzle !
  • 24. Now… let’s store it ! NoSql store it ! Why NoSQL ? ■ Large datasets ■ Easy to scale out ■ Less schema validation on write means faster writes ■ Schemaless databases can be of great value in Big Data, although we sometimes don’t know what we need and we want our data to be dirty.
  • 25. Now… let’s store it ! NoSql store it !
  • 26. Now… let’s store it ! NoSql store it !
  • 27. Why is it modern ? ■ Fast, reliable ■ More like - write once, run everywhere thanks to Spark, Spark Shell, Zeppelin ■ Less code (Hadoop’s MapReduce ? It’s an essay) ■ Compared to the older approach - less chaos thanks to Kafka.
  • 28. Cons ■ More like micro-batching, not real time ■ A lot of stuff is still evolving (Spark, Kafka) and hasn’t got professional customer support ■ Things tend to get complicated when Kafka messages within a single topic evolve ■ DevOps needed, strong developers needed ■ The distributed world is a complicated world ■ Thousands of frameworks and ideas every year
  • 29. What next ? Apache Spark resources : ■ http://spark.apache.org/ ■ https://hortonworks.com/tutorials/ ■ https://codete.com/blog/ Apache Kafka resources : ■ http://spark.apache.org/ ■ http://blog.cloudera.com/blog/2014/09/apache-kafka-for-beginners/ NoSql resources : ■ http://openmymind.net/2011/8/15/How-You-Should-Go-About-Learning-NoSQL/
  • 30. Sources ■ Internet in a minute : http://www.visualcapitalist.com/what-happens-internet-minute-2016/ ■ Big Data and V4’s : https://www.linkedin.com/pulse/20140306073407-64875646-big-data-the-5-vs-everyone-must-know ■ Moore’s law : https://en.wikipedia.org/wiki/Moore%27s_law ■ Apache Spark : http://horicky.blogspot.com/2013/12/spark-low-latency-massively-parallel.html ■ Apache Kafka : https://softwareengineeringdaily.com/2015/08/06/kafka-with-guozhang-wang/ ■ Spark Streaming : http://ingest.tips/2015/06/24/real-time-analytics-with-kafka-and-spark-streaming/ ■ NoSQL : https://www.getfilecloud.com/blog/2014/08/leading-nosql-databases-to-consider/
  • 31. Thank you ? No. Thank YOU
  • 32. Spark ? Interesting alternative to ETL Hell ■ SAP, SAS, Elixir ■ ETL has nice visual building blocks, but this means.... ■ … Click, Click, Click, Click… (RSI danger !) ■ Building blocks mean that plain-text code is hidden in stages. Hard to debug, hard to unit test. ■ Waste of resources. ETL jobs are fired at night, when we have peak performance. After that the resources sit unused. ■ Data is getting out of sync. So the ETL pipeline gets out of sync. ■ In the Big Data world we have Apache Avro for schema registry ■ Big Data can handle more ■ Legacy code ■ $$$ It’s for FREE $$$ ■ We can throw Machine Learning into it and do interesting things. Not only batches. ■ Lacks professional support ■ Big Data is not that mature ■ Let’s see what will happen here
  • 33. Apache Kafka vs. Rabbit MQ
  • 34. Apache Kafka vs. Rabbit MQ
  • 35. Apache Kafka vs. Rabbit MQ Kafka : ■ + Fire hose of events (100k+/sec) ■ + Ability to re-read messages (good for CQRS) ■ + Scale out ■ + Confluent -> Kafka Connect, Kafka Streams, Schema Registry ■ - You must not mind supporting it on your own ■ - No AMQP and no complex routing RabbitMQ : ■ + Messages may be routed to consumers in complex ways ■ + Mature - You like yelling at support guys rather than fixing it yourself ? Place for you ! ■ + Scale out ■ - (20k+/sec) messages ■ - Messages are deleted after consumers ack
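
The last comparison turns on one difference: a log lets a consumer re-read messages, while a queue drops a message once it is acknowledged. Below is a minimal sketch of that contrast in plain Python. It is a toy in-memory model, not the Kafka or RabbitMQ client API; the class names `ReplayableLog` and `AckQueue` and all their methods are hypothetical and exist only for illustration.

```python
from collections import deque


class ReplayableLog:
    """Toy append-only log: records stay put, each consumer tracks its own offset."""

    def __init__(self):
        self._records = []
        self._offsets = {}  # consumer name -> next offset to read

    def produce(self, record):
        self._records.append(record)

    def poll(self, consumer, max_records=10):
        # Reading only advances this consumer's offset; nothing is deleted.
        start = self._offsets.get(consumer, 0)
        batch = self._records[start:start + max_records]
        self._offsets[consumer] = start + len(batch)
        return batch

    def seek(self, consumer, offset):
        # Rewinding is possible because the records are still in the log.
        self._offsets[consumer] = offset


class AckQueue:
    """Toy work queue: a message is gone once a consumer has taken and acked it."""

    def __init__(self):
        self._pending = deque()

    def publish(self, message):
        self._pending.append(message)

    def consume_and_ack(self):
        # Taking the message removes it for every consumer.
        return self._pending.popleft() if self._pending else None


log = ReplayableLog()
queue = AckQueue()
for event in ["click", "view", "buy"]:
    log.produce(event)
    queue.publish(event)

first_pass = log.poll("analytics")
log.seek("analytics", 0)             # rewind and read the same events again
second_pass = log.poll("analytics")
other_consumer = log.poll("billing")  # an independent consumer starts from 0

drained = [queue.consume_and_ack() for _ in range(3)]
after_drain = queue.consume_and_ack()  # nothing left to re-read
```

In this sketch the "analytics" consumer sees the same three events twice and "billing" sees them independently, which is the property that makes a log convenient for rebuilding read models in CQRS. The queue, once drained, has nothing left to replay.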