Google Developer Group Lublin 8 - Modern Lambda architecture in Big Data
Presentation :
https://www.youtube.com/watch?v=qmlzrCMT6Wc
Given at Data Day Texas 2016.
Apache Spark has been hailed as a trail-blazing new tool for doing distributed data science. However, since it's so new, it can be difficult to set up and hard to use. In this talk, I'll discuss the journey I've had using Spark for data science at Bitly over the past year. I'll talk about the benefits of using Spark, the challenges I've had to overcome, the caveats for using a cutting-edge technology such as this, and my hopes for the Spark project as a whole.
The document discusses recent developments and trends in IT, including virtual technologies, intelligent machines, analytics, wearable systems, 3D printing, context-aware systems, cloud computing, programming languages, databases, and mobile development. Emerging technologies mentioned include the Internet of Things, which will generate large data pools, smart machines learning on their own, open source/data trends, and HTML5 bringing new capabilities to smartphones. Programming languages discussed include jQuery, CSS, Java, PHP, .NET, Python, and frameworks like Ruby on Rails and AngularJS.
JPoint'15 Mom, I so wish Hibernate for my NoSQL database...Alexey Zinoviev
Alexey Zinoviev presented this talk at the JPoint'15 conference: javapoint.ru/talks/#zinoviev.
The talk covers the following topics: Java, JPA, Morphia, Hibernate OGM, Spring Data, Hector, Kundera, NoSQL, Mongo, Cassandra, HBase, Riak
1) Databases have been around for a long time but newer non-relational databases are gaining popularity. However, databases still have issues that can "kill you" like fragile replication and poor failover support.
2) The CAP theorem states that a distributed system cannot simultaneously provide consistency, availability, and partition tolerance. This introduced needed realism that different data stores solve different problems depending on their focus on two of these areas.
3) New non-relational databases like MongoDB provide better solutions for issues like replication and failover that were weaknesses of relational databases. Document stores in particular are good alternatives when data is not truly relational.
Big Data in 200 km/h | AWS Big Data Demystified #1.3 Omid Vahdaty
What we're about
A while ago I entered the challenging world of Big Data. As an engineer, at first I was not so impressed with this field. As time went by, I realised more and more that the technological challenges in this area are too great for one person to master. Just look at the picture in this article; it only covers a small fraction of the technologies in the Big Data industry…
Consequently, I created a meetup detailing all the challenges of Big Data, especially in the world of cloud. I am using AWS infrastructure to answer the basic questions of anyone starting their way in the big data world.
How to transform data (TXT, CSV, TSV, JSON) into Parquet or ORC?
Which technology should we use to model the data? EMR? Athena? Redshift? Spectrum? Glue? Spark? SparkSQL?
How to handle streaming?
How to manage costs?
Performance tips?
Security tips?
Cloud best practices?
Some of our online materials:
Website:
https://big-data-demystified.ninja/
Youtube channels:
https://www.youtube.com/channel/UCzeGqhZIWU-hIDczWa8GtgQ?view_as=subscriber
https://www.youtube.com/channel/UCMSdNB0fGmX5dXI7S7Y_LFA?view_as=subscriber
Meetup:
https://www.meetup.com/AWS-Big-Data-Demystified/
https://www.meetup.com/Big-Data-Demystified
Facebook Group :
https://www.facebook.com/groups/amazon.aws.big.data.demystified/
Facebook page (https://www.facebook.com/Amazon-AWS-Big-Data-Demystified-1832900280345700/)
Audience:
Data Engineers
Data Scientists
DevOps Engineers
Big Data Architects
Solution Architects
CTO
VP R&D
How to create a personal knowledge graph IBM Meetup Big Data Madrid 2017Juantomás García Molina
This document discusses how to create a personal knowledge graph. It begins by explaining why a knowledge graph is needed, as the speaker manages a lot of information from different sources and needs a way to organize and query it. It then discusses how to build a knowledge graph using concepts like explicit and implicit information, graph databases, and collective intelligence. The speaker advocates using cloud services, containers, notebooks and machine learning to build the knowledge graph. The first steps proposed are to name the project "Boosterme" and start a GitHub repository.
Developing high frequency indicators using real time tick data on apache supe...Zekeriya Besiroglu
This document summarizes the Central Bank of Turkey's project to develop high frequency market indicators using real-time tick data from the Thomson Reuters Enterprise Platform. It describes how they set up Apache Kafka, Druid, Spark and Superset on Hadoop to ingest, store, analyze and visualize the data. Their goal was to observe foreign exchange markets in real-time to detect risks and patterns. The architecture evolved over three phases from an initial test cluster to integrating Druid and Hive for improved querying and scaling to production. Work is ongoing to implement additional indicators and integrate historical data for enhanced analysis.
Big Data Anti-Patterns: Lessons From the Front LineDouglas Moore
This document summarizes common anti-patterns in big data projects based on lessons learned from working with over 50 clients. It identifies anti-patterns in hardware and infrastructure, tooling, and big data warehousing. Specifically, it discusses issues with referencing outdated architectures, using tools improperly for the workload, and de-normalizing schemas without understanding the implications. The document provides recommendations to instead co-locate data and computing, choose the right tools for each job, and deploy solutions matching the intended workload.
This is the presentation I gave at the Data Science KC meet up on 25Aug14. Video can be found at: https://www.youtube.com/watch?v=g1Tu7kUlNa4
Tonight's Meetup will discuss testing big data in AWS. The presenter Michael Sexton will cover what big data is in AWS, types of testing including functional and performance testing, and challenges of testing large varied datasets in the cloud. Attendees will learn about services AWS provides for big data, how to test applications that handle data, and tools for automating testing including AWS services and Python scripts. There will be time for questions at the end.
This document discusses Real Time Log Analytics using the ELK (Elasticsearch, Logstash, Kibana) stack. It provides an overview of each component, including Elasticsearch for indexing and searching logs, Logstash for collecting, parsing, and enriching logs, and Kibana for visualizing and analyzing logs. It describes common use cases for log analytics like issue debugging and security analysis. It also covers challenges like non-consistent log formats and decentralized logs. The document includes examples of log entries from different systems and how ELK addresses issues like scalability and making logs easily searchable and reportable.
Big Data Day LA 2015 - Machine Learning on Largish Data by Szilard Pafka of E...Data Con LA
This talk addresses one apparently simple question: which open source machine learning tools can one use out of the box to do binary classification (probably the most common machine learning problem) on largish datasets? We will therefore discuss the speed, scalability and accuracy of the top (most accurate) learning algorithms in R packages, Python's scikit-learn, H2O, xgboost and Spark's MLlib.
Big Data Processing Using Hadoop InfrastructureDmitry Buzdin
The document discusses using Hadoop infrastructure for big data processing. It describes Intrum Justitia SDC, which has data across 20 countries in various formats and a high number of data objects. Hadoop provides solutions like MapReduce and HDFS for distributed storage and processing at scale. The Hadoop ecosystem includes tools like Hive, Pig, HBase, Impala and Oozie that help process and analyze large datasets. Examples of using Hadoop with Java and integrating it into development environments are also included.
Big Data, Hadoop, Flume, Spark, Cloudera, Oracle Big Data Appliance, Oracle Loader for Hadoop, big data copy, Exadata to Big Data Appliance. Bilginc IT Academy.
Impala turbocharge your big data accessOphir Cohen
This document discusses challenges with accessing large amounts of data stored in Hadoop and introduces Impala as a solution. It notes that LivePerson stores over 13 terabytes of data per month in its over 1 petabyte Hadoop cluster. Traditional MapReduce jobs in Java can take hours or days to access this data, and Hive provides slower access with a SQL-like language. Impala provides interactive queries 4 to 30 times faster than Hive by bypassing MapReduce and using a scalable parallel database integrated with Hadoop formats and infrastructure.
Apache Spark is rapidly emerging as the prime platform for advanced analytics in Hadoop. This briefing is updated to reflect news and announcements as of July 2014.
Chris Curtin discusses Silverpop's journey with Hadoop. They initially used Hadoop to build flexible reports on customer data despite varying schemas. This helped with queries but was difficult to maintain. They then used Cascading to dynamically define schemas and job steps. Next, they profiled customer interactions over time which challenged Hadoop due to many small files and lack of appending. They switched to MapR which helped but recovery remained an issue. Current work includes optimizing imports, packaging the solution, and watching new real-time Hadoop technologies. The main challenges have been helping customers understand and use insights from large and complex data.
Building scalable data pipelines for big data involves dealing with legacy systems, implementing data lineage and provenance, managing the data lifecycle, and engineering pipelines that can handle large volumes of data. Effective data pipeline engineering requires understanding how to extract, transform and load data while addressing issues like privacy, security, and integrating diverse data sources. Frameworks like Cascading can help build pipelines, but proper testing and scaling is also required to develop robust solutions.
This document summarizes a presentation about Kappa Architecture 2.0 given by Juantomás García. It discusses the origins of Kappa Architecture as coined by Jay Kreps in 2014 for handling real-time data. It then outlines how Kappa Architecture uses tools like Apache Kafka, Apache Spark, and Apache Samza to handle real-time and batch data processing. The presentation also provides examples of how Kappa Architecture has been applied to a use case of monitoring car data in real-time and scaling REST APIs. It concludes by discussing future improvements and variations of Kappa Architecture.
Microservices, containers, and machine learningPaco Nathan
http://www.oscon.com/open-source-2015/public/schedule/detail/41579
In this presentation, an open source developer community considers itself algorithmically. This shows how to surface data insights from the developer email forums for just about any Apache open source project. It leverages advanced techniques for natural language processing, machine learning, graph algorithms, time series analysis, etc. As an example, we use data from the Apache Spark email list archives to help understand its community better; however, the code can be applied to many other communities.
Exsto is an open source project that demonstrates Apache Spark workflow examples for SQL-based ETL (Spark SQL), machine learning (MLlib), and graph algorithms (GraphX). It surfaces insights about developer communities from their email forums. Natural language processing services in Python (based on NLTK, TextBlob, WordNet, etc.) get containerized and used to crawl and parse email archives. These produce JSON data sets; then we run machine learning on a Spark cluster to surface insights such as:
* What are the trending topic summaries?
* Who are the leaders in the community for various topics?
* Who discusses most frequently with whom?
This talk shows how to use cloud-based notebooks for organizing and running the analytics and visualizations. It reviews the background for how and why the graph analytics and machine learning algorithms generalize patterns within the data, based on open source implementations for two advanced approaches, Word2Vec and TextRank. The talk also illustrates best practices for leveraging functional programming for big data.
Big Data is changing abruptly, and where it is likely headingPaco Nathan
Big Data technologies are changing rapidly due to shifts in hardware, data types, and software frameworks. Incumbent Big Data technologies do not fully leverage newer hardware like multicore processors and large memory spaces, while newer open source projects like Spark have emerged to better utilize these resources. Containers, clouds, functional programming, databases, approximations, and notebooks represent significant trends in how Big Data is managed and analyzed at large scale.
introduction to Neo4j (Tabriz Software Open Talks)Farzin Bagheri
This document provides an overview of Neo4j, a graph database. It begins with definitions of relational and NoSQL databases, categorizing NoSQL into key-value, document, column-oriented, and graph databases. Graph databases are explained to contain nodes, relationships, and properties. Neo4j is introduced as an example graph database, with Cypher listed as its query language. Examples of using Cypher to create nodes and relationships are provided. Finally, potential uses of Neo4j are listed, including social networks, network analysis, recommendations, and more.
Amazon aws big data demystified | Introduction to streaming and messaging flu...Omid Vahdaty
This document provides an overview of streaming data and messaging concepts including batch processing, streaming, streaming vs messaging, challenges with streaming data, and AWS services for streaming and messaging like Kinesis, Kinesis Firehose, SQS, and Kafka. It discusses use cases and comparisons for these different services. For example, Kinesis is suitable for complex analytics on streaming data while SQS focuses on per-event messaging. Firehose automatically loads streaming data into AWS services like S3 and Redshift without custom coding.
Big Data in the Cloud - Montreal April 2015Cindy Gross
slides:
Basic Big Data and Hadoop terminology
What projects fit well with Hadoop
Why Hadoop in the cloud is so Powerful
Sample end-to-end architecture
See: Data, Hadoop, Hive, Analytics, BI
Do: Data, Hadoop, Hive, Analytics, BI
How this tech solves your business problems
SnappyData is a new open source project started by Pivotal GemFire founders to provide a unified platform for OLTP, OLAP and streaming analytics using Spark. It aims to simplify big data architectures by supporting mixed workloads in a single clustered database, allowing for real-time operational analytics on live data without copying to other systems. This provides faster insights than current approaches that require periodic data copying between different databases and analytics systems.
Big data refers to large, complex datasets that are difficult to process using traditional methods. This document discusses three examples of real-world big data challenges and their solutions. The challenges included storage, analysis, and processing capabilities given hardware and time constraints. Solutions involved switching databases, using Hadoop/MapReduce, and representing complex data structures to enable analysis of terabytes of ad serving data. Flexibility and understanding domain needs were key to feasible versus theoretical solutions.
Stream, Stream, Stream: Different Streaming Methods with Apache Spark and KafkaDatabricks
At NMC (Nielsen Marketing Cloud) we provide our customers (marketers and publishers) real-time analytics tools to profile their target audiences. To achieve that, we need to ingest billions of events per day into our big data stores, and we need to do it in a scalable yet cost-efficient manner.
In this session, we will discuss how we continuously transform our data infrastructure to support these goals. Specifically, we will review how we went from CSV files and standalone Java applications all the way to multiple Kafka and Spark clusters, performing a mixture of Streaming and Batch ETLs, and supporting 10x data growth. We will share our experience as early-adopters of Spark Streaming and Spark Structured Streaming, and how we overcame technical barriers (and there were plenty). We will present a rather unique solution of using Kafka to imitate streaming over our Data Lake, while significantly reducing our cloud services' costs. Topics include:
Kafka and Spark Streaming for stateless and stateful use-cases
Spark Structured Streaming as a possible alternative
Combining Spark Streaming with batch ETLs
"Streaming" over Data Lake using Kafka
NoSQL databases should not be chosen just because a system is slow or to replace RDBMS. The appropriate choice depends on factors like the nature of the data, how the data scales, and whether ACID properties are needed. NoSQL databases are categorized by data model (document, column family, graph, key-value store) which affects querying. Other considerations include scalability based on the CAP theorem and operational factors like the distribution model and whether there is a single point of failure. The best choice depends on the specific requirements and risks losing data if chosen incorrectly.
Big Data! Great! Now What? #SymfonyCon 2014Ricard Clau
Big Data is one of the new buzzwords in the industry. Everyone is using NoSQL databases. MySQL is not cool anymore. But... do we really have big data? Where should we store it? Are the traditional RDBMS databases dead? Is NoSQL the solution to our problems? And most importantly, how can PHP and Symfony2 help with it?
Architecting Big Data Ingest & ManipulationGeorge Long
Here's the presentation I gave at the KW Big Data Peer2Peer meetup held at Communitech on 3rd November 2015.
The deck served as a backdrop to the interactive session
http://www.meetup.com/KW-Big-Data-Peer2Peer/events/226065176/
The scope was to drive an architectural conversation about :
o What it actually takes to get the data you need to add that one metric to your report/dashboard?
o What's it like to navigate the early conversations of an analytic solution?
o How is one technology selected over another and how do those selections impact or define other selections?
The document provides an overview of the Spark framework for lightning fast cluster computing. It discusses how Spark addresses limitations of MapReduce-based systems like Hadoop by enabling interactive queries and iterative jobs through caching data in-memory across clusters. Spark allows loading datasets into memory and querying them repeatedly for interactive analysis. The document covers Spark's architecture, use of resilient distributed datasets (RDDs), and how it provides a unified programming model for batch, streaming, and interactive workloads.
MongoDB World 2019: MongoDB Cluster Design: From Redundancy to GDPRMongoDB
No matter if you're thinking of migrating to MongoDB, or need to meet legal requirements for an existing on-prem cluster, this talk has you covered. We start with the basics of replication and sharding and quickly scale up, covering everything you need to know to control your data, and keep it safe from unexpected data loss or downtime - a well-designed MongoDB cluster should have no single point of failure. Learn how others are “stretching” what’s possible but why you shouldn't! I'll present real-world examples from my life in the field in Europe and beyond.
Stream, Stream, Stream: Different Streaming Methods with Spark and KafkaDataWorks Summit
At NMC (Nielsen Marketing Cloud) we provide our customers (marketers and publishers) real-time analytics tools to profile their target audiences.
To achieve that, we need to ingest billions of events per day into our big data stores, and we need to do it in a scalable yet cost-efficient manner.
In this session, we will discuss how we continuously transform our data infrastructure to support these goals.
Specifically, we will review how we went from CSV files and standalone Java applications all the way to multiple Kafka and Spark clusters, performing a mixture of Streaming and Batch ETLs, and supporting 10x data growth.
We will share our experience as early-adopters of Spark Streaming and Spark Structured Streaming, and how we overcame technical barriers (and there were plenty...).
We will present a rather unique solution of using Kafka to imitate streaming over our Data Lake, while significantly reducing our cloud services' costs.
Topics include :
* Kafka and Spark Streaming for stateless and stateful use-cases
* Spark Structured Streaming as a possible alternative
* Combining Spark Streaming with batch ETLs
* "Streaming" over Data Lake using Kafka
Introduction to NetGuardians' Big Data Software StackJérôme Kehrli
NetGuardians runs its Big Data Analytics Platform on three key Big Data components: ElasticSearch, Apache Mesos and Apache Spark. This is a presentation of the behaviour of this software stack.
Resilience: the key requirement of a [big] [data] architecture - StampedeCon...StampedeCon
From the StampedeCon 2015 Big Data Conference: There is an adage, "If you fail to plan, you plan to fail". When developing systems the adage can be taken a step further: "If you fail to plan FOR FAILURE, you plan to fail". At Huffington Post data moves between a number of systems to provide statistics for our technical, business, and editorial teams. Due to the mission-critical nature of our data, considerable effort is spent building resiliency into processes.
This talk will focus on designing for failure. Some material will focus on understanding the traits of specific distributed systems, such as message queues or NoSQL databases, and the consequences of different types of failures. Other parts of the presentation will focus on how systems and software can be designed to make re-processing batch data simple, or how to determine which failure-mode semantics are important for a real-time event processing system.
Video and slides synchronized, mp3 and slide download available at URL http://bit.ly/2l2Rr6L.
Doug Daniels discusses the cloud-based platform they have built at DataDog and how it differs from a traditional datacenter-based analytics stack. He walks through the decisions they have made at each layer, covers the pros and cons of these decisions and discusses the tooling they have built. Filmed at qconsf.com.
Doug Daniels is a Director of Engineering at Datadog, where he works on high-scale data systems for monitoring, data science, and analytics. Prior to joining Datadog, he was CTO at Mortar Data and an architect and developer at Wireless Generation, where he designed data systems to serve more than 4 million students in 49 states.
This document discusses using big data tools to build a fraud detection system. It outlines using Azure infrastructure to set up a Hadoop cluster with HDFS, HBase, Kafka and Spark. Mock transaction data will be generated and sent to Kafka. Spark jobs will process the data in batches, identifying potentially fraudulent transactions and writing them to an HBase table. The data will be visualized using Zeppelin notebooks querying Phoenix SQL on HBase. This will allow analysts to further investigate potential fraud patterns in near real-time.
The document discusses SQL versus NoSQL databases. It provides background on SQL databases and their advantages, then explains why some large tech companies have adopted NoSQL databases instead. Specifically, it describes how companies like Amazon, Facebook, and Google have such massive amounts of data that traditional SQL databases cannot adequately handle the scale, performance, and flexibility needs. It then summarizes some popular NoSQL databases like Cassandra, Hadoop, MongoDB that were developed to solve the challenges of scaling to big data workloads.
Data Science at Scale: Using Apache Spark for Data Science at BitlySarah Guido
Given at Data Day Seattle 2015.
Bitly generates over 9 billion clicks on shortened links a month, as well as over 100 million unique link shortens. Analyzing data of this scale is not without its challenges. At Bitly, we have started adopting Apache Spark as a way to process our data. In this talk, I’ll elaborate on how I use Spark as part of my data science workflow. I’ll cover how Spark fits into our existing architecture, the kind of problems I’m solving with Spark, and the benefits and challenges of using Spark for large-scale data science.
This document provides an overview of architecting a first big data implementation. It defines key concepts like Hadoop, NoSQL databases, and real-time processing. It recommends asking questions about data, technology stack, and skills before starting a project. Distributed file systems, batch tools, and streaming systems like Kafka are important technologies for big data architectures. The document emphasizes moving from batch to real-time processing as a major opportunity.
This document discusses hardware provisioning best practices for MongoDB. It covers key concepts like bottlenecks, working sets, and replication vs sharding. It also presents two case studies where these concepts were applied: 1) For a Spanish bank storing logs, the working set was 4TB so they provisioned servers with at least that much RAM. 2) For an online retailer storing products, testing found the working set was 270GB, so they recommended a replica set with 384GB RAM per server to avoid complexity of sharding. The key lessons are to understand requirements, test with a proof of concept, measure resource usage, and expect that applications may become bottlenecks over time.
The Hitchhiker's Guide to Machine Learning with Python & Apache SparkKrishna Sankar
Presentation for my tutorial at Big Data Tech Con http://goo.gl/4Pcvuq
It is a work in progress. I will update it with daily snapshots until done.
2. Hello world :)
■ Who am I ?
■ Java developer working at Codete
■ Keen on Big Data and modern backend approaches
■ Luckily, I can pursue this passion at Codete
■ https://github.com/Hejwo
■ [email protected]
■ Disclaimer 1 - the talk will be in Polish, but with a lot of English, business-specific terms.
■ Disclaimer 2 - the field is broad, so we will only cover the bigger picture
■ Disclaimer 3 - Live coding ? Next time
■ Disclaimer 4 - From zero to hero style
3. Recap & Intro
■ Recap - at the end of the last GDG meetup we were talking about Machine Learning
■ We'll talk about the difference between Data Science and Big Data, which are often confused
Recap
Data science
■ Data science, also known as data-driven science, is an interdisciplinary field about scientific methods and processes to extract knowledge or insights from data in various forms, either structured or unstructured
■ Data science focuses on cleaning the gathered data, math, statistics, business understanding and extracting valuable information
Big data
■ Modern methods of gathering and processing large volumes of data
■ More info in next 40 mins ;)
6. What’s Big Data ?
■ The amount of data we produce is getting larger and larger
■ The Internet of Things plays an important role here -> sensors, sensors are everywhere !
■ At some point EVEN business guys discovered that there's great value behind unstructured data
■ ETLs on a massive scale
■ Recommendation systems based on FB likes
■ Analysing user traffic on e-shops and optimizing contents
■ Raw data from car’s sensors
■ Optimizing traffic like in Lublin :)
■ POTENTIAL and AMOUNT of data that we need is HUGE
■ Fun fact - having raw data means we don't have to know up front what we're looking for, and that's great !!!
■ Discovering new relations in our data
8. How to process Big Data ?
Moore’s law is dying [*]
“Moore's law is the observation that the number of transistors in a dense integrated circuit doubles approximately every two years”
■ Right now every further step in transistor scaling is getting more and more expensive.
■ New processors are getting more and more expensive.
■ Until now we could rely on Moore's law: if our infrastructure was struggling, after two years we could buy faster hardware at approximately the same cost.
■ But… we still have many cores. And… sometimes distributing work across the cores of a single machine is still not enough.
9. How to process Big Data ? - Scale up vs. Scale out
Scale up
■ Costly components
■ Complex application/system logic, often multithreaded
■ Poor fault-tolerance
■ Machine is getting hot as Mordor.
Scale out
■ Cheaper machines
■ Simpler application and system logic
■ Thanks to orchestration tools such as Mesos or Kubernetes it's not THAT hard to maintain.
■ Fault-tolerance - if half of our machines explode, we can still do something
■ Needs data centers :(
12. Meet Apache Spark - Big Data processing engine !
■ Created at UC Berkeley
■ At the beginning it was a proof of concept for the Mesos cluster manager
■ Much faster than its predecessor - Hadoop MapReduce
■ By default it operates in memory
■ Fewer disk writes mean more speed
■ Rich and simple caching mechanism
■ There are tons of other Big Data processing engines - Hadoop, Storm, Flink, Splunk
■ We're going to focus on Spark due to time (see the minimal sketch below)
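To make the "in-memory, cached" point concrete, here is a minimal word-count sketch using the Spark 2.x Java API. The input file name, app name and local master are my placeholders, not something from the deck; on a real cluster the master would come from spark-submit.

```java
import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class WordCount {
    public static void main(String[] args) {
        // Local mode for the demo; on a cluster the master is set by spark-submit
        SparkConf conf = new SparkConf().setAppName("gdg-wordcount").setMaster("local[*]");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // "events.txt" is a placeholder input file
        JavaRDD<String> lines = sc.textFile("events.txt");

        // cache() keeps the RDD in memory, so repeated actions skip re-reading the disk
        JavaRDD<String> words = lines
                .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator())
                .cache();

        JavaPairRDD<String, Integer> counts = words
                .mapToPair(word -> new Tuple2<>(word, 1))
                .reduceByKey(Integer::sum);

        counts.take(10).forEach(t -> System.out.println(t._1() + " -> " + t._2()));
        sc.stop();
    }
}
```

The cache() call is what the "rich and simple caching mechanism" bullet refers to: once the RDD is materialized in memory, further queries over it avoid going back to disk.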
14. Is Big Data processing THE only direction ?
Spark is faster than Hadoop, but still… it’s heavy machinery
15. Is Big Data THE only direction ?
Reactive Manifesto
■ Responsive - what happens when the Wifi is down ? Users want FAST responses !
■ Elastic - large systems tend to experience frequent, massive load spikes
■ Resilient - the system must stay available, and any kind of response is better than no response.
■ Message Driven - isolation and non-blocking behaviour are achieved via async communication. Thanks to that we get clear boundaries, isolation and transparency.
17. Meet our systems heart - Apache Kafka
■ Lightning-fast messaging system
■ The heart of a Big Data system
■ Distributed
■ Built by LinkedIn
■ Written in Scala
■ Producers and Consumers concept (see the producer sketch below)
■ Auto-recovery, broker detection
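A minimal producer sketch in Java to illustrate the producer/consumer concept above. The broker address, topic name and event payloads are assumptions for this example, not values from the slides.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class SensorEventProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder broker
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());
        // acks=all waits for in-sync replicas, trading latency for durability
        props.put("acks", "all");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            for (int i = 0; i < 10; i++) {
                // The key controls partitioning, so events from one sensor stay ordered
                producer.send(new ProducerRecord<>("sensor-events", "sensor-" + (i % 3),
                        "{\"temperature\": " + (20 + i) + "}"));
            }
            producer.flush();
        }
    }
}
```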
22. Spark Streaming - when batch is not enough
■ Spark Streaming makes it easy to build scalable fault-tolerant streaming applications.
■ By running on Spark, Spark Streaming lets you reuse the same code for batch processing, join streams against historical data, or run
ad-hoc queries on stream state.
■ Used for rapid micro-batching (see the sketch after this list)
■ Before Spark Streaming, building complex pipelines that encompass streaming, batch, or even machine learning capabilities with open
source software meant dealing with multiple frameworks
■ Streaming ETL – Data is continuously cleaned and aggregated before being pushed into data stores. No more SAP.
■ Triggers – Anomalous behavior is detected in real-time and further downstream actions are triggered accordingly.
■ Data enrichment – Live data is enriched with more information by joining it with a static dataset allowing for a more complete real-time
analysis.
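A sketch of the Kafka -> Spark Streaming micro-batching described above, using the spark-streaming-kafka-0-10 integration in Java. The topic name, broker address and the 5-second batch interval are assumptions for illustration, not taken from the deck.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.kafka010.ConsumerStrategies;
import org.apache.spark.streaming.kafka010.KafkaUtils;
import org.apache.spark.streaming.kafka010.LocationStrategies;

public class StreamingEtl {
    public static void main(String[] args) throws InterruptedException {
        // local[2]: streaming needs at least one thread for receiving and one for processing
        SparkConf conf = new SparkConf().setAppName("gdg-streaming").setMaster("local[2]");
        // 5-second micro-batches: "rapid micro-batching", not record-at-a-time streaming
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(5));

        Map<String, Object> kafkaParams = new HashMap<>();
        kafkaParams.put("bootstrap.servers", "localhost:9092"); // placeholder broker
        kafkaParams.put("key.deserializer", StringDeserializer.class);
        kafkaParams.put("value.deserializer", StringDeserializer.class);
        kafkaParams.put("group.id", "gdg-demo");
        kafkaParams.put("auto.offset.reset", "latest");

        JavaInputDStream<ConsumerRecord<String, String>> stream = KafkaUtils.createDirectStream(
                jssc,
                LocationStrategies.PreferConsistent(),
                ConsumerStrategies.<String, String>Subscribe(Arrays.asList("sensor-events"), kafkaParams));

        // Simplest possible streaming ETL step: count events per micro-batch before pushing downstream
        stream.map(ConsumerRecord::value)
              .foreachRDD(rdd -> System.out.println("events in batch: " + rdd.count()));

        jssc.start();
        jssc.awaitTermination();
    }
}
```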
24. Now… let’s store it ! NoSql store it !
Why NoSQL ?
■ Large datasets
■ Easy to scale out
■ Less schema validation on write means faster writes
■ Schemaless databases can be of great value in Big Data: we often don't yet know exactly what we need, so we keep the raw, "dirty" data (a small document-store sketch follows below)
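The deck does not show which store it uses; as one illustration of "schemaless, no validation on write", here is a sketch with the MongoDB Java driver, where documents of different shapes land in the same collection. The connection string, database and collection names are made up.

```java
import java.util.Arrays;
import org.bson.Document;
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;

public class SchemalessWrite {
    public static void main(String[] args) {
        // Connection string and names are placeholders for this sketch
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoCollection<Document> events = client.getDatabase("gdg").getCollection("raw_events");

            // Two documents with different shapes in the same collection:
            // no schema migration, no validation on write
            events.insertOne(new Document("type", "click").append("url", "https://example.com"));
            events.insertOne(new Document("type", "sensor")
                    .append("temperature", 21.5)
                    .append("tags", Arrays.asList("lublin", "iot")));
        }
    }
}
```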
27. Why it’s modern ?
■ Fast, reliable
■ More like - write once, run everywhere thanks to Spark, Spark Shell, Zeppelin
■ Less code (Hadoop’s MapReduce ? It’s an essay)
■ Compared to the older approach - less chaos, thanks to Kafka.
28. Cons
■ More like micro-batching than real time
■ A lot of the stack is still evolving (Spark, Kafka) and lacks professional customer support
■ Things tend to get complicated when Kafka message formats within a single topic evolve
■ DevOps needed; strong, experienced developers needed
■ Distributed world is complicated world
■ Thousands of frameworks and ideas every year
32. Spark ? Interesting alternative to ETL Hell
■ SAP, SAS, Elixir
■ ETL has nice visual building blocks, but this means....
■ … Click, Click, Click, Click… (RSI danger !)
■ Building blocks mean plain-text code hidden in stages: hard to debug, hard to unit-test.
■ Waste of resources: ETL jobs fire at night, when peak capacity is needed; for the rest of the day those resources sit unused.
■ Data gets out of sync, so the ETL pipeline gets out of sync.
■ In the Big Data world we have Apache Avro and a schema registry for schema evolution
■ Big Data can handle more
■ Legacy code
■ $$$ It’s for FREE $$$
■ Can throw Machine Learning into it and do interesting things. Not only batches.
■ Lacks professional support
■ Big Data is not that mature
■ Let's see what happens here (a minimal ETL sketch follows below)
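As a contrast to click-built ETL blocks, here is a sketch of the same kind of job as plain Spark SQL code in Java: read CSV, filter and rename, write partitioned Parquet. The file paths and column names are hypothetical.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import static org.apache.spark.sql.functions.col;

public class CsvToParquetEtl {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("gdg-etl")
                .master("local[*]") // placeholder; set via spark-submit on a cluster
                .getOrCreate();

        // Paths are placeholders; inferSchema avoids declaring columns up front
        Dataset<Row> orders = spark.read()
                .option("header", "true")
                .option("inferSchema", "true")
                .csv("orders.csv");

        // The "transform" step lives in plain, testable code instead of visual ETL blocks
        Dataset<Row> cleaned = orders
                .filter(col("amount").gt(0))
                .withColumnRenamed("cust_id", "customerId");

        // Columnar Parquet output, partitioned (here by a hypothetical "country" column)
        cleaned.write().partitionBy("country").parquet("orders_parquet");

        spark.stop();
    }
}
```

Because the transformation is ordinary code, it can be unit-tested and code-reviewed like any other part of the system, which addresses the "hard to debug, hard to unit-test" point above.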
35. Apache Kafka vs. Rabbit MQ
Kafka :
■ + Fire hose of events (100k+/sec)
■ + Ability to re-read messages (good for CQRS - see the replay sketch at the end)
■ + Scale out
■ + Confluent -> Kafka Connect, Kafka Streams, Schema Registry
■ - You have to support it on your own
■ - No AMQP and no complex routing
RabbitMQ :
■ + Messages can be routed to consumers in complex ways
■ + Mature - you like yelling at support guys rather than fixing it yourself ? This is the place for you !
■ + Scale out
■ - Lower throughput (~20k messages/sec)
■ - Messages are deleted after consumers ack
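To illustrate the "re-read messages" advantage listed for Kafka, here is a consumer sketch that rewinds to the beginning of the log and replays it. The broker, topic and group id are placeholders; with RabbitMQ this replay is not possible once messages have been acked and deleted.

```java
import java.time.Duration;
import java.util.Arrays;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class ReplayConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder broker
        props.put("group.id", "replay-demo");
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());
        props.put("enable.auto.commit", "false");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Arrays.asList("sensor-events"));
            // Simplification: one poll to join the group and receive partition assignments,
            // then rewind all assigned partitions - the log is still there, so we can replay it
            consumer.poll(Duration.ofSeconds(1));
            consumer.seekToBeginning(consumer.assignment());

            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(5));
            for (ConsumerRecord<String, String> record : records) {
                System.out.printf("offset=%d key=%s value=%s%n",
                        record.offset(), record.key(), record.value());
            }
        }
    }
}
```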