What no one tells you about writing a streaming app (hadooparchbook)
This document discusses 5 things that are often not addressed when writing streaming applications:
1. Managing and monitoring long-running streaming jobs can be challenging because the underlying frameworks were not originally designed for always-on workloads. Options include using cluster mode so jobs continue if clients disconnect, and leveraging monitoring tools to track metrics.
2. Preventing data loss requires different approaches depending on the data source: file- and receiver-based sources benefit from checkpointing and write-ahead logs, while Kafka's commit log ensures data is not lost.
3. Spark Streaming is well suited to windowing, aggregations, and machine learning, but may not be needed for every streaming use case.
4. Achieving exactly-once semantics requires techniques such as tracking offsets and making operations idempotent.
The document discusses best practices for streaming applications. It covers common streaming use cases like ingestion, transformations, and counting. It also discusses advanced streaming use cases that involve machine learning. The document provides an overview of streaming architectures and compares different streaming engines like Spark Streaming, Flink, Storm, and Kafka Streams. It discusses when to use different storage systems and message brokers like Kafka for ingestion pipelines. The goal is to understand common streaming use cases and their architectures.
Top 5 mistakes when writing Streaming applications (hadooparchbook)
This document discusses 5 common mistakes when writing streaming applications and provides solutions. It covers: 1) Not shutting down apps gracefully; use shutdown hooks or external markers to stop processing after a batch finishes. 2) Assuming exactly-once semantics; things can fail at multiple points, so track offsets and keep operations idempotent. 3) Using streaming for everything, when batch processing is better for some goals. 4) Not preventing data loss; enable checkpointing and write-ahead logs. 5) Not monitoring jobs; use tools like the Spark Streaming UI and Graphite, and YARN cluster mode for automatic restarts.
Application architectures with Hadoop – Big Data TechCon 2014 (hadooparchbook)
Building applications using Apache Hadoop with a use-case of clickstream analysis. Presented by Mark Grover and Jonathan Seidman at Big Data TechCon, Boston in April 2014
Architecting applications with Hadoop - using clickstream analytics as an example (hadooparchbook)
Delivered by Mark Grover at Northern CO Hadoop User Group:
http://www.meetup.com/Northern-Colorado-Big-Data-Meetup/events/224717963/
Architecting applications with Hadoop - Fraud Detection (hadooparchbook)
This document discusses architectures for fraud detection applications using Hadoop. It provides an overview of requirements for such an application, including the need for real-time alerts and batch processing. It proposes using Kafka for ingestion due to its high throughput and partitioning. HBase and HDFS would be used for storage, with HBase better supporting random access for profiles. The document outlines using Flume, Spark Streaming, and HBase for near real-time processing and alerting on incoming events. Batch processing would use HDFS, Impala, and Spark. Caching profiles in memory is also suggested to improve performance.
Hadoop application architectures - using Customer 360 as an example (hadooparchbook)
Hadoop application architectures - using Customer 360 (more generally, Entity 360) as an example. By Ted Malaska, Jonathan Seidman and Mark Grover at Strata + Hadoop World 2016 in NYC.
Top 5 mistakes when writing Spark applications (markgrover)
This document discusses 5 common mistakes people make when writing Spark applications.
The first mistake is improperly sizing Spark executors by not considering factors like the number of cores, amount of memory, and overhead needed. The second mistake is running into the 2GB limit on Spark shuffle blocks, which can cause jobs to fail. The third mistake is not addressing data skew during joins and shuffles, which can cause some tasks to be much slower than others. The fourth mistake is poorly managing the DAG by overusing shuffles, not using techniques like ReduceByKey instead of GroupByKey, and not using complex data types. The fifth mistake is classpath conflicts between the versions of libraries used by Spark and those added by the user.
The document discusses application architectures using Hadoop. It provides an example case study of clickstream analysis of web logs. It discusses challenges of Hadoop implementation and various architectural considerations for data storage, modeling, ingestion, processing and what specific processing needs to happen for the case study. These include sessionization, filtering, and business intelligence/discovery. Storage options, file formats, schema design, and processing engines like MapReduce, Spark and Impala are also covered.
The document discusses real-time fraud detection patterns and architectures. It provides an overview of key technologies like Kafka, Flume, and Spark Streaming used for real-time event processing. It then describes a high-level architecture involving ingesting events through Flume and Kafka into Spark Streaming for real-time processing, with results stored in HBase, HDFS, and Solr. The document also covers partitioning strategies, micro-batching, complex topologies, and ingestion of real-time and batch data.
The document discusses architectural considerations for Hadoop applications based on a case study of clickstream analysis. It covers requirements for data ingestion, storage, processing, and orchestration. For data storage, it recommends storing raw clickstream data in HDFS using the Avro file format with Snappy compression. For processed data, it recommends using the Parquet columnar storage format to enable efficient analytical queries. The document also discusses partitioning strategies and HDFS directory layout design.
This document discusses a presentation on fraud detection application architectures using Hadoop. It provides an overview of different fraud use cases and challenges in implementing Hadoop-based solutions. Requirements for the applications include handling high volumes, velocities and varieties of data, generating real-time alerts with low latency, and performing both stream and batch processing. A high-level architecture is proposed using Hadoop, HBase, HDFS, Kafka and Spark to meet the requirements. Storage layer choices and considerations are also discussed.
Architectural considerations for Hadoop Applications (hadooparchbook)
The document discusses architectural considerations for Hadoop applications using a case study on clickstream analysis. It covers requirements for data ingestion, storage, processing, and orchestration. For data storage, it considers HDFS vs HBase, file formats, and compression formats. SequenceFiles are identified as a good choice for raw data storage as they allow for splittable compression.
Application Architectures with Hadoop - UK Hadoop User Group (hadooparchbook)
This document discusses architectural considerations for analyzing clickstream data using Hadoop. It covers choices for data storage layers like HDFS vs HBase, data formats like Avro and Parquet, partitioning strategies, and data ingestion using tools like Flume and Kafka. It also discusses processing engines like MapReduce, Spark and Impala and how they can be used to sessionize data and perform other analytics.
The NameNode was experiencing high load and instability after being restarted. Graphs showed unknown high load between checkpoints on the NameNode. DataNode logs showed repeated 60000 millisecond timeouts in communication with the NameNode. Thread dumps revealed NameNode server handlers waiting on the same lock, indicating a bottleneck. Source code analysis pointed to repeated block reports from DataNodes to the NameNode as the likely cause of the high load.
I gave this talk at the Highload++ 2015 conference in Moscow. The slides have been translated into English. They cover the Apache HAWQ components, its architecture, query processing logic, and competitive information.
Architecting a Fraud Detection Application with Hadoop (DataWorks Summit)
The document discusses real-time fraud detection patterns and architectures. It provides an overview of key technologies like Kafka, Flume, and Spark Streaming used for real-time event processing. It then describes a high-level architecture that focuses first on near real-time processing using technologies like Kafka and Spark Streaming for initial event processing before completing the picture with micro-batching, ingestion, and batch processing.
Cassandra Day 2014: Interactive Analytics with Cassandra and Spark (Evan Chan)
This document discusses using Spark and Cassandra together for interactive analytics. It describes how Evan Chan uses both technologies at Ooyala to solve the problem of generating analytics from raw data in Cassandra in a flexible and fast way. It outlines their architecture of using Spark to generate materialized views from Cassandra data and then powering queries with those cached views for low latency queries.
Introduction to Spark Streaming & Apache Kafka | Big Data Hadoop Spark Tutori... (CloudxLab)
Big Data with Hadoop & Spark Training: http://bit.ly/2L6bZbn
This CloudxLab Introduction to Spark Streaming & Apache Kafka tutorial helps you to understand Spark Streaming and Kafka in detail. Below are the topics covered in this tutorial:
1) Spark Streaming - Workflow
2) Use Cases - E-commerce, Real-time Sentiment Analysis & Real-time Fraud Detection
3) Spark Streaming - DStream
4) Word Count Hands-on using Spark Streaming
5) Spark Streaming - Running Locally Vs Running on Cluster
6) Introduction to Apache Kafka
7) Apache Kafka Hands-on on CloudxLab
8) Integrating Spark Streaming & Kafka
9) Spark Streaming & Kafka Hands-on
Delivering Meaning In Near-Real Time At High Velocity In Massive Scale with A... (Helena Edelson)
Streaming Big Data: Delivering Meaning In Near-Real Time At High Velocity At Massive Scale with Apache Spark, Apache Kafka, Apache Cassandra, Akka and the Spark Cassandra Connector. Why this pairing of technologies and How easy it is to implement. Example application: https://github.com/killrweather/killrweather
Productionizing Spark and the REST Job Server - Evan Chan (Spark Summit)
The document discusses productionizing Apache Spark and using the Spark REST Job Server. It provides an overview of Spark deployment options like YARN, Mesos, and Spark Standalone mode. It also covers Spark configuration topics like jars management, classpath configuration, and tuning garbage collection. The document then discusses running Spark applications in a cluster using tools like spark-submit and the Spark Job Server. It highlights features of the Spark Job Server like enabling low-latency Spark queries and sharing cached RDDs across jobs. Finally, it provides examples of using the Spark Job Server in production environments.
Spark after Dark by Chris Fregly of Databricks (Data Con LA)
Spark After Dark is a mock dating site that uses the latest Spark libraries, AWS Kinesis, Lambda Architecture, and Probabilistic Data Structures to generate dating recommendations.
There will be 5+ demos covering everything from basic data ETL to advanced data processing including Alternating Least Squares Machine Learning/Collaborative Filtering and PageRank Graph Processing.
There is heavy emphasis on Spark Streaming and AWS Kinesis.
Watch the video here
https://www.youtube.com/watch?v=g0i_d8YT-Bs
Hadoop Application Architectures tutorial at Big Data Service 2015 (hadooparchbook)
This document outlines a presentation on architectural considerations for Hadoop applications. It introduces the presenters who are experts from Cloudera and contributors to Apache Hadoop projects. It then discusses a case study on clickstream analysis, how this was challenging before Hadoop due to data storage limitations, and how Hadoop provides a better solution by enabling active archiving of large volumes and varieties of data at scale. Finally, it covers some of the challenges in implementing Hadoop, such as choices around storage managers, data modeling and file formats, data movement workflows, metadata management, and data access and processing frameworks.
Apache Drill and Zeppelin: Two Promising Tools You've Never Heard Of (Charles Givre)
Study after study shows that data preparation and other data janitorial work consume 50-90% of most data scientists’ time. Apache Drill is a very promising tool that can help address this. Drill works with many different forms of “self-describing data” and allows analysts to run ad-hoc queries in ANSI SQL against that data. Unlike Hive or other SQL-on-Hadoop tools, Drill is not a wrapper for MapReduce and can scale to clusters of up to 10k nodes.
Spark Streaming & Kafka - The Future of Stream Processing (Jack Gudenkauf)
Hari Shreedharan/Cloudera @Playtika. With its easy-to-use interfaces and native integration with some of the most popular ingest tools, such as Kafka, Flume, and Kinesis, Spark Streaming has become the go-to tool for stream processing. Code sharing with Spark also makes it attractive. In this talk, we will discuss the latest features in Spark Streaming, how it integrates with Kafka natively with no data loss, and even how to do exactly-once processing!
Architecting a Next Generation Data Platform (hadooparchbook)
This document discusses a presentation on architecting Hadoop application architectures for a next generation data platform. It provides an overview of the presentation topics, which include a case study on using Hadoop for an Internet of Things and entity 360 application. It introduces the key components of the proposed high-level architecture, including ingesting streaming and batch data using Kafka and Flume, stream processing with Kafka Streams, and storage in Hadoop.
This document provides an overview of SolrCloud on Hadoop. It discusses how SolrCloud allows for distributed, highly scalable search capabilities on Hadoop clusters. Key components that work with SolrCloud are also summarized, including HDFS for storage, MapReduce for processing, and ZooKeeper for coordination services. The document demonstrates how SolrCloud can index and query large datasets stored in Hadoop.
Five Fabulous Sinks for Your Kafka Data. #3 will surprise you! (Rachel Pedres..., confluent)
Apache Kafka has become the central point for a fast and scalable streaming platform. Now, thanks to the open source explosion over the last decade, there are numerous data stores available as sinks for Kafka-brokered data, from search to document stores, columnar DBs, time series DBs and more. While many claim to be the Swiss Army knife, in reality each is designed for specific types of data and analytics approaches. In this talk, we will cover the taxonomy of various data sinks and delve into each category's pros, cons and ideal use cases, so you can select the right ones and tie them together with Kafka into a well-considered architecture.
Architecting next generation big data platform (hadooparchbook)
A tutorial on architecting next generation big data platform by the authors of O'Reilly's Hadoop Application Architectures book. This tutorial discusses how to build a customer 360 (or entity 360) big data application.
Audience: Technical.
Being Ready for Apache Kafka - Apache: Big Data Europe 2015 (Michael Noll)
These are the slides of my Kafka talk at Apache: Big Data Europe in Budapest, Hungary. Enjoy! --Michael
Apache Kafka is a high-throughput distributed messaging system that has become a mission-critical infrastructure component for modern data platforms. Kafka is used across a wide range of industries by thousands of companies such as Twitter, Netflix, Cisco, PayPal, and many others.
After a brief introduction to Kafka this talk will provide an update on the growth and status of the Kafka project community. Rest of the talk will focus on walking the audience through what's required to put Kafka in production. We’ll give an overview of the current ecosystem of Kafka, including: client libraries for creating your own apps; operational tools; peripheral components required for running Kafka in production and for integration with other systems like Hadoop. We will cover the upcoming project roadmap, which adds key features to make Kafka even more convenient to use and more robust in production.
Apache Deep Learning 101 - ApacheCon Montreal 2018 v0.31 (Timothy Spann)
Apache Deep Learning 101 - ApacheCon Montreal 2018 v0.31
An overview for Big Data Engineers on how one could use Apache projects to run deep learning workflows with Apache NiFi, YARN, Spark, Kafka and many other Apache projects.
Emerging technologies/frameworks in Big Data (Rahul Jain)
A short overview presentation on Emerging technologies /frameworks in Big Data covering Apache Parquet, Apache Flink, Apache Drill with basic concepts of Columnar Storage and Dremel.
"The common use cases of Spark SQL include ad hoc analysis, logical warehouse, query federation, and ETL processing. Spark SQL also powers the other Spark libraries, including structured streaming for stream processing, MLlib for machine learning, and GraphFrame for graph-parallel computation. For boosting the speed of your Spark applications, you can perform the optimization efforts on the queries prior employing to the production systems. Spark query plans and Spark UIs provide you insight on the performance of your queries. This talk discloses how to read and tune the query plans for enhanced performance. It will also cover the major related features in the recent and upcoming releases of Apache Spark.
"
In my talk I will discuss and show examples of using Apache Hadoop, Apache Hive, Apache MXNet, Apache OpenNLP, Apache NiFi and Apache Spark for deep learning applications. This is the follow-up to last year's Apache Deep Learning 101, which was done at DataWorks Summit and ApacheCon.
As part of my talk I will walk through using Apache MXNet pre-built models, MXNet's new Model Server with Apache NiFi, executing MXNet with Apache NiFi, and running Apache MXNet on edge nodes utilizing Python and Apache MiNiFi.
This talk is geared towards data engineers interested in the basics of deep learning with open source Apache tools in a Big Data environment. I will walk through source code examples available in GitHub and run the code live on an Apache Hadoop / YARN / Apache Spark cluster.
This will be an introduction to executing deep learning pipelines in an Apache Big Data environment.
My talk at DataWorks Summit Sydney was listed in the top 7: https://hortonworks.com/blog/7-sessions-dataworks-summit-sydney-see/
I have also spoken at and run Future of Data Princeton, and spoken at Oracle Code NYC.
https://www.slideshare.net/oom65/hadoop-security-architecture?next_slideshow=1
https://community.hortonworks.com/articles/83100/deep-learning-iot-workflows-with-raspberry-pi-mqtt.html
https://community.hortonworks.com/articles/146704/edge-analytics-with-nvidia-jetson-tx1-running-apac.html
https://dzone.com/refcardz/introduction-to-tensorflow
Building data pipelines for modern data warehouse with Apache® Spark™ and .NE... (Michael Rys)
This presentation shows how you can build solutions that follow the modern data warehouse architecture and introduces the .NET for Apache Spark support (https://dot.net/spark, https://github.com/dotnet/spark)
Intro to Machine Learning with H2O and AWS (Sri Ambati)
Navdeep Gill @ Galvanize Seattle- May 2016
- Powered by the open source machine learning software H2O.ai. Contributors welcome at: https://github.com/h2oai
- To view videos on H2O open source machine learning software, go to: https://www.youtube.com/user/0xdata
Hadoop made fast - Why Virtual Reality Needed Stream Processing to Survive (confluent)
1. A streaming platform like Kafka can provide the benefits of Hadoop for batch processing but in a faster, real-time way by processing data as it arrives rather than storing all data.
2. Virtual reality applications require stream processing to power features like VR mirroring and capture in real-time. Kafka's stream processing capabilities address challenges like this for VR.
3. The document discusses how AltspaceVR uses Kafka stream processing for applications like VR mirroring and capture, presence tracking, scheduled tasks, and more to power their real-time VR experiences.
What's New in Apache Spark 2.3 & Why Should You Care (Databricks)
The Apache Spark 2.3 release marks a big step forward in speed, unification, and API support.
This talk will quickly walk through what’s new and how you can benefit from the upcoming improvements:
* Continuous Processing in Structured Streaming.
* PySpark support for vectorization, giving Python developers the ability to run native Python code fast.
* Native Kubernetes support, marrying the best of container orchestration and distributed data processing.
Real time Analytics with Apache Kafka and Apache Spark (Rahul Jain)
A presentation cum workshop on real-time analytics with Apache Kafka and Apache Spark. Apache Kafka is a distributed publish-subscribe messaging system, while Spark Streaming brings Spark's language-integrated API to stream processing and lets you write streaming applications very quickly and easily. It supports both Java and Scala. In this workshop we explore Apache Kafka, ZooKeeper and Spark with a web clickstream example using Spark Streaming. A clickstream is the recording of the parts of the screen a computer user clicks on while web browsing.
This presentation will be useful to those who would like to get acquainted with Apache Spark architecture and top features, and see some of them in action, e.g. RDD transformations and actions, Spark SQL, etc. It also covers real-life use cases from one of our commercial projects and recalls the roadmap of how we integrated Apache Spark into it.
Was presented at Morning@Lohika tech talks in Lviv.
Design by Yarko Filevych: http://www.filevych.com/
Talk given by Max Feldman, Product Security Engineer at Salesforce, at AppSec USA.
One challenging aspect of achieving software security is the struggle to catch up with the speed of development and deployment. We built Providence with the goal of preventing obvious bugs from ever being deployed into production.
Providence is a lightweight and scalable tool which finds bugs and anti-patterns of varying complexity from code commits, and we’ve used it to prevent vulnerabilities ranging from XSS, to access control issues, to XXE. It works by continuously monitoring and pulling commits from version control systems and scanning them for bugs with rules defined in plugins. Additional plugins are easy to create and deploy, which has allowed for quick reaction to new bugs or problems as they are discovered.
Providence is easily integrated with SDLC workflows or bug-tracking tools, and we will discuss how we have integrated it in-house in an unobtrusive manner. This model of addressing issues also provides relative immediacy of resolution; on average, potential problems found by Providence are resolved more quickly than other vulnerabilities because developers are presented the issues right after they commit the code, instead of weeks to months later.
We are currently in the process of open-sourcing Providence in order to share the tool with the DevOps/security community (or any interested parties). This talk will cover the internals of Providence, its engine and plugin architecture (including examples of plugins and their ease of creation), as well as its integration with our SDLC and the faster and more efficient responses we’ve achieved as a result. We’re continuing to build new plugins and features, and we’re excited to see what ideas others may have in mind!
Architecting a next-generation data platform (hadooparchbook)
This document discusses a high-level architecture for analyzing taxi trip data in real-time and batch using Apache Hadoop and streaming technologies. The architecture includes ingesting data from multiple sources using Kafka, processing streaming data using stream processing engines, storing data in data stores like HDFS, and enabling real-time and batch querying and analytics. Key considerations discussed are choosing data transport and stream processing technologies, scaling and reliability, and processing both streaming and batch data.
Top 5 mistakes when writing Spark applications (hadooparchbook)
This document discusses common mistakes made when writing Spark applications and provides recommendations to address them. It covers issues like having executors that are too small or large, shuffle blocks exceeding size limits, data skew slowing jobs, and excessive stages. The key recommendations are to optimize executor and partition sizes, increase partitions to reduce skew, use techniques like salting to address skew, and favor transformations like ReduceByKey over GroupByKey to minimize shuffles and memory usage.
Building a fraud detection application using the tools in the Hadoop ecosystem. Presentation given by authors of O'Reilly's Hadoop Application Architectures book at Strata + Hadoop World in San Jose, CA 2016.
Top 5 mistakes when writing Spark applications (hadooparchbook)
This document discusses common mistakes people make when writing Spark applications and provides recommendations to address them. It covers issues related to executor configuration, application failures due to shuffle block sizes exceeding limits, slow jobs caused by data skew, and managing the DAG to avoid excessive shuffles and stages. Recommendations include using smaller executors, increasing the number of partitions, addressing skew through techniques like salting, and preferring ReduceByKey over GroupByKey and TreeReduce over Reduce to improve performance and resource usage.
This document discusses a case study on fraud detection using Hadoop. It begins with an overview of fraud detection requirements, including the need for real-time and near real-time processing of large volumes and varieties of data. It then covers considerations for the system architecture, including using HDFS and HBase for storage, Kafka for ingestion, and Spark and Storm for stream and batch processing. Data modeling with HBase and caching options are also discussed.
The document provides an agenda and slides for a presentation on architectural considerations for data warehousing with Hadoop. The presentation discusses typical data warehouse architectures and challenges, how Hadoop can complement existing architectures, and provides an example use case of implementing a data warehouse with Hadoop using the Movielens dataset. Key aspects covered include ingestion of data from various sources using tools like Flume and Sqoop, data modeling and storage formats in Hadoop, processing the data using tools like Hive and Spark, and exporting results to a data warehouse.
This document discusses application architectures using Hadoop. It provides an example case study of clickstream analysis. It covers challenges of Hadoop implementation and various architectural considerations for data storage and modeling, data ingestion, and data processing. For data processing, it discusses different processing engines like MapReduce, Pig, Hive, Spark and Impala. It also discusses what specific processing needs to be done for the clickstream data like sessionization and filtering.
Impala Architecture Presentation at Toronto Hadoop User Group, in January 2014 by Mark Grover.
Event details:
http://www.meetup.com/TorontoHUG/events/150328602/
2. 2
About the book
• @hadooparchbook
• hadooparchitecturebook.com
• github.com/hadooparchitecturebook
• slideshare.com/hadooparchbook
Questions? tiny.cloudera.com/streaming-singapore-questions
3. 3
About the presenters
Ted Malaska
• Principal Solutions Architect at Cloudera
• Done Hadoop for 6 years
– Worked with > 70 companies in 8 countries
• Previously, lead architect at FINRA
• Contributor to Apache Hadoop, HBase, Flume, Avro, Pig and Spark
• Marvel fan boy, runner
Mark Grover
• Software Engineer at Cloudera, working on Spark
• Committer on Apache Bigtop, PMC member on Apache Sentry (incubating)
• Contributor to Apache Hadoop, Spark, Hive, Sqoop, Pig and Flume
Questions? tiny.cloudera.com/streaming-singapore-questions
7. 7
When to stream, and when not to
• We are looking for an SLA sweet spot
• Multi milliseconds to seconds
• Not minutes
• Not constant low milliseconds or under
• Doesn’t come for free
Questions? tiny.cloudera.com/streaming-singapore-questions
12. 12
But there are multiple sources
Questions? tiny.cloudera.com/streaming-singapore-questions
[Diagram: Source System 1, 2, 3 -> Ingest -> Destination system]
13. 13
But..
• Sources, sinks, ingestion channels may go down
• Sources and sinks may be producing/consuming at different rates
• Regular maintenance windows may need to be scheduled
• We need a resilient message broker
Questions? tiny.cloudera.com/streaming-singapore-questions
14. 14
Need for a message broker
Questions? tiny.cloudera.com/streaming-singapore-questions
[Diagram: Source System 1, 2, 3 -> Ingest -> Message broker -> Extract -> Destination system]
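For concreteness, here is a minimal sketch, not from the deck, of a source publishing into the broker with the Kafka producer API; the broker address and topic name are assumptions for illustration:

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

object EventPublisher {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put("bootstrap.servers", "localhost:9092") // assumed broker address
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

    val producer = new KafkaProducer[String, String](props)
    // The source only knows the broker and a topic, not the downstream sinks
    producer.send(new ProducerRecord[String, String]("events", "key-1", "hello"))
    producer.close()
  }
}

This decoupling is the point of the broker: producers and consumers can go down, or run at different rates, without knowing about each other.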
16. 16
But ‘queue’ doesn’t ‘push’
Questions? tiny.cloudera.com/streaming-singapore-questions
[Diagram: Source System 1, 2, 3 -> Ingest -> Message broker -> Extract -> Streaming ingestion process -> Push -> Storage system]
17. 17
Streaming data ingestion process
Questions? tiny.cloudera.com/streaming-singapore-questions
[Diagram: Source System 1, 2, 3 -> Ingest -> Message broker -> Extract -> Streaming ingestion process (Kafka Connect or Apache Flume) -> Push -> Storage system]
18. 18
Streaming architecture for ingestion
Questions? tiny.cloudera.com/streaming-singapore-questions
[Diagram: Source System 1, 2, 3 -> Ingest -> Message broker -> Extract -> Streaming ingestion process (Kafka Connect or Apache Flume) -> Push -> Storage system]
20. 20
Streaming architecture for ingestion
Questions? tiny.cloudera.com/streaming-singapore-questions
[Diagram: Source System 1, 2, 3 -> Ingest -> Message broker -> Extract -> Streaming ingestion process (Kafka Connect or Apache Flume) -> Push -> Storage system]
21. 21
Streaming architecture for ingestion
Questions? tiny.cloudera.com/streaming-singapore-questions
[Diagram: Source System 1, 2, 3 -> Ingest -> Message broker -> Extract -> Streaming ingestion process (Kafka Connect or Apache Flume) -> Push -> Storage system; the streaming ingestion process can be used to do simple transformations]
22. 22
Two types of transformations
Atomic
• Need to work with one event at a time
• Example – mask a credit card number
With context
• Need to refer to external context
• Example – convert zip code to state, by looking up a cache
Questions? tiny.cloudera.com/streaming-singapore-questions
23. 23
Atomic transformations
• Require no context
• Can be simply done within Flume interceptors, Kafka Connect, or Spark Streaming
Questions? tiny.cloudera.com/streaming-singapore-questions
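To make the atomic case concrete, here is a minimal sketch, not from the deck, of masking credit card numbers in Spark Streaming with a per-event map; the socket source, regex, and mask format are assumptions for illustration:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object MaskCards {
  // Hypothetical 16-digit card pattern; real masking rules would be stricter
  private val CardPattern = "\\b\\d{4}[- ]?\\d{4}[- ]?\\d{4}[- ]?\\d{4}\\b".r

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("MaskCards")
    val ssc = new StreamingContext(conf, Seconds(1))
    val lines = ssc.socketTextStream("localhost", 9999)
    // Atomic: each event is masked on its own, with no external lookup
    val masked = lines.map(line => CardPattern.replaceAllIn(line, "XXXX-XXXX-XXXX-XXXX"))
    masked.print()
    ssc.start()
    ssc.awaitTermination()
  }
}

A "with context" version of the same job would differ only in that the map would consult an external cache (e.g. zip code to state).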
24. 24
Flume Interceptors
• Mask fields
• Validate information against external source
• Extract fields
• Modify data format
• Filter or split events
Questions? tiny.cloudera.com/streaming-singapore-questions
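As a hedged sketch of the interceptor shape (the class and header names here are made up, not from the deck), a Flume interceptor that extracts a field into an event header implements org.apache.flume.interceptor.Interceptor plus a Builder:

import java.util.{List => JList}
import scala.collection.JavaConverters._
import org.apache.flume.{Context, Event}
import org.apache.flume.interceptor.Interceptor

// Hypothetical interceptor: copies the first whitespace-separated field of the
// event body into a header, so downstream components can route or filter on it
class FirstFieldInterceptor extends Interceptor {
  override def initialize(): Unit = {}

  override def intercept(event: Event): Event = {
    val body = new String(event.getBody, "UTF-8")
    body.split("\\s+").headOption.foreach(f => event.getHeaders.put("first_field", f))
    event
  }

  override def intercept(events: JList[Event]): JList[Event] =
    events.asScala.map(e => intercept(e)).asJava

  override def close(): Unit = {}
}

object FirstFieldInterceptor {
  class Builder extends Interceptor.Builder {
    override def build(): Interceptor = new FirstFieldInterceptor
    override def configure(context: Context): Unit = {} // no settings needed
  }
}

The Builder is what the Flume agent configuration points at; masking, validating, and filtering interceptors all follow the same shape.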
25. 25
Streaming architecture for ingestion
Questions? tiny.cloudera.com/streaming-singapore-questions
[Diagram: Source System 1, 2, 3 -> Ingest -> Message broker -> Extract -> Streaming ingestion process (Kafka Connect or Apache Flume) -> Push -> Storage system; the streaming ingestion process can be used to do simple transformations]
28. 28
Streaming architecture for ingestion
Questions? tiny.cloudera.com/streaming-singapore-questions
[Diagram: Source System 1, 2, 3 -> Ingest -> Message broker -> Extract -> Streaming ingestion process (Copycat, i.e. Kafka Connect, or Apache Flume) -> Push -> Storage system; the streaming ingestion process can be used to do simple transformations]
29. 29
Semantic types
• At most once
– Not good for many cases
– Only where performance/SLA is more important than accuracy
• Exactly once
– Expensive to achieve but desirable
• At least once
– Easiest to achieve
Questions? tiny.cloudera.com/streaming-singapore-questions
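As an illustrative sketch, not from the deck, at-least-once consumption with Kafka falls out of committing offsets only after processing; the broker address, topic, and group ID are assumptions:

import java.util.{Arrays, Properties}
import scala.collection.JavaConverters._
import org.apache.kafka.clients.consumer.KafkaConsumer

object AtLeastOnceConsumer {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put("bootstrap.servers", "localhost:9092")
    props.put("group.id", "example-group")
    props.put("enable.auto.commit", "false") // commit manually, after processing
    props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
    props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")

    val consumer = new KafkaConsumer[String, String](props)
    consumer.subscribe(Arrays.asList("events"))
    while (true) {
      val records = consumer.poll(1000L)
      records.asScala.foreach(r => println(s"processing ${r.value}"))
      consumer.commitSync() // a crash before this line means records are re-delivered
    }
  }
}

Re-delivery on failure is exactly why the next slides pair at-least-once delivery with idempotent writes to get exactly-once results.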
30. 30
Categories of storage systems
“Puts” based
• Can be re-inserted without side effects, since a re-inserted record will have a duplicate key
“Appends” based
• Cannot be re-inserted
Questions? tiny.cloudera.com/streaming-singapore-questions
31. 31
How to achieve exactly once?
• For “puts” based storage systems
– At least once is enough (keys have to be unique, i.e. a primary key)
– Re-inserted records will have duplicate keys
– Will simply overwrite the existing record with the same value
• For “appends” based storage systems (e.g. HDFS)
– Still easiest to do at least once
– Need to de-duplicate before processing
Questions? tiny.cloudera.com/streaming-singapore-questions
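A minimal sketch of the "puts-based" idea, assuming an HBase table named events keyed by a unique event ID (the table and column names are hypothetical): because the row key is the event ID, a replayed event overwrites the existing cell with the same value, so at-least-once delivery yields exactly-once results.

import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Put}
import org.apache.hadoop.hbase.util.Bytes

object IdempotentWriter {
  def main(args: Array[String]): Unit = {
    val connection = ConnectionFactory.createConnection(HBaseConfiguration.create())
    val table = connection.getTable(TableName.valueOf("events")) // hypothetical table

    // Row key = unique event ID, so re-inserting the same event is a harmless overwrite
    def writeEvent(eventId: String, payload: String): Unit = {
      val put = new Put(Bytes.toBytes(eventId))
      put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("payload"), Bytes.toBytes(payload))
      table.put(put)
    }

    writeEvent("evt-0001", "{\"zip\":\"94105\"}")
    table.close()
    connection.close()
  }
}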
37. 37
Why did Streaming Suck
• Increments with Cassandra
– Double increments
– No strong consistency
• Storm without Kafka
– Not only once (no exactly-once guarantee)
– Not at least once
– Batch would have to re-process EVERY record to remove duplicates
Questions? tiny.cloudera.com/streaming-singapore-questions
38. 38
We have come a long way
• We don’t have to use increments any more, and we can have consistency
– HBase
• We can have state in our streaming platform
– Spark Streaming
• We don’t lose data
– Spark Streaming
– Kafka
– Other options
• Full universe of de-duping
– Again, HBase with versions
Questions? tiny.cloudera.com/streaming-singapore-questions
41. 41
Advanced Streaming
• Ad-hoc analysis will identify value
• Ad-hoc will become batch
• The value will demand lower latency than batch
• Batch will become streaming
Questions? tiny.cloudera.com/streaming-singapore-questions
42. 42
Advanced Streaming
• Requirements for ideal batch-to-streaming frameworks
– Something that can span both paradigms
– Something that can use the tools of ad-hoc analysis (SQL, MLlib, R, Scala, Java)
– Development through a common IDE
– Debugging
– Unit testing
– Common deployment model
Questions? tiny.cloudera.com/streaming-singapore-questions
43. 43
Spark Streaming Example
Questions? tiny.cloudera.com/streaming-singapore-questions
// Batch version, shown for comparison with the streaming version on the next slide
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setMaster("local[2]").setAppName("WordCount") // an app name is required
val sc = new SparkContext(conf)
val lines = sc.textFile(path, 2) // path: location of an input text file
val words = lines.flatMap(_.split(" "))
val pairs = words.map(word => (word, 1))
val wordCounts = pairs.reduceByKey(_ + _)
wordCounts.collect().foreach(println) // RDDs have no print(); collect and print instead
44. 44
Spark Streaming Example
Questions? tiny.cloudera.com/streaming-singapore-questions
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setMaster("local[2]").setAppName("StreamingWordCount")
val ssc = new StreamingContext(conf, Seconds(1))
val lines = ssc.socketTextStream("localhost", 9999)
val words = lines.flatMap(_.split(" "))
val pairs = words.map(word => (word, 1))
val wordCounts = pairs.reduceByKey(_ + _)
wordCounts.print()
ssc.start() // was SSC.start(); Scala is case-sensitive
ssc.awaitTermination() // keep the streaming job running
46. 46
Advanced Streaming
• In Spark Streaming, a DStream is a sequence of RDDs, one per micro-batch interval
• If we can access RDDs in Spark Streaming:
– We can convert to Vectors (KMeans, principal component analysis)
– We can convert to LabeledPoints (NaiveBayes, Random Forest, Linear Support Vector Machines)
– We can convert to DataFrames (SQL, R)
Questions? tiny.cloudera.com/streaming-singapore-questions
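To illustrate the DStream-to-RDD point above, here is a minimal sketch (Spark 1.x-era APIs; the socket source and table name are assumptions) that reaches the underlying RDDs with foreachRDD and registers each micro-batch as a DataFrame for SQL:

import org.apache.spark.SparkConf
import org.apache.spark.sql.SQLContext
import org.apache.spark.streaming.{Seconds, StreamingContext}

object WordCountsAsSql {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("WordCountsAsSql")
    val ssc = new StreamingContext(conf, Seconds(1))

    val words = ssc.socketTextStream("localhost", 9999).flatMap(_.split(" "))

    // A DStream is a sequence of RDDs, one per micro-batch; foreachRDD
    // hands each of those RDDs to ordinary (batch) Spark code
    words.foreachRDD { rdd =>
      val sqlContext = SQLContext.getOrCreate(rdd.sparkContext)
      import sqlContext.implicits._
      rdd.toDF("word").registerTempTable("words")
      sqlContext.sql("SELECT word, COUNT(*) AS c FROM words GROUP BY word").show()
    }

    ssc.start()
    ssc.awaitTermination()
  }
}

The same foreachRDD hook is where the MLlib conversions above (Vectors, LabeledPoints) would happen, which is why a batch-and-streaming-spanning framework is so valuable.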