Building a fraud detection application using the tools in the Hadoop ecosystem. Presentation given by authors of O'Reilly's Hadoop Application Architectures book at Strata + Hadoop World in San Jose, CA 2016.
Architecting a next-generation data platform (hadooparchbook)
This document discusses a high-level architecture for analyzing taxi trip data in real-time and batch using Apache Hadoop and streaming technologies. The architecture includes ingesting data from multiple sources using Kafka, processing streaming data using stream processing engines, storing data in data stores like HDFS, and enabling real-time and batch querying and analytics. Key considerations discussed are choosing data transport and stream processing technologies, scaling and reliability, and processing both streaming and batch data.
This document discusses a case study on fraud detection using Hadoop. It begins with an overview of fraud detection requirements, including the need for real-time and near real-time processing of large volumes and varieties of data. It then covers considerations for the system architecture, including using HDFS and HBase for storage, Kafka for ingestion, and Spark and Storm for stream and batch processing. Data modeling with HBase and caching options are also discussed.
Architecting next generation big data platform (hadooparchbook)
A tutorial on architecting next generation big data platform by the authors of O'Reilly's Hadoop Application Architectures book. This tutorial discusses how to build a customer 360 (or entity 360) big data application.
Audience: Technical.
This document discusses a presentation on fraud detection application architectures using Hadoop. It provides an overview of different fraud use cases and challenges in implementing Hadoop-based solutions. Requirements for the applications include handling high volumes, velocities and varieties of data, generating real-time alerts with low latency, and performing both stream and batch processing. A high-level architecture is proposed using Hadoop, HBase, HDFS, Kafka and Spark to meet the requirements. Storage layer choices and considerations are also discussed.
Hadoop application architectures - using Customer 360 as an example (hadooparchbook)
Hadoop application architectures - using Customer 360 (more generally, Entity 360) as an example. By Ted Malaska, Jonathan Seidman and Mark Grover at Strata + Hadoop World 2016 in NYC.
What no one tells you about writing a streaming app (hadooparchbook)
This document discusses 5 things that are often not addressed when writing streaming applications:
1. Managing and monitoring long-running streaming jobs can be challenging as frameworks were not originally designed for streaming workloads. Options include using cluster mode to ensure jobs continue if clients disconnect and leveraging monitoring tools to track metrics.
2. Preventing data loss requires different approaches depending on the data source. File and receiver-based sources benefit from checkpointing while Kafka's commit log ensures data is not lost.
3. Spark Streaming is well-suited for tasks involving windowing, aggregations, and machine learning but may not be needed for all streaming use cases.
4. Achieving exactly-once semantics requires techniques such as tracking offsets and making operations idempotent.
The document discusses application architectures using Hadoop, with an example case study of clickstream analysis of web logs. It covers the challenges of Hadoop implementation and architectural considerations for data storage, modeling, ingestion, and processing, as well as the specific processing needed for the case study: sessionization, filtering, and business intelligence/discovery. Storage options, file formats, schema design, and processing engines like MapReduce, Spark, and Impala are also covered.
Architecting application with Hadoop - using clickstream analytics as an example (hadooparchbook)
Delivered by Mark Grover at Northern CO Hadoop User Group:
http://www.meetup.com/Northern-Colorado-Big-Data-Meetup/events/224717963/
Top 5 mistakes when writing Streaming applications (hadooparchbook)
This document discusses 5 common mistakes when writing streaming applications and provides solutions. It covers: 1) Not shutting down apps gracefully by using thread hooks or external markers to stop processing after batches finish. 2) Assuming exactly-once semantics when things can fail at multiple points requiring offsets and idempotent operations. 3) Using streaming for everything when batch processing is better for some goals. 4) Not preventing data loss by enabling checkpointing and write-ahead logs. 5) Not monitoring jobs by using tools like Spark Streaming UI, Graphite and YARN cluster mode for automatic restarts.
The document discusses best practices for streaming applications. It covers common streaming use cases like ingestion, transformations, and counting. It also discusses advanced streaming use cases that involve machine learning. The document provides an overview of streaming architectures and compares different streaming engines like Spark Streaming, Flink, Storm, and Kafka Streams. It discusses when to use different storage systems and message brokers like Kafka for ingestion pipelines. The goal is to understand common streaming use cases and their architectures.
Architecting applications with Hadoop - Fraud Detection (hadooparchbook)
This document discusses architectures for fraud detection applications using Hadoop. It provides an overview of requirements for such an application, including the need for real-time alerts and batch processing. It proposes using Kafka for ingestion due to its high throughput and partitioning. HBase and HDFS would be used for storage, with HBase better supporting random access for profiles. The document outlines using Flume, Spark Streaming, and HBase for near real-time processing and alerting on incoming events. Batch processing would use HDFS, Impala, and Spark. Caching profiles in memory is also suggested to improve performance.
The document discusses architectural considerations for Hadoop applications based on a case study of clickstream analysis. It covers requirements for data ingestion, storage, processing, and orchestration. For data storage, it recommends storing raw clickstream data in HDFS using the Avro file format with Snappy compression. For processed data, it recommends using the Parquet columnar storage format to enable efficient analytical queries. The document also discusses partitioning strategies and HDFS directory layout design.
Architectural considerations for Hadoop Applications (hadooparchbook)
The document discusses architectural considerations for Hadoop applications using a case study on clickstream analysis. It covers requirements for data ingestion, storage, processing, and orchestration. For data storage, it considers HDFS vs HBase, file formats, and compression formats. SequenceFiles are identified as a good choice for raw data storage as they allow for splittable compression.
Application Architectures with Hadoop - UK Hadoop User Group (hadooparchbook)
This document discusses architectural considerations for analyzing clickstream data using Hadoop. It covers choices for data storage layers like HDFS vs HBase, data formats like Avro and Parquet, partitioning strategies, and data ingestion using tools like Flume and Kafka. It also discusses processing engines like MapReduce, Spark and Impala and how they can be used to sessionize data and perform other analytics.
The document discusses real-time fraud detection patterns and architectures. It provides an overview of key technologies like Kafka, Flume, and Spark Streaming used for real-time event processing. It then describes a high-level architecture involving ingesting events through Flume and Kafka into Spark Streaming for real-time processing, with results stored in HBase, HDFS, and Solr. The document also covers partitioning strategies, micro-batching, complex topologies, and ingestion of real-time and batch data.
Hadoop Application Architectures tutorial at Big DataService 2015 (hadooparchbook)
This document outlines a presentation on architectural considerations for Hadoop applications. It introduces the presenters who are experts from Cloudera and contributors to Apache Hadoop projects. It then discusses a case study on clickstream analysis, how this was challenging before Hadoop due to data storage limitations, and how Hadoop provides a better solution by enabling active archiving of large volumes and varieties of data at scale. Finally, it covers some of the challenges in implementing Hadoop, such as choices around storage managers, data modeling and file formats, data movement workflows, metadata management, and data access and processing frameworks.
Architecting a Next Generation Data Platform (hadooparchbook)
This document discusses a presentation on architecting Hadoop application architectures for a next generation data platform. It provides an overview of the presentation topics, which include a case study on using Hadoop for an Internet of Things and entity 360 application. It introduces the key components of the proposed high-level architecture, including ingesting streaming and batch data using Kafka and Flume, stream processing with Kafka Streams, and storage in Hadoop.
Application architectures with Hadoop – Big Data TechCon 2014 (hadooparchbook)
Building applications using Apache Hadoop with a use-case of clickstream analysis. Presented by Mark Grover and Jonathan Seidman at Big Data TechCon, Boston in April 2014
Architecting a Fraud Detection Application with Hadoop (DataWorks Summit)
The document discusses real-time fraud detection patterns and architectures. It provides an overview of key technologies like Kafka, Flume, and Spark Streaming used for real-time event processing. It then describes a high-level architecture that focuses first on near real-time processing using technologies like Kafka and Spark Streaming for initial event processing before completing the picture with micro-batching, ingestion, and batch processing.
Architecting a Next Generation Data Platform – Strata Singapore 2017 (Jonathan Seidman)
This document discusses the high-level architecture for a data platform to support a customer 360 view using data from connected vehicles (taxis). The architecture includes data sources, streaming data ingestion using Kafka, schema validation, stream processing for transformations and routing, and storage for analytics, search and long-term retention. The presentation covers design considerations for reliability, scalability and processing of both streaming and batch data to meet requirements like querying, visualization, and batch processing of historical data.
This document discusses application architectures using Hadoop. It provides an example case study of clickstream analysis. It covers challenges of Hadoop implementation and various architectural considerations for data storage and modeling, data ingestion, and data processing. For data processing, it discusses different processing engines like MapReduce, Pig, Hive, Spark and Impala. It also discusses what specific processing needs to be done for the clickstream data like sessionization and filtering.
Top 5 mistakes when writing Spark applications (markgrover)
This document discusses 5 common mistakes people make when writing Spark applications.
The first mistake is improperly sizing Spark executors by not considering factors like the number of cores, amount of memory, and overhead needed. The second mistake is running into the 2GB limit on Spark shuffle blocks, which can cause jobs to fail. The third mistake is not addressing data skew during joins and shuffles, which can cause some tasks to be much slower than others. The fourth mistake is poorly managing the DAG by overusing shuffles, not using techniques like ReduceByKey instead of GroupByKey, and not using complex data types. The fifth mistake is classpath conflicts between the versions of libraries used by Spark and those added by the user.
Architecting a Next Gen Data Platform – Strata London 2018 (Jonathan Seidman)
This document summarizes a presentation on architecting data platforms given at the Strata Data Conference in London 2018. The presentation discusses building a customer 360 view using streaming vehicle and other IoT data. It outlines the requirements to support real-time querying, batch processing, and analytics. The high-level architecture shown includes data sources, streaming pipelines, storage systems, and processing engines. Key challenges discussed are reliably ingesting multiple data types and scaling to support various workloads and access patterns.
Architecting a Next Gen Data Platform – Strata New York 2018 (Jonathan Seidman)
Using Customer 360 and the internet of things as examples, this tutorial explains how to architect a modern, real-time big data platform leveraging recent advancements in the open source software world, including components like Kafka, Flink, Kudu, Spark Streaming, and Spark SQL and modern storage engines to enable new forms of data processing and analytics.
Lessons from Building Large-Scale, Multi-Cloud, SaaS Software at Databricks (Databricks)
The cloud has become one of the most attractive ways for enterprises to purchase software, but it requires building products in a very different way from traditional software.
Horses for Courses: Database Roundtable (Eric Kavanagh)
The blessing and curse of today's database market? So many choices! While relational databases still dominate the day-to-day business, a host of alternatives has evolved around very specific use cases: graph, document, NoSQL, hybrid (HTAP), column store, the list goes on. And the database tools market is teeming with activity as well. Register for this special Research Webcast to hear Dr. Robin Bloor share his early findings about the evolving database market. He'll be joined by Steve Sarsfield of HPE Vertica, and Robert Reeves of Datical in a roundtable discussion with Bloor Group CEO Eric Kavanagh. Send any questions to [email protected], or tweet with #DBSurvival.
Presto: Fast SQL-on-Anything (including Delta Lake, Snowflake, Elasticsearch ...) (Databricks)
Presto, an open source distributed SQL engine, is widely recognized for its low-latency queries, high concurrency, and native ability to query multiple data sources. Proven at scale in a variety of use cases at Airbnb, Comcast, GrubHub, Facebook, FINRA, LinkedIn, Lyft, Netflix, Twitter, and Uber, in the last few years Presto experienced an unprecedented growth in popularity in both on-premises and cloud deployments over Object Stores, HDFS, NoSQL and RDBMS data stores.
First in Class: Optimizing the Data Lake for Tighter Integration (Inside Analysis)
The Briefing Room with Dr. Robin Bloor and Teradata RainStor
Live Webcast October 13, 2015
Watch the archive: https://bloorgroup.webex.com/bloorgroup/lsr.php?RCID=012bb2c290097165911872b1f241531d
Hadoop data lakes are emerging as peers to corporate data warehouses. However, successful data management solutions require a fusion of all relevant data, new and old, which has proven challenging for many companies. With a data lake that’s been optimized for fast queries, solid governance and lifecycle management, users can take data management to a whole new level.
Register for this episode of The Briefing Room to learn from veteran Analyst Dr. Robin Bloor as he discusses the relevance of data lakes in today’s information landscape. He’ll be briefed by Mark Cusack of Teradata, who will explain how his company’s archiving solution has developed into a storage point for raw data. He’ll show how the proven compression, scalability and governance of Teradata RainStor combined with Hadoop can enable an optimized data lake that serves as both reservoir for historical data and as a "system of record” for the enterprise.
Visit InsideAnalysis.com for more information.
The Heart of the Data Mesh Beats in Real-Time with Apache Kafka (Kai Wähner)
If there were a buzzword of the hour, it would certainly be "data mesh"! This new architectural paradigm unlocks analytic data at scale and enables rapid access to an ever-growing number of distributed domain datasets for various usage scenarios.
As such, the data mesh addresses the most common weaknesses of the traditional centralized data lake or data platform architecture. And the heart of a data mesh infrastructure must be real-time, decoupled, reliable, and scalable.
This presentation explores how Apache Kafka, as an open and scalable decentralized real-time platform, can be the basis of a data mesh infrastructure and - complemented by many other data platforms like a data warehouse, data lake, and lakehouse - solve real business problems.
There is no silver bullet or single technology/product/cloud service for implementing a data mesh. The key outcome of a data mesh architecture is the ability to build data products; with the right tool for the job.
A good data mesh combines data streaming technology like Apache Kafka or Confluent Cloud with cloud-native data warehouse and data lake architectures from Snowflake, Databricks, Google BigQuery, et al.
The MyDBOPS team presented at the Oracle MySQL User Camp (29-07-2016). This presentation covers Grafana and Prometheus for MySQL alerting and dashboard setup.
Unbundling the Modern Streaming Stack With Dunith Dhanushka | Current 2022 (HostedbyConfluent)
Event-first thinking and streaming help organizations transition from followers to leaders in the market. A reliable, scalable, and economical streaming architecture helps them get there.
This talk first explores the "classic streaming stack," based on the Lambda architecture, its origin, and why it didn't pick up amongst data-driven organizations. The modern streaming stack (MSS) is a lean, cloud-native, and economical alternative to classic streaming architectures, where it aims to make event-driven real-time applications viable for organizations.
The second half of the talk explores the MSS in detail, including its core components, their purposes, and how Kappa architecture has influenced it. Moreover, the talk lays out a few considerations before planning a new streaming application within an organization. The talk concludes by discussing the challenges in the streaming world and how vendors are trying to overcome them in the future.
ShareChat’s Path to High-Performance NoSQL with ScyllaDB (ScyllaDB)
🎥 Sign up for upcoming webinars or browse through our library of on-demand recordings here: https://www.scylladb.com/resources/webinars/
To power India’s leading social media platform, ShareChat needs to deliver impressive performance at rapidly-increasing scale – but without rapidly-increasing cost. With 180M monthly active users expecting real-time engagement with 2.5B posts per month, having the right database implementation is essential.
Hear how Geetish Nayak, Staff Engineer/Architect - Platforms at ShareChat modernized the database powering their services to stay ahead of these challenges. Geetish will share the strategies they applied to improve performance 3-5x while reducing costs 50-80%. You will learn:
- About their technical challenges related to supporting new locations, debugging, and performance at scale
- What NoSQL capabilities were most critical for preparing their data architecture for the company’s next level of growth
- How they measured the impact of their NoSQL migration, and what they have achieved so far
- Best Practices that ShareChat followed when adopting ScyllaDB
Confluent hosted a technical thought leadership session to discuss how leading organisations move to real-time architecture to support business growth and enhance customer experience.
Event Streaming CTO Roundtable for Cloud-native Kafka Architectures (Kai Wähner)
Technical thought leadership presentation to discuss how leading organizations move to real-time architecture to support business growth and enhance customer experience. This is a forum to discuss use cases with your peers to understand how other digital-native companies are utilizing data in motion to drive competitive advantage.
Agenda:
- Data in Motion with Event Streaming and Apache Kafka
- Streaming ETL Pipelines
- IT Modernisation and Hybrid Multi-Cloud
- Customer Experience and Customer 360
- IoT and Big Data Processing
- Machine Learning and Analytics
Couchbase Cloud No Equal (Rick Jacobs, Couchbase) Kafka Summit 2020 (HostedbyConfluent)
This session will describe and demonstrate the longstanding integration between Couchbase Server and Apache Kafka and will include descriptions of both the mechanics of the integration and practical situations when combining these products is appropriate.
Today, data lakes are widely used and have become extremely affordable as data volumes have grown. However, they are only meant for storage and by themselves provide no direct value. With up to 80% of data stored in the data lake today, how do you unlock the value of the data lake? The value lies in the compute engine that runs on top of a data lake.
Join us for this webinar where Ahana co-founder and Chief Product Officer Dipti Borkar will discuss how to unlock the value of your data lake with the emerging Open Data Lake analytics architecture.
Dipti will cover:
-Open Data Lake analytics - what it is and what use cases it supports
-Why companies are moving to an open data lake analytics approach
-Why the open source data lake query engine Presto is critical to this approach
Denny Lee introduced Azure DocumentDB, a fully managed NoSQL database service. DocumentDB provides elastic scaling of throughput and storage, global distribution with low latency reads and writes, and supports querying JSON documents with SQL and JavaScript. Common scenarios that benefit from DocumentDB include storing product catalogs, user profiles, sensor telemetry, and social graphs due to its ability to handle hierarchical and de-normalized data at massive scale.
Most data visualisation solutions today still work on data sources which are stored persistently in a data store, using the so-called "data at rest" paradigms. More and more data sources today provide a constant stream of data, from IoT devices to social media streams. These data streams publish with high velocity, and messages often have to be processed as quickly as possible. For the processing and analytics on the data, so-called stream processing solutions are available. But these provide only minimal or no visualisation capabilities. One way is to first persist the data into a data store and then use a traditional data visualisation solution to present the data.
If latency is not an issue, such a solution might be good enough. Another question is which data store solution is necessary to keep up with the high load on write and read. If it is not an RDBMS but a NoSQL database, then not all traditional visualisation tools might already integrate with the specific data store. Another option is to use a streaming visualisation solution. These are specially built for streaming data and often do not support batch data. A much better solution would be to have one tool capable of handling both batch and streaming data. This talk presents different architecture blueprints for integrating data visualisation into a fast data solution and highlights some of the products available to implement these blueprints.
Take Action: The New Reality of Data-Driven Business (Inside Analysis)
The Briefing Room with Dr. Robin Bloor and WebAction
Live Webcast on July 23, 2014
Watch the archive:
https://bloorgroup.webex.com/bloorgroup/lsr.php?RCID=360d371d3a49ad256942f55350aa0a8b
The waiting used to be the hardest part, but not anymore. Today’s cutting-edge enterprises can seize opportunities faster than ever, thanks to an array of technologies that enable real-time responsiveness across the spectrum of business processes. Early adopters are solving critical business challenges by enabling the rapid-fire design, development and production of very specific applications. Functionality can range from improved customer engagement to dynamic machine-to-machine interactions.
Register for this episode of The Briefing Room to learn from veteran Analyst Dr. Robin Bloor, who will tout a new era in data-driven organizations, and why a data flow architecture will soon be critical for industry leaders. He’ll be briefed by Sami Akbay of WebAction, who will showcase his company’s real-time data management platform, which combines all the component parts needed to access, process and leverage data big and small. He’ll explain how this new approach can provide game-changing power to organizations of all types and sizes.
Visit InsideAnalysis.com for more information.
UiPath Community Berlin: Orchestrator API, Swagger, and Test Manager API (UiPathCommunity)
Join this UiPath Community Berlin meetup to explore the Orchestrator API, Swagger interface, and the Test Manager API. Learn how to leverage these tools to streamline automation, enhance testing, and integrate more efficiently with UiPath. Perfect for developers, testers, and automation enthusiasts!
📕 Agenda
Welcome & Introductions
Orchestrator API Overview
Exploring the Swagger Interface
Test Manager API Highlights
Streamlining Automation & Testing with APIs (Demo)
Q&A and Open Discussion
👉 Join our UiPath Community Berlin chapter: https://community.uipath.com/berlin/
This session streamed live on April 29, 2025, 18:00 CET.
Check out all our upcoming UiPath Community sessions at https://community.uipath.com/events/.
How Can I use the AI Hype in my Business Context? (Daniel Lehner)
Is AI just hype? Or is it the game changer your business needs?
Everyone’s talking about AI but is anyone really using it to create real value?
Most companies want to leverage AI. Few know how.
✅ What exactly should you ask to find real AI opportunities?
✅ Which AI techniques actually fit your business?
✅ Is your data even ready for AI?
If you’re not sure, you’re not alone. This is a condensed version of the slides I presented at a Linkedin webinar for Tecnovy on 28.04.2025.
Spark is a powerhouse for large datasets, but when it comes to smaller data workloads, its overhead can sometimes slow things down. What if you could achieve high performance and efficiency without the need for Spark?
At S&P Global Commodity Insights, having a complete view of global energy and commodities markets enables customers to make data-driven decisions with confidence and create long-term, sustainable value. 🌍
Explore delta-rs + CDC and how these open-source innovations power lightweight, high-performance data applications beyond Spark! 🚀
The Evolution of Meme Coins: A New Era for Digital Currency (Abi John)
Analyze the growth of meme coins from mere online jokes to potential assets in the digital economy. Explore the community, culture, and utility as they elevate themselves to a new era in cryptocurrency.
Linux Support for SMARC: How Toradex Empowers Embedded Developers (Toradex)
Toradex brings robust Linux support to SMARC (Smart Mobility Architecture), ensuring high performance and long-term reliability for embedded applications. Here’s how:
• Optimized Torizon OS & Yocto Support – Toradex provides Torizon OS, a Debian-based easy-to-use platform, and Yocto BSPs for customized Linux images on SMARC modules.
• Seamless Integration with i.MX 8M Plus and i.MX 95 – Toradex SMARC solutions leverage NXP’s i.MX 8 M Plus and i.MX 95 SoCs, delivering power efficiency and AI-ready performance.
• Secure and Reliable – With Secure Boot, over-the-air (OTA) updates, and LTS kernel support, Toradex ensures industrial-grade security and longevity.
• Containerized Workflows for AI & IoT – Support for Docker, ROS, and real-time Linux enables scalable AI, ML, and IoT applications.
• Strong Ecosystem & Developer Support – Toradex offers comprehensive documentation, developer tools, and dedicated support, accelerating time-to-market.
With Toradex’s Linux support for SMARC, developers get a scalable, secure, and high-performance solution for industrial, medical, and AI-driven applications.
Do you have a specific project or application in mind where you're considering SMARC? We can help with Free Compatibility Check and help you with quick time-to-market
For more information: https://www.toradex.com/computer-on-modules/smarc-arm-family
Andrew Marnell: Transforming Business Strategy Through Data-Driven Insights (Andrew Marnell)
With expertise in data architecture, performance tracking, and revenue forecasting, Andrew Marnell plays a vital role in aligning business strategies with data insights. Andrew Marnell’s ability to lead cross-functional teams ensures businesses achieve sustainable growth and operational excellence.
AI EngineHost Review: Revolutionary USA Datacenter-Based Hosting with NVIDIA ... (SOFTTECHHUB)
I started my online journey with several hosting services before stumbling upon Ai EngineHost. At first, the idea of paying one fee and getting lifetime access seemed too good to pass up. The platform is built on reliable US-based servers, ensuring your projects run at high speeds and remain safe. Let me take you step by step through its benefits and features as I explain why this hosting solution is a perfect fit for digital entrepreneurs.
Generative Artificial Intelligence (GenAI) in Business (Dr. Tathagat Varma)
My talk for the Indian School of Business (ISB) Emerging Leaders Program Cohort 9. In this talk, I discussed key issues around adoption of GenAI in business - benefits, opportunities and limitations. I also discussed how my research on Theory of Cognitive Chasms helps address some of these issues
AI Changes Everything – Talk at Cardiff Metropolitan University, 29th April 2... (Alan Dix)
Talk at the final event of Data Fusion Dynamics: A Collaborative UK-Saudi Initiative in Cybersecurity and Artificial Intelligence funded by the British Council UK-Saudi Challenge Fund 2024, Cardiff Metropolitan University, 29th April 2025
https://alandix.com/academic/talks/CMet2025-AI-Changes-Everything/
Is AI just another technology, or does it fundamentally change the way we live and think?
Every technology has a direct impact with micro-ethical consequences, some good, some bad. However more profound are the ways in which some technologies reshape the very fabric of society with macro-ethical impacts. The invention of the stirrup revolutionised mounted combat, but as a side effect gave rise to the feudal system, which still shapes politics today. The internal combustion engine offers personal freedom and creates pollution, but has also transformed the nature of urban planning and international trade. When we look at AI the micro-ethical issues, such as bias, are most obvious, but the macro-ethical challenges may be greater.
At a micro-ethical level AI has the potential to deepen social, ethnic and gender bias, issues I have warned about since the early 1990s! It is also being used increasingly on the battlefield. However, it also offers amazing opportunities in health and educations, as the recent Nobel prizes for the developers of AlphaFold illustrate. More radically, the need to encode ethics acts as a mirror to surface essential ethical problems and conflicts.
At the macro-ethical level, by the early 2000s digital technology had already begun to undermine sovereignty (e.g. gambling), market economics (through network effects and emergent monopolies), and the very meaning of money. Modern AI is the child of big data, big computation and ultimately big business, intensifying the inherent tendency of digital technology to concentrate power. AI is already unravelling the fundamentals of the social, political and economic world around us, but this is a world that needs radical reimagining to overcome the global environmental and human challenges that confront us. Our challenge is whether to let the threads fall as they may, or to use them to weave a better future.
Technology Trends in 2025: AI and Big Data Analytics (InData Labs)
At InData Labs, we have been keeping an ear to the ground, looking out for AI-enabled digital transformation trends coming our way in 2025. Our report will provide a look into the technology landscape of the future, including:
-Artificial Intelligence Market Overview
-Strategies for AI Adoption in 2025
-Anticipated drivers of AI adoption and transformative technologies
-Benefits of AI and Big data for your business
-Tips on how to prepare your business for innovation
-AI and data privacy: Strategies for securing data privacy in AI models, etc.
Download your free copy now and implement the key findings to improve your business.
Artificial Intelligence is providing benefits in many areas of work within the heritage sector, from image analysis, to ideas generation, and new research tools. However, it is more critical than ever for people, with analogue intelligence, to ensure the integrity and ethical use of AI. Including real people can improve the use of AI by identifying potential biases, cross-checking results, refining workflows, and providing contextual relevance to AI-driven results.
News about the impact of AI often paints a rosy picture. In practice, there are many potential pitfalls. This presentation discusses these issues and looks at the role of analogue intelligence and analogue interfaces in providing the best results to our audiences. How do we deal with factually incorrect results? How do we get content generated that better reflects the diversity of our communities? What roles are there for physical, in-person experiences in the digital world?
Enhancing ICU Intelligence: How Our Functional Testing Enabled a Healthcare I... (Impelsys Inc.)
Impelsys provided a robust testing solution, leveraging a risk-based and requirement-mapped approach to validate ICU Connect and CritiXpert. A well-defined test suite was developed to assess data communication, clinical data collection, transformation, and visualization across integrated devices.
Semantic Cultivators: The Critical Future Role to Enable AI (artmondano)
By 2026, AI agents will consume 10x more enterprise data than humans, but with none of the contextual understanding that prevents catastrophic misinterpretations.
Mobile App Development Company in Saudi Arabia (Steve Jonas)
EmizenTech is a globally recognized software development company, proudly serving businesses since 2013. With over 11+ years of industry experience and a team of 200+ skilled professionals, we have successfully delivered 1200+ projects across various sectors. As a leading Mobile App Development Company In Saudi Arabia we offer end-to-end solutions for iOS, Android, and cross-platform applications. Our apps are known for their user-friendly interfaces, scalability, high performance, and strong security features. We tailor each mobile application to meet the unique needs of different industries, ensuring a seamless user experience. EmizenTech is committed to turning your vision into a powerful digital product that drives growth, innovation, and long-term success in the competitive mobile landscape of Saudi Arabia.
Architecting a next generation data platform
1. Hadoop Application Architectures: Architecting a Next Generation Data Platform
Strata Data Conference, New York 2017
tiny.cloudera.com/app-arch-newyork
tiny.cloudera.com/nyquestions
Mark Grover | @mark_grover
Jonathan Seidman | @jseidman
Gwen Shapira | @gwenshap
2. Questions?
tiny.cloudera.com/nyquestions
Logistics
▪ Break at 3:00 – 3:30 PM
▪ Questions at the end of each section
▪ Slides at tiny.cloudera.com/app-arch-newyork
▪ Code at https://github.com/hadooparchitecturebook/Taxi360
4. Questions?
tiny.cloudera.com/nyquestions
About the presenters
▪ Product Manager at Lyft
▪ Formerly Software Engineer on Spark at Cloudera
▪ Committer on Apache Bigtop, PMC member on Apache Sentry and Apache Spot (incubating)
▪ Contributor to Apache Spark, Hadoop, Hive, Sqoop, Pig, Flume
Mark Grover
6. Questions?
tiny.cloudera.com/nyquestions
About the presenters
▪ Software Engineer at Cloudera
▪ Contributor to Apache Sqoop
▪ Previously Technical Lead on the big data team at Orbitz, co-founder of the Chicago Hadoop User Group and Chicago Big Data
Jonathan Seidman
20. Questions?
tiny.cloudera.com/nyquestions
Requirements
▪ To support all this, we need:
- Reliable ingestion of streaming and batch data.
- Ability to perform transformations on streaming data in flight.
- Ability to perform sophisticated processing of historical data.
25. Questions?
tiny.cloudera.com/nyquestions
Key to Customer 360 Success
Your project is only as good as the quality and variety of data sources
[Diagram: data sources feeding the platform – streaming vehicle data (MQTT), geo-location/traffic data, customer data, maintenance data, and other sources, arriving as files (CSV? XML? JSON?), Twitter, mainframe, and databases (Salesforce?)]
26. Questions?
tiny.cloudera.com/nyquestions
Data Producers: Flume vs. Kafka
▪ Flume – well integrated with Hadoop.
▪ Part of Hadoop ecosystem
▪ Great choice when ingesting data into HDFS.
▪ Can support simple transformations.
▪ Minimal coding – built in support for common data sources.
▪ Kafka – flexible, get-everything pipe
▪ Producers in ~ 20 languages
▪ REST API
▪ Huge connector ecosystem
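To make the Kafka path concrete, here is a minimal producer sketch in Scala (a sketch only; the broker address, topic name, key, and payload are illustrative, not from the deck):

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

val props = new Properties()
props.put("bootstrap.servers", "broker1:9092") // illustrative broker address
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

val producer = new KafkaProducer[String, String](props)
// Key = taxi medallion, value = the raw trip event (here a JSON string).
producer.send(new ProducerRecord("taxi-trip-input", "medallion-42", """{"lat":40.7,"lon":-74.0}"""))
producer.close()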
37. Questions?
tiny.cloudera.com/nyquestions
Buffering Data
▪ What do we mean by “buffering” and why do we need it?
event,event,event,event,event,event…
This is bad!
▪ Network partitions happen
▪ Producers and consumers work at different rates
▪ Reliable storage is hard
Stream processing is hard
Let's do one at a time
40. Questions?
tiny.cloudera.com/nyquestions
What is Kafka?
▪ It’s like a message queue, right?
- Actually, it’s a “distributed commit log”
- Or “streaming data platform”
[Diagram: a data source appends to a log of offsets 0-8; data consumers A and B read from it independently]
41. Questions?
tiny.cloudera.com/nyquestions
Topics and Partitions
▪ Messages are organized into topics, and each topic is split into partitions.
- Each partition is an immutable, time-sequenced log of messages on disk.
- Note that time ordering is guaranteed within, but not across, partitions.
[Diagram: a data source writing to a topic with three partitions (0, 1, 2), each an append-only log of offsets 0-8]
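Continuing the producer sketch from the Flume vs. Kafka section (topic and keys are again illustrative), the default partitioner is what makes the per-partition ordering guarantee useful in practice:

// Records that share a key always land in the same partition
// (default partitioner: hash of the key modulo the partition count),
// so per-medallion ordering is preserved across a partitioned topic.
for (i <- 1 to 3) {
  producer.send(new ProducerRecord("taxi-trip-input", "medallion-42", s"event-$i"))
}
// A record with a different key may go to a different partition; there is
// no ordering guarantee between "medallion-42" and "medallion-7" events.
producer.send(new ProducerRecord("taxi-trip-input", "medallion-7", "event-1"))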
45. Questions?
tiny.cloudera.com/nyquestions
Kafka Considerations – Reliability
▪ Different reliability levels for topics:
- taxi-trip-input (taxi trip data): 100%, dups are OK ("at least once")
- customer-sentiment (Twitter): <= 100% ("at most once")
News flash: Kafka's exactly-once producer is on the way
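Which reliability level a topic's producer gets is largely a configuration choice. A hedged sketch against the producer properties from the earlier example (acks and retries are standard Kafka producer configs; the values are illustrative):

// "At least once" for the critical taxi-trip topic: wait for all in-sync
// replicas to acknowledge and retry transient failures (duplicates possible).
props.put("acks", "all")
props.put("retries", "3")

// "At most once" flavor for the best-effort sentiment topic: fire and forget.
props.put("acks", "0")
props.put("retries", "0")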
56. Questions?
tiny.cloudera.com/nyquestions
How many partitions?
§ Adding partitions late in the game is painful
§ Basic formula:
total desired throughput / throughput of slowest consumer or producer
§ Or roughly 25 GB of disk space per partition
§ Not too many because:
- Each partition takes broker heap memory and file handles
- Each partition slows down node shutdown / recovery
- 1000 – 4000 partitions per broker max
- Producers will produce smaller batches – lower throughput
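For example (illustrative numbers only): if the topic must sustain 1,000 MB/s and the slowest consumer can handle 50 MB/s per partition, the formula gives 1000 / 50 = 20 partitions; round up for headroom while staying well under the per-broker limits above.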
58. Questions?
tiny.cloudera.com/nyquestions
Guarding Against Message Loss
§ Producer – What happens if the producer loses connection to Kafka and the buffer overflows?
- You get an exception. You can choose to… block? Write to file?
§ Source – What happens if events are lost before getting sent to producer?
- Once again use some kind of buffer to provide sufficient retention of data.
63. Questions?
tiny.cloudera.com/nyquestions
What do we mean by streaming?
- Real-time: constant low milliseconds & under
- Near real-time: low milliseconds to seconds, delay in case of failures
- Batch: 10s of seconds or more, re-run in case of failures
65. Questions?
tiny.cloudera.com/nyquestions
But, there’s no free lunch
- Real-time: constant low milliseconds & under – "difficult" architectures, lower latency
- Near real-time: low milliseconds to seconds, delay in case of failures
- Batch: 10s of seconds or more, re-run in case of failures – "easier" architectures, higher latency
72. Questions?
tiny.cloudera.com/nyquestions
#1 – Simple Ingestion
1. Zero transformation
- No transformation, plain ingest
- Keep the original format – SequenceFile, Text, etc.
- Allows storing data that may have schema errors
2. Format transformation
- Simply change the format of the data
- To a structured format – say, Avro
- Can do schema validation
3. Atomic transformation
- e.g., mask a credit card number
74. Questions?
tiny.cloudera.com/nyquestions
Where to store the context?
1. Locally Broadcast Cached Dim Data
- Local to Process (On Heap, Off Heap)
- Local to Node (Off Process)
2. Partitioned Cache
- Shuffle to move new data to partitioned cache
3. External Fetch Data (e.g. HBase, Memcached)
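As a sketch of option 1 with Spark (the dimension data and all names here are illustrative), a broadcast variable ships a small cached copy of the context to every task:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("context-cache").getOrCreate()
// Small dimension ("context") data, broadcast once to each executor.
val dim: Map[String, String] = Map("4201" -> "MD", "4202" -> "VA")
val dimBroadcast = spark.sparkContext.broadcast(dim)

val enriched = spark.sparkContext
  .parallelize(Seq("4201", "4202", "4203"))
  .map(userId => (userId, dimBroadcast.value.getOrElse(userId, "UNKNOWN")))
enriched.collect().foreach(println)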
81. Questions?
tiny.cloudera.com/nyquestions
Delivery Types
▪ At most once
- Not good for many cases
- Only where performance/SLA is more important than accuracy
▪ Exactly once
- Expensive to achieve but desirable
▪ At least once
- Easiest to achieve
91. Questions?
tiny.cloudera.com/nyquestions
Spark Streaming - Gaps
§ Latency is not as low
- Efforts towards reducing latency, e.g. RISElab's Drizzle
§ Globally consistent execution state
- Requires stopping the overall execution of the distributed computation
- Eagerly persisting records in transit means larger snapshots
92. Questions?
tiny.cloudera.com/nyquestions
Flink
▪ True “streaming” system, but not as feature rich as Spark
▪ Much better event time handling
▪ Good built-in backpressure support
▪ Allows stateful transformations
▪ Lower Latency
- No Micro Batching
- Asynchronous Barrier Snapshotting (ABS)
100. Questions?
tiny.cloudera.com/nyquestions
Kafka Streams
▪ Good integration with Kafka
▪ Light-weight library (not a framework)
▪ No micro-batching, uses Kafka as internal messaging layer
▪ Maintains local state per node (in RocksDB, or an in-memory hash map)
▪ Handles late events
▪ Stream-to-stream joins
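A minimal Kafka Streams sketch in Scala (a sketch only, assuming a Kafka 1.0+ Java API used from Scala 2.12 with SAM conversion; topic names are illustrative; the count's state lives in a local store as described above):

import java.util.Properties
import org.apache.kafka.common.serialization.Serdes
import org.apache.kafka.streams.{KafkaStreams, StreamsBuilder, StreamsConfig}

val props = new Properties()
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "taxi-trip-counter")
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")
props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass)
props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass)

val builder = new StreamsBuilder()
// Count trips per medallion; state is kept locally and backed by Kafka.
builder.stream[String, String]("taxi-trip-input")
  .groupByKey()
  .count()
  .toStream()
  .mapValues((v: java.lang.Long) => v.toString)
  .to("taxi-trip-counts")

val streams = new KafkaStreams(builder.build(), props)
streams.start()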
115. Questions?
tiny.cloudera.com/nyquestions
Compression Codecs
- Snappy: 2x-3x : Fast Read, Fast Write
- Lzo : 2x-3x : Fast Read, Fast Write
- Gzip : ~8x: ~Fast Read, Normal Write
- Default : ~8x: ~Fast Read, Normal Write
- BZip2 : ~10x ~Fast Read, Slow Write
- Others ..
- Always be skeptical
- All data compresses differently
- Use your own data
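The "use your own data" advice is easy to act on. A toy harness (gzip via the JDK only; swap in the Snappy/LZO codecs from your Hadoop classpath for a real test; "sample.log" is an assumed input file):

import java.io.ByteArrayOutputStream
import java.util.zip.GZIPOutputStream

// Compress a byte array with gzip and report the compressed size.
def gzipSize(bytes: Array[Byte]): Int = {
  val bos = new ByteArrayOutputStream()
  val gz = new GZIPOutputStream(bos)
  gz.write(bytes); gz.close()
  bos.size()
}

val sample = scala.io.Source.fromFile("sample.log").mkString.getBytes("UTF-8")
val zipped = gzipSize(sample)
println(f"raw=${sample.length} gzip=$zipped ratio=${sample.length.toDouble / zipped}%.1fx")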
116. Questions?
tiny.cloudera.com/nyquestions
Introducing the Hive Metastore
- Hive Metastore
- Adds a table-like metadata layer over a file system, block store, NoSQL store, or other
- Allows for SQL access
- Allows for greater security options
- Allows for external metadata
- Allows for partitioning
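A sketch of registering a dataset in the metastore (via Spark SQL with Hive support; the table name, columns, and location are illustrative), after which Hive, Impala, and Spark SQL can all see the same table and partitions:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("metastore-example")
  .enableHiveSupport()
  .getOrCreate()

spark.sql("""
  CREATE EXTERNAL TABLE IF NOT EXISTS watch_events (
    user_id STRING, movie_id STRING, event_type STRING, event_ts TIMESTAMP)
  PARTITIONED BY (event_date STRING)
  STORED AS PARQUET
  LOCATION '/data/watch_events'
""")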
120. Questions?
tiny.cloudera.com/nyquestions
Thinking about Object/Tables
1. Let's start off easy
- Use case: we are a Netflix-type company and we have a log of users and movies watched that looks something like this:

User ID | Age | Account Start Date | Category of User | Movie Watched | Movie Category | Start Time | Events List
Bob | 42 | 12/12/2012 | Basic | Die Hard | Action | 5/4/2016 12:00 | Play 0, pause at 15, FF at 40 to 55, E at 90
Kat | 31 | 12/12/2012 | Platinum | Beauty and the Beast | Family | 5/4/2016 12:00 | Play 0, pause at 15, FF at 40 to 55, E at 90
121. Questions?
tiny.cloudera.com/nyquestions
Thinking about Object/Tables
1. To make this into objects we need to do some separation:
[Entity model: User (user_id, age, st_dt, category); Movie (movie_id, title, category); Watch_session (watch_id, st_dt, en_dt, user_id, movie_id); Watch_events (watch_id, st_dt, type, duration); Category_typ (category_id, stream_rt, is_feature_enabled). One-to-many relationships run from User and Movie to Watch_session, and from Watch_session to Watch_events.]
122. Questions?
tiny.cloudera.com/nyquestions
Query Considerations
- Data is normally big, so:
- Partition according to access patterns
- Join with care
- Consider sampling or local testing before experimenting
- Data is files:
- Latency to accessibility is high – seconds, minutes, or more.
123. Questions?
tiny.cloudera.com/nyquestions
Look for big tables
[The same entity model as on the previous slide, repeated here to identify which tables will grow large.]
126. Questions?
tiny.cloudera.com/nyquestions
View Strategies
Models: the Hive relational model and the Hive nested model
Views: Hive normal views and Hive materialized table views
- Materialized table views: use in the cases where the view requires a join that is done through a shuffle
- Normal views: use only for views that filter records/columns or mark fields
133. Questions?
tiny.cloudera.com/nyquestions
Nested
▪ Less Space than Denormalization
▪ Still have tables but the cost of joins is all but gone
▪ Also great for Cartesian joins
- N x M vs N + M
▪ Not really supported yet with Kudu or HBase with SQL
134. Questions?
tiny.cloudera.com/nyquestions
Nested Example
CREATE TABLE fact_contacts (id BIGINT, name STRING, address
STRING) STORED AS PARQUET;
CREATE TABLE dim_phones
(
contact_id BIGINT
, category STRING
, international_code STRING
, area_code STRING
, exchange STRING
, extension STRING
, mobile BOOLEAN
, carrier STRING
, current BOOLEAN
, service_start_date TIMESTAMP
, service_end_date TIMESTAMP
) STORED AS PARQUET; -- assumed completion, matching fact_contacts above
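The transcript stops at the relational version; a nested counterpart might look like the following (a sketch assuming Hive-style complex types, reusing the SparkSession from the metastore sketch above; table and field names are illustrative, not the deck's next slide):

spark.sql("""
  CREATE TABLE contacts_nested (
    id BIGINT, name STRING, address STRING,
    phones ARRAY<STRUCT<
      category: STRING, international_code: STRING, area_code: STRING,
      exchange: STRING, extension: STRING, mobile: BOOLEAN, carrier: STRING,
      is_current: BOOLEAN, service_start_date: TIMESTAMP,
      service_end_date: TIMESTAMP>>)
  STORED AS PARQUET
""")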
136. Questions?
tiny.cloudera.com/nyquestions
De-normalized vs Nested
- Nested Pros
- Co-location
- Faster to group by
- Faster to window
- Joins are free
- Less data
- Better compression
- Tables and columns can be read without penalty from ones not read
- Great for limiting the effort of Cartesian joins
- Nested Cons
- Size limitation of the parent row
- Adding a child requires rewriting the whole parent record
160. Questions?
tiny.cloudera.com/nyquestions
Hash Map
- There is a key and a value
- It is really fast to grab a key/value
- It is really fast to add a key/value
- Iteration is also possible
[Diagram: a client reading from a key/value table with keys A through G, each with value 1]
161. Questions?
tiny.cloudera.com/nyquestions
Log with Compactions
- When new records come in they don't rewrite the old
- They compact in
Before: keys A-G, each written at time 1 with value 101.
Incoming: A, D, and F at time 2 with value 102; F and H at time 3 with value 103.
After compaction: A=102 (t2), B=101 (t1), C=101 (t1), D=102 (t2), E=101 (t1), F=103 (t3), G=101 (t1), H=103 (t3).
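A toy illustration of the compaction rule (plain Scala, illustrative data): for each key, the newest record wins.

case class Rec(key: String, time: Int, value: Int)

val base = Seq(Rec("A", 1, 101), Rec("B", 1, 101), Rec("F", 1, 101), Rec("G", 1, 101))
val incoming = Seq(Rec("A", 2, 102), Rec("F", 3, 103), Rec("H", 3, 103))

// Compaction keeps, per key, the record with the highest timestamp.
val compacted = (base ++ incoming)
  .groupBy(_.key)
  .map { case (_, recs) => recs.maxBy(_.time) }
  .toSeq.sortBy(_.key)

compacted.foreach(println) // A=102, B=101, F=103, G=101, H=103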
162. Questions?
tiny.cloudera.com/nyquestions
Log with Compactions
- Write path:
- Get the location for the record (cached)
- First to the WAL
- Then to the memstore
- Sorting & batching
- Flush to a new HFile
- Later HFiles will be compacted
[Diagram: client → RegionServer (WAL + memstore) → new HFiles on HDFS, with the Master coordinating]
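The write path above is what a plain HBase client put goes through. A minimal sketch (assuming a table "profiles" with column family "p" already exists; all names are illustrative):

import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Put}
import org.apache.hadoop.hbase.util.Bytes

val conn = ConnectionFactory.createConnection(HBaseConfiguration.create())
val table = conn.getTable(TableName.valueOf("profiles"))

val put = new Put(Bytes.toBytes("user-4201"))
put.addColumn(Bytes.toBytes("p"), Bytes.toBytes("state"), Bytes.toBytes("MD"))
table.put(put) // hits the WAL, then the memstore, before flushing to HFiles

table.close()
conn.close()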
163. Questions?
tiny.cloudera.com/nyquestions
Ordered
- All records/columns are ordered
- Ordering allows for simpler indexing
- Ordering allows for simpler compactions
- We will also use this ordering for:
- Windowing
- Time series
- Local scanning
[Diagram: the same client → RegionServer → HFiles-on-HDFS picture as above]
165. Questions?
tiny.cloudera.com/nyquestions
So what about SQL
- Well, SQL could totally work
- CQL for Cassandra
- Hive and Spark SQL on HBase
- Why is it not the best idea?
- Built more for point lookups
- Scans are not as fast as Parquet
- However, the mutability may be more important than speed
- Partitioning is not simple
- It must be put into the key
167. Questions?
tiny.cloudera.com/nyquestions
HBase Model
[Diagram: client, Master, Region Server 1, Region Server 2]
- A Region Server owns range splits
- Region Server 1 fails
- The Master needs to figure that out
- The Master needs to assign a new Region Server to own the splits
- Region Server 2 has to get organized
- Region Server 2 is ready to serve reads and writes
175. Questions?
tiny.cloudera.com/nyquestions
Lucene Indexing (Facets)
- Facets are a side effect of our wonderful indexes
- They allow us to count all the documents that belong to given indexes to produce:
- Grouped counts
- Charts and graphs (Kibana or Banana)
- People will also call this access pattern "cubing" a dataset
177. Questions?
tiny.cloudera.com/nyquestions
Lucene Indexing (Facets Example)
- Time series example

Document ID | Hour of Day | User | State | Event
1 | 12 | 4201 | MD | click
2 | 12 | 4202 | VA | click
3 | 12 | 4203 | VA | click
4 | 1 | 4201 | MD | click
5 | 1 | 4202 | VA | view
6 | 2 | 4204 | CA | click
7 | 2 | 4205 | VA | view
8 | 2 | 4201 | MD | click
178. Questions?
tiny.cloudera.com/nyquestions
Lucene Indexing (Facets Example)

Document ID | Hour of Day | User | State | Event
1 | 12 | 4201 | MD | click
2 | 12 | 4202 | VA | click
3 | 12 | 4203 | VA | click
4 | 1 | 4201 | MD | click
5 | 1 | 4202 | VA | view
6 | 2 | 4204 | CA | click
7 | 2 | 4205 | VA | view
8 | 2 | 4201 | MD | click
9 | 2 | 4204 | CA | click

Each facet is an inverted index from value to document IDs:
- Hour of Day: 12 → 1, 2, 3; 1 → 4, 5; 2 → 6, 7, 8, 9
- User: 4201 → 1, 4, 8; 4202 → 2, 5; 4203 → 3; 4204 → 6, 9; 4205 → 7
- State: MD → 1, 4, 8; VA → 2, 3, 5, 7; CA → 6, 9
- Event: click → 1, 2, 3, 4, 6, 8, 9; view → 5, 7
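The inverted indexes above are just value → document-ID lists. A plain Scala sketch of building one and deriving facet counts (the data mirrors the example table; this illustrates the idea, not Lucene's actual internals):

case class Doc(id: Int, hour: Int, user: Int, state: String, event: String)

val docs = Seq(
  Doc(1, 12, 4201, "MD", "click"), Doc(2, 12, 4202, "VA", "click"),
  Doc(3, 12, 4203, "VA", "click"), Doc(4, 1, 4201, "MD", "click"),
  Doc(5, 1, 4202, "VA", "view"),   Doc(6, 2, 4204, "CA", "click"),
  Doc(7, 2, 4205, "VA", "view"),   Doc(8, 2, 4201, "MD", "click"),
  Doc(9, 2, 4204, "CA", "click"))

// Inverted index: facet value -> document IDs.
val byState: Map[String, Seq[Int]] = docs.groupBy(_.state).mapValues(_.map(_.id)).toMap
// Facet counts (what Kibana/Banana would chart) fall out of the list sizes.
val stateCounts = byState.map { case (k, ids) => (k, ids.size) }
println(byState)      // e.g. MD -> List(1, 4, 8)
println(stateCounts)  // e.g. MD -> 3, VA -> 4, CA -> 2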
181. Questions?
tiny.cloudera.com/nyquestions
Lucene Indexing (Facets Example)
- Note the bucketing and ordered pattern
[Diagram: ordered posting lists side by side – Hour of Day 2 → 6, 7, 8, 9; State MD → 1, 4, 8; State VA → 2, 3, 5, 7; State CA → 6, 9]
182. Questions?
tiny.cloudera.com/nyquestions
Lucene Indexing (Facets Example)
- Note the bucketing and ordered pattern
[Diagram: the same posting lists shown before and after a new CA document arrives ("+1 CA") and is appended to the ordered CA list]
183. Questions?
tiny.cloudera.com/nyquestions
Lucene Indexing (Facets Example)
- Note the bucketing and ordered pattern
[Diagram: the posting lists with new documents arriving for VA ("+1 VA"), MD ("+1 MD"), and CA ("+1 CA"), each appended to its own ordered list]
185. Questions?
tiny.cloudera.com/nyquestions
Writing Latency
- Lucene indexing is more expensive than NoSQL work
- Think of it as micro-batching
- Larger batches ~= better throughput
- Compaction is also involved
- Deletes impact storage and performance until they are compacted
190. Questions?
tiny.cloudera.com/nyquestions
BSP: Bulk Synchronous Parallel
- Process every node atomically
- The node gets all messages sent to it
- Nodes can mutate themselves and their edges
- Nodes can send messages to other nodes
- But nothing is received yet
- BSP waits until all the node processing is done
- Then sends messages to the right partition
- Repeat
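A toy single-machine sketch of one BSP superstep (plain Scala, illustrative only; not any particular framework's API): every node processes its inbox atomically, outgoing messages are buffered, and delivery happens only at the barrier.

case class Node(id: Int, var value: Double, edges: Seq[Int])

def superstep(nodes: Iterable[Node], inbox: Map[Int, Seq[Double]]): Map[Int, Seq[Double]] = {
  val outbox = scala.collection.mutable.Map[Int, Vector[Double]]().withDefaultValue(Vector.empty)
  for (node <- nodes) {
    // The node sees every message sent to it in the previous superstep...
    val msgs = inbox.getOrElse(node.id, Seq.empty)
    // ...and may mutate itself (here: move to the average of its inbox).
    if (msgs.nonEmpty) node.value = msgs.sum / msgs.size
    // Messages to neighbors are buffered; nothing is received mid-superstep.
    node.edges.foreach(dst => outbox(dst) = outbox(dst) :+ node.value)
  }
  outbox.toMap // barrier: all processing done, now deliver and repeat
}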
194. Questions?
tiny.cloudera.com/nyquestions
Why have batch processing?
▪ When you need a larger context
- Say, to train a model
▪ Complex periodic job that does something
- Convert data to a nested structure for reduced number of shuffles
▪ In our use case:
- Kudu -> HDFS nested is batch processing
- The KMeans calculation is also in batch
204. Questions?
tiny.cloudera.com/nyquestions
Why have REST server?
▪ Tired of business people telling us how to access data
▪ Serves as an interface between the data engineers and business folks
▪ Lets business folks decide access patterns
▪ And engineers optimize those patterns
▪ Brownie points from your boss
▪ And, it’s not that difficult to write!
205. Questions?
tiny.cloudera.com/nyquestions
Don’t believe me?
import org.mortbay.jetty.Server
import org.mortbay.jetty.servlet.{Context, ServletHolder}
…
val server = new Server(port)
val sh = new ServletHolder(classOf[ServletContainer])
sh.setInitParameter("com.sun.jersey.config.property.resourceConfigClass",
"com.sun.jersey.api.core.PackagesResourceConfig")
sh.setInitParameter("com.sun.jersey.config.property.packages",
"com.hadooparchitecturebook.taxi360.server.hbase")
sh.setInitParameter("com.sun.jersey.api.json.POJOMappingFeature", "true”)
val context = new Context(server, "/", Context.SESSIONS)
context.addServlet(sh, "/*”)
server.start()
server.join()
227. Questions?
tiny.cloudera.com/nyquestions
Other Sessions
▪ Ask Us Anything session – Thursday, 1:15 PM
▪ The Three Realities of Modern Programming: the Cloud, Microservices, and the
Explosion of Data (Gwen) – Thursday 11:20 AM
▪ One Cluster Does Not Fit All: Architecture Patterns for Multicluster Apache Kafka
Deployments (Gwen) – Thursday 2:05 PM
▪ Managing Successful Big Data Projects (Ted Malaska and Jonathan) – Thursday
4:35 PM