Slides on Gobblin that I used for a presentation at VLDB 2015 in the Industrial 2 track. Check out Gobblin at https://github.com/linkedin/gobblin.
The document discusses running Gobblin, an open source data ingestion framework, on YARN. It provides an overview of the motivations and architecture when running Gobblin on YARN, including better resource utilization, support for Gobblin as a continuous long-running service, and better fit for streaming ingestion. Key implementation details covered include the use of Apache Helix for distributed task execution and coordination, log aggregation, and security/token management.
Big Data Ingestion @ Flipkart Data Platform – Navneet Gupta
The document discusses Flipkart's Data Platform (FDP), which ingests and processes large amounts of data from across Flipkart's teams. FDP is divided into ingestion, processing, and consumption sub-teams, and acts as a broker between teams to exchange raw and processed data. It ingests around 2 billion records per day through various mechanisms like streaming endpoints, Hadoop, and tools. Ingested data is validated against schemas and sent to Kafka for temporary storage before being copied to Hadoop and consumed by batch and real-time processing systems. FDP provides capabilities for data processing and access but does not build applications on top of the data.
How NerdWallet uses Gobblin (https://github.com/linkedin/gobblin) today, some pending contributions, and our future roadmap asks.
More Data, More Problems: Scaling Kafka-Mirroring Pipelines at LinkedIn – confluent
(Celia Kung, LinkedIn) Kafka Summit SF 2018
For several years, LinkedIn has been using Kafka MirrorMaker as the mirroring solution for copying data between Kafka clusters across data centers. However, as LinkedIn data continued to grow, mirroring trillions of Kafka messages per day across data centers uncovered the scale limitations and operability challenges of Kafka MirrorMaker. To address such issues, we have developed a new mirroring solution, built on top of our stream ingestion service, Brooklin. Brooklin MirrorMaker aims to provide improved performance and stability, while facilitating better management through finer control of data pipelines. Through flushless Kafka produce, dynamic management of data pipelines, and per-partition error handling and flow control, we are able to increase throughput, better withstand consume and produce failures, and reduce overall operating costs. As a result, we have eliminated the major pain points of Kafka MirrorMaker. In this talk, we will dive deeper into the challenges LinkedIn has faced with Kafka MirrorMaker, how we tackled them with Brooklin MirrorMaker, and our plans for iterating further on this new mirroring solution.
Lessons Learned Migrating from IBM BigInsights to Hortonworks Data Platform – DataWorks Summit
The document summarizes lessons learned from migrating from IBM BigInsights to Hortonworks Data Platform (HDP). It discusses challenges with the IBM platform around interoperability, support, and compatibility. It then outlines the scope of work for the migration project, including moving batch jobs, scripts, data, and environments to HDP. Finally, it discusses lessons learned around securing access, ingest frameworks, stakeholder engagement, data migration, and operational maturity when transitioning platforms.
This document discusses Pinot, Uber's real-time analytics platform. It provides an overview of Pinot's architecture and data ingestion process, describes a case study on modeling trip data in Pinot, and benchmarks Pinot's performance on ingesting large volumes of data and answering queries in real-time.
Jim Dowling – Interactive Flink analytics with HopsWorks and Zeppelin – Flink Forward
This document discusses interactive analytics with HopsWorks and Zeppelin. It summarizes HopsWorks, a frontend for Hops that supports true multi-tenancy, free-text search across metadata, and interactive analytics with Flink and Zeppelin. It also discusses how HopsFS and HopsYARN improve on HDFS and YARN architectures with metadata stored in a distributed database for consistency and global search.
Flink SQL & TableAPI in Large Scale Production at Alibaba – DataWorks Summit
The search and recommendation systems for Alibaba’s e-commerce platform use batch and streaming processing heavily. Flink SQL and the Table API (a SQL-like DSL) provide a simple, flexible, and powerful language to express data processing logic. More importantly, they open the door to unifying the semantics of batch and streaming jobs.
Blink is a project at Alibaba that improves Apache Flink to make it ready for large-scale production use. To support our products, we made many improvements to Flink SQL and the Table API in Alibaba's Blink project. We added support for User-Defined Table Functions (UDTF), User-Defined Aggregates (UDAGG), window aggregates, retraction, and more. We are actively working with the Flink community to contribute these improvements back. In this talk, we will present the rationale, semantics, design, and implementation of these improvements. We will also share our experience running large-scale Flink SQL and Table API jobs at Alibaba.
Bigger, Faster, Easier: Building a Real-Time Self Service Data Analytics Ecos... – DataWorks Summit
This document discusses technologies for data ingestion, transformation, and analytics. It introduces Gobblin for scalable data ingestion from diverse sources, Cubert for converting data formats, WhereHows for data lineage tracking, and Pinot for real-time analytics. Gobblin provides a framework for extracting, converting, validating data in parallel tasks. Cubert allows converting data between formats using a domain-specific language. WhereHows tracks lineage metadata to answer questions about where data came from and how it flows. Pinot is a real-time distributed OLAP store for interactive queries on fresh data using a SQL-like interface.
Timeline Service v.2 (Hadoop Summit 2016) – Sangjin Lee
This document summarizes the new YARN Timeline Service version 2, which was developed to address scalability, reliability, and usability challenges in version 1. Key highlights of version 2 include a distributed collector architecture for scalable and fault-tolerant writing of timeline data, an entity data model with first-class configuration and metrics support, and metrics aggregation capabilities. It stores data in HBase for scalability and provides a richer REST API for querying. Milestone goals include integration with more frameworks and production readiness.
Modern ETL Pipelines with Change Data Capture – Databricks
In this talk we’ll present how at GetYourGuide we’ve built a completely new ETL pipeline from scratch using Debezium, Kafka, Spark and Airflow, which can automatically handle schema changes. Our starting point was an error-prone legacy system that ran daily and was vulnerable to breaking schema changes, which caused many sleepless on-call nights. Like most companies, we also have traditional SQL databases that we need to connect to in order to extract relevant data.
This is usually done through either full or partial copies of the data with tools such as Sqoop. However, another approach that has become quite popular lately is to use Debezium as the Change Data Capture layer, which reads database binlogs and streams these changes directly to Kafka. Since having data only once a day is no longer enough for our business, and we wanted our pipelines to be resilient to upstream schema changes, we decided to rebuild our ETL around Debezium.
We’ll walk the audience through the steps we followed to architect and develop such a solution using Databricks to reduce operation time. By building this new pipeline we are now able to refresh our data lake multiple times a day, giving our users fresh data and protecting our nights of sleep.
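As a rough illustration of the pattern described above, here is a minimal, hedged Python sketch of consuming Debezium change events from a Kafka topic with the confluent-kafka client. The broker address, topic name, and field layout are assumptions for illustration only, not GetYourGuide's actual pipeline.

```python
import json
from confluent_kafka import Consumer

# Assumed broker address and Debezium topic name (server.database.table).
consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "etl-demo",
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["dbserver1.inventory.orders"])

while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    event = json.loads(msg.value())
    # Assuming the default JSON converter with schemas enabled, the Debezium
    # envelope carries the row state before/after the change plus an operation
    # code (c=create, u=update, d=delete), which downstream jobs can apply as
    # upserts/deletes.
    payload = event.get("payload", {})
    op, before, after = payload.get("op"), payload.get("before"), payload.get("after")
    print(op, before, after)
```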
Mohamed Amine Abdessemed – Real-time Data Integration with Apache Flink & Kafka – Flink Forward
This document summarizes a presentation about Bouygues Telecom's use of Apache Flink for real-time data integration and processing of mobile network event logs. Bouygues Telecom processes over 4 billion logs per day from their network equipment to calculate mobile quality of experience (QoE) indicators within 60 seconds for business intelligence, diagnostics and alerting. They were previously using Hadoop for batch processing but needed a real-time solution. After evaluating Apache Spark and Flink, they chose Flink for its true streaming capabilities, backpressure handling, and high performance on limited resources. Flink helped them process a day's worth of logs in under an hour from 10 Kafka partitions across 10 TaskManagers, each with only
Enabling Insight to Support World-Class Supercomputing (Stefan Ceballos, Oak ... – confluent
The Oak Ridge Leadership Facility (OLCF) in the National Center for Computational Sciences (NCCS) division at Oak Ridge National Laboratory (ORNL) houses world-class high-performance computing (HPC) resources and has a history of operating top-ranked supercomputers on the TOP500 list, including the world's current fastest, Summit, an IBM AC922 machine with a peak of 200 petaFLOPS. With the exascale era rapidly approaching, the need for a robust and scalable big data platform for operations data is more important than ever. In the past when a new HPC resource was added to the facility, pipelines from data sources spanned multiple data sinks which oftentimes resulted in data silos, slow operational data onboarding, and non-scalable data pipelines for batch processing. Using Apache Kafka as the message bus of the division's new big data platform has allowed for easier decoupling of scalable data pipelines, faster data onboarding, and stream processing with the goal to continuously improve insight into the HPC resources and their supporting systems. This talk will focus on the NCCS division's transition to Apache Kafka over the past few years to enhance the OLCF's current capabilities and prepare for Frontier, OLCF's future exascale system; including the development and deployment of a full big data platform in a Kubernetes environment from both a technical and cultural shift perspective. This talk will also cover the mission of the OLCF, the operational data insights related to high-performance computing that the organization strives for, and several use-cases that exist in production today.
Presto, an open source distributed SQL engine, is widely recognized for its low-latency queries, high concurrency, and native ability to query multiple data sources. Proven at scale in a variety of use cases at Facebook, Airbnb, Netflix, Uber, Twitter, Bloomberg, and FINRA, Presto experienced an unprecedented growth in popularity in both on-premises and cloud deployments in the last few years.
Inspired by the increasingly complex SQL queries run by the Presto user community, engineers at Facebook and Starburst have recently focused on cost-based query optimization. In this talk we will present the initial design and implementation of the CBO, support for connector-provided statistics, estimating selectivity, and choosing efficient query plans. Then, our detailed experimental evaluation will illustrate the performance gains for several classes of queries achieved thanks to the optimizer. Finally, we will discuss our future work enhancing the initial CBO and present the general Presto roadmap for 2018 and beyond.
Speakers
Kamil Bajda-Pawlikowski, Starburst Data, CTO & Co-Founder
Martin Traverso
Stream data processing is increasingly required to support business needs for faster actionable insight with growing volume of information from more sources. Apache Apex is a true stream processing framework for low-latency, high-throughput and reliable processing of complex analytics pipelines on clusters. Apex is designed for quick time-to-production, and is used in production by large companies for real-time and batch processing at scale.
This session will use an Apex production use case to walk through the incremental transition from a batch pipeline with hours of latency to an end-to-end streaming architecture with billions of events per day which are processed to deliver real-time analytical reports. The example is representative for many similar extract-transform-load (ETL) use cases with other data sets that can use a common library of building blocks. The transform (or analytics) piece of such pipelines varies in complexity and often involves business logic specific, custom components.
Topics include:
* Pipeline functionality from event source through queryable state for real-time insights.
* API for application development and development process.
* Library of building blocks including connectors for sources and sinks such as Kafka, JMS, Cassandra, HBase, JDBC and how they enable end-to-end exactly-once results.
* Stateful processing with event time windowing.
* Fault tolerance with exactly-once result semantics, checkpointing, incremental recovery.
* Scalability and low-latency, high-throughput processing with advanced engine features for auto-scaling, dynamic changes, compute locality.
* Who is using Apex in production, and roadmap.
Following the session attendees will have a high level understanding of Apex and how it can be applied to use cases at their own organizations.
LinkedIn's Logical Data Access Layer for Hadoop -- Strata London 2016 – Carl Steinbach
Hadoop at LinkedIn has grown significantly over time, from 1 cluster with 20 nodes in 2008 to over 10 clusters with over 10,000 nodes now. The number of users and workflows has also increased dramatically. While hardware scaling is difficult, scaling human infrastructure and managing dependencies between data producers, consumers, and infrastructure providers is even harder. The Dali system aims to abstract away physical data details and make data easier to access and manage through a dataset API, views, and lineage tracking. Views allow decoupling data APIs from the underlying datasets and enable safe evolution of these APIs through versioning. Contracts expressed as logical constraints on views provide clear, understandable, and modifiable agreements between producers and consumers. This approach has helped large projects
Apache Pinot Case Study: Building Distributed Analytics Systems Using Apache ... – HostedbyConfluent
The document describes Apache Pinot, an open source distributed real-time analytics platform used at LinkedIn. It discusses the challenges of building user-facing real-time analytics systems at scale. It initially describes LinkedIn's use of Apache Kafka for ingestion and Apache Pinot for queries, but notes challenges with Pinot's initial Kafka consumer group-based approach for real-time ingestion, such as incorrect results, limited scalability, and high storage overhead. It then presents Pinot's new partition-level consumption approach which addresses these issues by taking control of partition assignment and checkpointing, allowing for independent and flexible scaling of individual partitions across servers.
How to use Parquet as a Basis for ETL and Analytics – DataWorks Summit
Parquet is a columnar storage format that provides efficient compression and querying capabilities. It aims to store data efficiently for analysis while supporting interoperability across systems. Parquet uses column-oriented storage with efficient encodings and statistics to enable fast querying of large datasets. It integrates with many query engines and frameworks like Hive, Impala, Spark and MapReduce to allow projection and predicate pushdown for optimized queries.
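To make the projection and predicate pushdown point concrete, here is a small PySpark sketch; the file path and column names are made up for illustration. Only the columns referenced in the query are read from disk, and the filter can be evaluated against Parquet row-group statistics so non-matching data is skipped.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("parquet-pushdown-demo").getOrCreate()

# Hypothetical dataset: only the referenced columns (event_date, user_id, amount)
# are read from disk (projection), and the equality filter on event_date can be
# pushed down to the Parquet scan (predicate pushdown).
events = spark.read.parquet("/data/events.parquet")
daily_totals = (
    events
    .where(F.col("event_date") == "2016-06-01")
    .groupBy("user_id")
    .agg(F.sum("amount").alias("total_amount"))
)
daily_totals.explain()  # the physical plan lists PushedFilters on the Parquet scan
daily_totals.show()
```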
HBaseCon 2015: Industrial Internet Case Study using HBase and TSDB – HBaseCon
This document summarizes an industrial internet case study using HBase and TSDB for time series data storage and analytics. It describes an aviation use case involving jet engine sensor data collection and analysis to detect problems and reduce downtime. The system ingests large volumes of sensor data from aircraft into an industrial data lake architecture using HBase for storage and TSDB and SQL interfaces for analytics. Performance tests showed the horizontal data model of storing each flight parameter as a row performed better than a vertical model for retrieval from HBase.
Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac... – HostedbyConfluent
Apache Hudi is a data lake platform that provides streaming primitives (upserts/deletes/change streams) on top of data lake storage. Hudi powers very large data lakes at Uber, Robinhood and other companies, and comes pre-installed on four major cloud platforms.
Hudi supports exactly-once, near-real-time data ingestion from Apache Kafka to cloud storage, and is typically used in place of an S3/HDFS sink connector to gain transactions and mutability. While this approach is scalable and battle-tested, it can only ingest data in mini-batches, leading to lower data freshness. In this talk, we introduce a Kafka Connect Sink Connector for Apache Hudi, which writes data straight into Hudi's log format, making the data immediately queryable, while Hudi's table services such as indexing, compaction, and clustering work behind the scenes to further reorganize the data for better query performance.
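The Connect sink itself is configured rather than coded, but for a rough sense of Hudi's upsert primitive, below is a minimal PySpark sketch of writing upserts into a Hudi table via the Spark datasource path (not the Kafka Connect sink the talk introduces). The table name, key fields, sample records, and path are illustrative assumptions, and the Hudi Spark bundle is assumed to be on the classpath.

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("hudi-upsert-demo")
    # Kryo serialization is commonly recommended for Hudi workloads.
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .getOrCreate()
)

# Toy change records; in the talk's setup these would arrive from Kafka instead.
updates = spark.createDataFrame(
    [("order-1", "2021-06-01 10:00:00", 42.0, "2021-06-01"),
     ("order-2", "2021-06-01 10:05:00", 7.5, "2021-06-01")],
    ["order_id", "ts", "amount", "dt"],
)

hudi_options = {
    "hoodie.table.name": "orders",
    "hoodie.datasource.write.recordkey.field": "order_id",
    "hoodie.datasource.write.partitionpath.field": "dt",
    "hoodie.datasource.write.precombine.field": "ts",
    "hoodie.datasource.write.operation": "upsert",
}

# Records whose key already exists are updated in place; new keys are inserted.
updates.write.format("hudi").options(**hudi_options).mode("append").save("/tmp/hudi/orders")
```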
Suneel Marthi – BigPetStore Flink: A Comprehensive Blueprint for Apache Flink – Flink Forward
Suneel Marthi gave a talk about BigPetStore, a blueprint for Apache Flink applications that uses synthetic data generators. BigPetStore includes data generators, examples using tools like MapReduce, Spark and Flink to process the generated data, and tests for integration. It is used for templates, education, testing, demos and benchmarking. The talk outlined the history and components of BigPetStore and described upcoming work to expand it for Flink, including batch and table API examples and machine learning algorithms.
Tapad's data pipeline is an elastic combination of technologies (Kafka, Hadoop, Avro, Scalding) that forms a reliable system for analytics, real-time and batch graph-building, and logging. In this talk, I will speak about the creation and evolution of the pipeline, and a concrete example – a day in the life of an event tracking pixel. We'll also talk about common challenges that we've overcome, such as integrating different pieces of the system, schema evolution, queuing, and data retention policies.
Streaming all over the world: Real life use cases with Kafka Streams – confluent
This document discusses using Apache Kafka Streams for stream processing. It begins with an overview of Apache Kafka and Kafka Streams. It then presents several real-life use cases that have been implemented with Kafka Streams, including data conversions from XML to Avro, stream-table joins for event propagation, duplicate elimination, and detecting absence of events. The document concludes with recommendations for developing and operating Kafka Streams applications.
Marc Schwering – Using Flink with MongoDB to enhance relevancy in personaliza... – Flink Forward
This document discusses using Apache Flink for personalization analytics with MongoDB data. It describes the personalization process, evolving user profiles over time, and benefits of separating data into services. Flink allows iterative clustering algorithms like K-means to run efficiently on streaming data. The document recommends starting small, focusing on a proof of concept, and exploring Flink's capabilities for aggregation, connectors, and extending functionality for new use cases.
Streaming data in the cloud with Confluent and MongoDB Atlas | Robert Waters,... – HostedbyConfluent
This document discusses streaming data between Confluent Cloud and MongoDB Atlas. It provides an overview of MongoDB Atlas and its fully managed database capabilities in the cloud. It then demonstrates how to stream data from a Python generator application to MongoDB Atlas using Confluent Cloud and its connectors. The document promotes using MongoDB Atlas as a turnkey database as a service solution and shows how it can be integrated with Confluent Cloud for streaming data workflows.
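As a hedged sketch of the "Python generator application" half of such a demo, the snippet below produces JSON events to a Confluent Cloud topic with the confluent-kafka client. The endpoint, credentials, topic name, and event shape are placeholders, and the MongoDB Atlas sink connector would be configured separately on the Confluent side.

```python
import json
import random
import time
from confluent_kafka import Producer

# Placeholder Confluent Cloud bootstrap endpoint and API key/secret.
producer = Producer({
    "bootstrap.servers": "<cluster>.confluent.cloud:9092",
    "security.protocol": "SASL_SSL",
    "sasl.mechanisms": "PLAIN",
    "sasl.username": "<api-key>",
    "sasl.password": "<api-secret>",
})

def generate_reading():
    """Fake sensor reading; stands in for whatever the demo generator emits."""
    return {
        "device_id": random.randint(1, 10),
        "temperature": round(random.uniform(15, 30), 2),
        "ts": time.time(),
    }

for _ in range(100):
    producer.produce("atlas-demo-topic", value=json.dumps(generate_reading()))
    producer.poll(0)   # serve delivery callbacks
    time.sleep(0.1)
producer.flush()        # block until all messages are delivered
```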
Symantec: Cassandra Data Modelling techniques in action – DataStax Academy
Our product presents an aggregated view of metadata collected for billions of objects (files, emails, SharePoint objects, etc.). We used Cassandra to store those billions of objects along with an aggregated view of that metadata. Customers can analyse the corpus of data in real time by searching in a completely flexible way, i.e. they can get summary aggregates for many billions of objects and then drill down further to items by filtering on various facets of the metadata. We achieve this using a combination of Cassandra and Elasticsearch. This presentation will talk about various data modelling techniques we use to aggregate and then further summarise all that metadata, and to search the summary in real time.
Embracing Database Diversity with Kafka and Debezium – Frank Lyaruu
There was a time not long ago when we used relational databases for everything. Even if the data wasn’t particularly relational, we shoehorned it into relational tables, often because that was the only database we had. Thank god these dark times are over, and now we have many different kinds of NoSQL databases: document, real-time, graph, column. But that does not solve the problem that the same data might be a graph from one perspective and a collection of documents from another.
It would be really nice if we could access that same data in many different ways, depending on the context of what we want to achieve in our current task.
As software architects this is not easy to solve, but it is definitely possible: we can design an architecture using event sourcing. Capture the data with Debezium, post it to a Kafka topic, use Kafka Streams to model the data the way we like, and store the data in various different data stores, so we can synchronize data between them.
Bringing Streaming Data To The Masses: Lowering The “Cost Of Admission” For Y... – confluent
(Bob Lehmann, Bayer) Kafka Summit SF 2018
You’ve built your streaming data platform. The early adopters are “all in” and have developed producers, consumers and stream processing apps for a number of use cases. A large percentage of the enterprise, however, has expressed interest but hasn’t made the leap. Why?
In 2014, Bayer Crop Science (formerly Monsanto) adopted a cloud-first strategy and started a multi-year transition to the cloud. A Kafka-based cross-datacenter DataHub was created to facilitate this migration and to drive the shift to real-time stream processing. The DataHub has seen strong enterprise adoption and supports a myriad of use cases. Data is ingested from a wide variety of sources and can move effortlessly between an on-premises datacenter, AWS and Google Cloud. The DataHub has evolved continuously over time to meet the current and anticipated needs of our internal customers. The “cost of admission” for the platform has been lowered dramatically over time via our DataHub Portal and technologies such as Kafka Connect, Kubernetes and Presto. Most operations are now self-service, onboarding of new data sources is relatively painless, and stream processing via KSQL and other technologies is being incorporated into the core DataHub platform.
In this talk, Bob Lehmann will describe the origins and evolution of the Enterprise DataHub with an emphasis on steps that were taken to drive user adoption. Bob will also talk about integrations between the DataHub and other key data platforms at Bayer, lessons learned and the future direction for streaming data and stream processing at Bayer.
Data Ingestion, Extraction & Parsing on Hadoop – skaluska
The document discusses options for ingesting, extracting, parsing, and transforming data on Hadoop using Informatica products. It outlines Informatica's current capabilities for data integration with Hadoop and its roadmap to enhance capabilities for processing data directly on Hadoop in the first half of 2012. This will allow users to design data processing flows visually and execute them on Hadoop for optimized performance.
High Speed Continuous & Reliable Data Ingest into Hadoop – DataWorks Summit
This talk will explore the area of real-time data ingest into Hadoop and present the architectural trade-offs, as well as demonstrate alternative implementations that strike the appropriate balance across the following common challenges:
* Decentralized writes (multiple data centers and collectors)
* Continuous Availability, High Reliability
* No loss of data
* Elasticity of introducing more writers
* Bursts in Speed per syslog emitter
* Continuous, real-time collection
* Flexible Write Targets (local FS, HDFS etc.)
Vinod Nayal presented on options for ingesting data into Hadoop, including batch loading from relational databases using Sqoop or vendor-specific tools. Data from files can be FTP'd to edge nodes and loaded using ETL tools like Informatica or Talend. Real-time data can be ingested using Flume for transport with light enrichment or Storm with Kafka for a queue to enable low-latency continuous ingestion with more in-flight processing. The choice between Flume and Storm depends on the amount of required in-flight processing.
This document discusses data collection and ingestion tools. It begins with an overview of data collection versus ingestion, with collection happening at the source and ingestion receiving the data. Examples of data collection tools include rsyslog, Scribe, Flume, Logstash, Heka, and Fluentd. Examples of ingestion tools include RabbitMQ, Kafka, and Fluentd. The document concludes with a case study of asynchronous application logging and challenges to consider.
This document provides an overview of Flume and Spark Streaming. It describes how Flume is used to reliably ingest streaming data into Hadoop using an agent-based architecture. Events are collected by sources, stored reliably in channels, and sent to sinks. The Flume connector allows ingested data to be processed in real-time using Spark Streaming's micro-batch architecture, where streams of data are processed through RDD transformations. This combined Flume + Spark Streaming approach provides a scalable and fault-tolerant way to reliably ingest and process streaming data.
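A minimal PySpark sketch of the Flume + Spark Streaming combination described above, assuming Spark 2.x with the spark-streaming-flume package on the classpath; the host, port, and word-count transformation are illustrative only.

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.flume import FlumeUtils  # bundled with Spark 1.x/2.x, removed in 3.x

sc = SparkContext(appName="flume-spark-streaming-demo")
ssc = StreamingContext(sc, batchDuration=10)  # 10-second micro-batches

# A Flume Avro sink pointed at this host/port pushes events into the stream;
# each DStream element is a (headers, body) pair.
stream = FlumeUtils.createStream(ssc, "localhost", 4545)
counts = (
    stream.map(lambda event: event[1])          # keep the event body
          .flatMap(lambda body: body.split())
          .map(lambda word: (word, 1))
          .reduceByKey(lambda a, b: a + b)
)
counts.pprint()

ssc.start()
ssc.awaitTermination()
```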
Gobblin is a unified data ingestion framework developed by LinkedIn to ingest large volumes of data from diverse sources into Hadoop. It provides a scalable and fault-tolerant workflow that extracts data, applies transformations, checks for quality, and writes outputs. Gobblin addresses challenges of operating multiple heterogeneous data pipelines by standardizing various ingestion tasks and metadata handling through its pluggable components.
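To make that task structure concrete, here is a toy Python sketch of the extract, convert, quality-check, and write flow the summary describes. The class and method names are illustrative only and are not Gobblin's actual Java API; this just shows the shape of the pluggable-components pattern.

```python
from dataclasses import dataclass
from typing import Iterator

@dataclass
class Record:
    payload: dict

class Extractor:
    """Pulls raw records from a source; roughly one work unit's extractor."""
    def read(self) -> Iterator[Record]:
        yield Record({"user": "alice", "clicks": "3"})
        yield Record({"user": "bob", "clicks": "oops"})

class Converter:
    """Transforms records, e.g. casting types or mapping to a target schema."""
    def convert(self, record: Record) -> Record:
        rec = dict(record.payload)
        rec["clicks"] = int(rec["clicks"]) if str(rec["clicks"]).isdigit() else None
        return Record(rec)

class QualityChecker:
    """Row-level policy: drop records that fail validation."""
    def passes(self, record: Record) -> bool:
        return record.payload["clicks"] is not None

class Writer:
    """Writes accepted records to the destination (here, just an in-memory list)."""
    def __init__(self):
        self.out = []
    def write(self, record: Record) -> None:
        self.out.append(record.payload)

def run_task(extractor, converter, checker, writer) -> None:
    # One ingestion task: extract -> convert -> quality-check -> write.
    for raw in extractor.read():
        converted = converter.convert(raw)
        if checker.passes(converted):
            writer.write(converted)

writer = Writer()
run_task(Extractor(), Converter(), QualityChecker(), writer)
print(writer.out)  # [{'user': 'alice', 'clicks': 3}]
```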
Intel IT empowers business units to easily make rapid, impactful business decisions. Ingesting a variety of internal/external data sources has challenges. This slideset covers how Intel IT overcame the issues with Hadoop and Gobblin. Learn more at http://www.intel.com/itcenter
Meson: Building a Machine Learning Orchestration Framework on Mesos – Antony Arokiasamy
This document discusses building a machine learning orchestration framework on Mesos. It describes challenges with machine learning pipelines like heterogeneous environments and separate orchestration/execution. It then introduces Meson, a general purpose workflow orchestration system that delegates execution to resource managers like Mesos. Meson is optimized for machine learning pipelines and visualization. The document outlines how Mesos is used for executors, custom executors, executor caching, executor cleanup, framework messages, multi-tenancy and cluster management.
Meson: Heterogeneous Workflows with Spark at Netflix – Antony Arokiasamy
Netflix uses Meson, an in-house workflow system, to orchestrate and schedule machine learning pipelines using Spark. Meson allows for heterogeneous environments, multi-tenancy, and optimized workflows for machine learning tasks like parameter sweeping with over 30,000 docker containers. It delegates execution to resource managers like Mesos and supports standard and custom steps, parameter passing, structured constructs, and two-way communication between steps.
The document discusses the Internet of Things ecosystem and how to unlock business value from connected devices. It defines IoT and provides projections on growth. It outlines the complex IoT ecosystem and stakeholders involved. It presents a business value framework focused on financial metrics, operating metrics, and relationships. Common value drivers of cost reduction and risk management are discussed. Strategies to unlock more value through revenue generation and innovation are suggested, including focusing on product/customer lifecycles. Overcoming security and privacy challenges is also addressed.
This document discusses building a distributed data ingestion system using RabbitMQ. It begins with an introduction to RabbitMQ and its key features like being multi-protocol, open source, polyglot and written in Erlang. It then discusses the problem of distributing data across multiple servers. RabbitMQ federation is proposed as a solution, allowing replication across servers. Federation uses queues and exchanges to replicate data and can be configured using parameters and policies. The document also discusses scaling the system using sharded queues and federated queues to load balance consumers.
Cloudera Morphlines is a new open source framework, recently added to the CDK, that reduces the time and skills necessary to integrate, build, and change Hadoop processing applications that extract, transform, and load data into Apache Solr, Apache HBase, HDFS, enterprise data warehouses, or analytic online dashboards.
Some notes about Spark Streaming's positioning given the current players: Beam, Flink, Storm et al. Helpful if you have to choose a streaming engine for your project.
Paris Spark Meetup Oct 26, 2015 - Spark After Dark v1.5 - Best of Advanced Ap... – Chris Fregly
* Title *
Spark After Dark 1.5: Deep Dive Into Latest Perf and Scale Improvements in Spark Ecosystem
* Abstract *
Combining the most popular and technically-deep material from his wildly popular Advanced Apache Spark Meetup, Chris Fregly will provide code-level deep dives into the latest performance and scalability advancements within the Apache Spark Ecosystem by exploring the following:
1) Building a Scalable and Performant Spark SQL/DataFrames Data Source Connector such as Spark-CSV, Spark-Cassandra, Spark-ElasticSearch, and Spark-Redshift
2) Speeding Up Spark SQL Queries using Partition Pruning and Predicate Pushdowns with CSV, JSON, Parquet, Avro, and ORC
3) Tuning Spark Streaming Performance and Fault Tolerance with KafkaRDD and KinesisRDD
4) Maintaining Stability during High Scale Streaming Ingestion using Approximations and Probabilistic Data Structures from Spark, Redis, and Twitter's Algebird
5) Building Effective Machine Learning Models using Feature Engineering, Dimension Reduction, and Natural Language Processing with MLlib/GraphX, ML Pipelines, DIMSUM, Locality Sensitive Hashing, and Stanford's CoreNLP
6) Tuning Core Spark Performance by Acknowledging Mechanical Sympathy for the Physical Limitations of OS and Hardware Resources such as CPU, Memory, Network, and Disk with Project Tungsten, Asynchronous Netty, and Linux epoll
* Demos *
This talk features many interesting and audience-interactive demos - as well as code-level deep dives into many of the projects listed above.
All demo code is available on Github at the following link: https://github.com/fluxcapacitor/pipeline/wiki
In addition, the entire demo environment has been Dockerized and made available for download on Docker Hub at the following link: https://hub.docker.com/r/fluxcapacitor/pipeline/
* Speaker Bio *
Chris Fregly is a Principal Data Solutions Engineer for the newly-formed IBM Spark Technology Center, an Apache Spark Contributor, a Netflix Open Source Committer, as well as the Organizer of the global Advanced Apache Spark Meetup and Author of the Upcoming Book, Advanced Spark.
Previously, Chris was a Data Solutions Engineer at Databricks and a Streaming Data Engineer at Netflix.
When Chris isn’t contributing to Spark and other open source projects, he’s creating book chapters, slides, and demos to share knowledge with his peers at meetups and conferences throughout the world.
Spark Streaming allows processing of live data streams using Spark. It works by receiving data streams, chopping them into batches, and processing the batches using Spark. This presentation covered Spark Streaming concepts like the lifecycle of a streaming application, best practices for aggregations, operationalization through checkpointing, and achieving high throughput. It also discussed debugging streaming jobs and the benefits of combining streaming with batch, machine learning, and SQL processing.
Workshop - How to Build a Recommendation Engine using Spark 1.6 and HDP
Hands-on - Build a data analytics application using Spark, Hortonworks, and Zeppelin. The session explains RDD concepts, DataFrames and sqlContext, shows how to use Spark SQL for working with DataFrames, and explores the graphical abilities of Zeppelin.
b) Follow along - Build a Recommendation Engine - This will show how to build a predictive analytics (MLlib) recommendation engine with scoring. This will give a better understanding of architecture and coding in Spark for ML.
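For readers who want a feel for the MLlib piece, below is a small, hedged PySpark sketch of training an ALS recommender. The ratings file path and column names are assumptions rather than the workshop's actual dataset, and the sketch assumes a recent Spark 2.x+; the workshop targets Spark 1.6, where the API differs somewhat.

```python
from pyspark.sql import SparkSession
from pyspark.ml.recommendation import ALS

spark = SparkSession.builder.appName("als-recommender-demo").getOrCreate()

# Hypothetical ratings file with userId,movieId,rating columns.
ratings = (
    spark.read.option("header", "true").option("inferSchema", "true")
         .csv("/data/ratings.csv")
)

als = ALS(
    userCol="userId", itemCol="movieId", ratingCol="rating",
    rank=10, maxIter=10, regParam=0.1,
    coldStartStrategy="drop",   # avoid NaN predictions for unseen users/items
)
model = als.fit(ratings)

# Score: top-5 item recommendations per user.
model.recommendForAllUsers(5).show(truncate=False)
```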
This talk was prepared for the November 2013 DataPhilly Meetup: Data in Practice ( http://www.meetup.com/DataPhilly/events/149515412/ )
Map Reduce: Beyond Word Count by Jeff Patti
Have you ever wondered what map reduce can be used for beyond the word count example you see in all the introductory articles about map reduce? Using Python and mrjob, this talk will cover a few simple map reduce algorithms that in part power Monetate's information pipeline.
Bio: Jeff Patti is a backend engineer at Monetate with a passion for algorithms, big data, and long walks on the beach. Prior to working at Monetate he performed software R&D for Lockheed Martin, where he worked on projects ranging from social network analysis to robotics.
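As one concrete example of "beyond word count", here is a hedged mrjob sketch that computes the per-key average of a numeric field. The tab-separated input format is an assumption for illustration and is not Monetate's actual pipeline.

```python
from mrjob.job import MRJob

class MRAveragePerKey(MRJob):
    """Average a numeric value per key, e.g. average order value per product."""

    def mapper(self, _, line):
        # Assumed input: "<key>\t<numeric value>" per line.
        key, value = line.split("\t")
        yield key, (float(value), 1)

    def combiner(self, key, pairs):
        # Pre-aggregate (sum, count) locally to cut shuffle volume.
        total, count = 0.0, 0
        for value_sum, n in pairs:
            total += value_sum
            count += n
        yield key, (total, count)

    def reducer(self, key, pairs):
        total, count = 0.0, 0
        for value_sum, n in pairs:
            total += value_sum
            count += n
        yield key, total / count

if __name__ == "__main__":
    MRAveragePerKey.run()
```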
Recipes for Running Spark Streaming Applications in Production (Tathagata Das... – Spark Summit
This document summarizes key aspects of running Spark Streaming applications in production, including fault tolerance, performance, and monitoring. It discusses how Spark Streaming receives data streams in batches and processes them across executors. It describes how driver and executor failures can be handled through checkpointing saved DAG information and write ahead logs that replicate received data blocks. Restarting the driver from checkpoints allows recovering the application state.
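A minimal PySpark sketch of the driver-recovery pattern described above, assuming an HDFS checkpoint directory and a receiver-based source; the directory path and processing logic are placeholders.

```python
from pyspark import SparkConf, SparkContext
from pyspark.streaming import StreamingContext

CHECKPOINT_DIR = "hdfs:///checkpoints/streaming-demo"  # placeholder path

def create_context():
    """Builds the full streaming DAG; only called when no checkpoint exists."""
    conf = (
        SparkConf()
        .setAppName("recoverable-streaming-demo")
        # Write received blocks to a write-ahead log so receiver data survives failures.
        .set("spark.streaming.receiver.writeAheadLog.enable", "true")
    )
    sc = SparkContext(conf=conf)
    ssc = StreamingContext(sc, batchDuration=10)
    ssc.checkpoint(CHECKPOINT_DIR)                     # saves DAG metadata for driver restart

    lines = ssc.socketTextStream("localhost", 9999)    # placeholder receiver source
    lines.count().pprint()
    return ssc

# On a clean start this calls create_context(); after a driver crash it
# reconstructs the context (and pending batches) from the checkpoint instead.
ssc = StreamingContext.getOrCreate(CHECKPOINT_DIR, create_context)
ssc.start()
ssc.awaitTermination()
```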
Scala - The Simple Parts, SFScala presentation – Martin Odersky
These are the slides of the talk I gave on May 22, 2014 to the San Francisco Scala user group. Similar talks were given before at GOTO Chicago, keynote, at Gilt Groupe and Hunter College in New York, at JAX Mainz and at FlatMap Oslo.
How Kafka and Modern Databases Benefit Apps and Analytics – SingleStore
This document provides an overview of how Kafka and modern databases like MemSQL can benefit applications and analytics. It discusses how businesses now require faster data access and intra-day processing to drive real-time decisions. Traditional database solutions struggle to meet these demands. MemSQL is presented as a solution that provides scalable SQL, fast ingestion of streaming data, and high concurrency to enable both transactions and analytics on large datasets. The document demonstrates how MemSQL distributes data and queries across nodes and allows horizontal scaling through its architecture.
Slides presented during the Strata SF 2019 conference. Explaining how Lyft is building a multi-cluster solution for running Apache Spark on kubernetes at scale to support diverse workloads and overcome challenges.
YugaByte DB is a transactional database that provides SQL and NoSQL interfaces in a single platform. It was created to address the complexity of building applications using separate SQL and NoSQL databases. YugaByte DB integrates with PKS to enable deployment on Kubernetes clusters. The presentation provides an overview of YugaByte DB's architecture and capabilities, demonstrates its integration with PKS, and discusses several real-world use cases.
Cloud-Native Patterns for Data-Intensive Applications – VMware Tanzu
Are you interested in learning how to schedule batch jobs in container runtimes?
Maybe you’re wondering how to apply continuous delivery in practice for data-intensive applications? Perhaps you’re looking for an orchestration tool for data pipelines?
Questions like these are common, so rest assured that you’re not alone.
In this webinar, we’ll cover the recent feature improvements in Spring Cloud Data Flow. More specifically, we’ll discuss data processing use cases and how they simplify the overall orchestration experience in cloud runtimes like Cloud Foundry and Kubernetes.
Please join us and be part of the community discussion!
Presenters:
Sabby Anandan, Product Manager
Mark Pollack, Software Engineer, Pivotal
GPU-Accelerating UDFs in PySpark with Numba and PyGDFKeith Kraus
This document discusses accelerating Python user-defined functions (UDFs) in PySpark using Numba and PyGDF. It describes how data movement between the JVM and Python workers is currently a bottleneck for PySpark Python UDFs. With Apache Arrow, data can be transferred in a columnar format without serialization, improving performance. PyGDF enables defining UDFs that operate directly on GPU data frames using Numba for further acceleration. This allows leveraging GPUs to optimize complex UDFs in PySpark. Future work includes optimizing joins in PyGDF and supporting distributed GPU processing.
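For flavor, here is a small sketch of the general idea using Spark's Arrow-backed pandas UDFs with a Numba-compiled kernel rather than PyGDF itself (which keeps data on the GPU). It assumes Spark 3.x; the column names and the haversine computation are made up for the example.

```python
# Hedged sketch: an Arrow-backed (vectorized) pandas UDF whose inner loop is
# compiled with Numba. Illustrates the "avoid per-row serialization" idea;
# PyGDF/cuDF would go further by operating on GPU dataframes directly.
import numpy as np
import pandas as pd
from numba import njit
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf

spark = (SparkSession.builder
         .appName("numba-udf-demo")
         .config("spark.sql.execution.arrow.pyspark.enabled", "true")
         .getOrCreate())


@njit(cache=True)
def _haversine_kernel(lat1, lon1, lat2, lon2):
    # Compiled elementwise great-circle distance (km) over NumPy arrays.
    r = 6371.0
    deg2rad = np.pi / 180.0
    p1, p2 = lat1 * deg2rad, lat2 * deg2rad
    dp = (lat2 - lat1) * deg2rad
    dl = (lon2 - lon1) * deg2rad
    a = np.sin(dp / 2) ** 2 + np.cos(p1) * np.cos(p2) * np.sin(dl / 2) ** 2
    return 2 * r * np.arcsin(np.sqrt(a))


@pandas_udf("double")
def haversine(lat1: pd.Series, lon1: pd.Series,
              lat2: pd.Series, lon2: pd.Series) -> pd.Series:
    # Whole columns arrive as pandas Series via Arrow (no per-row pickling).
    out = _haversine_kernel(lat1.to_numpy(), lon1.to_numpy(),
                            lat2.to_numpy(), lon2.to_numpy())
    return pd.Series(out)


df = spark.createDataFrame(
    [(37.77, -122.42, 40.71, -74.01)], ["lat1", "lon1", "lat2", "lon2"])
df.withColumn("dist_km", haversine("lat1", "lon1", "lat2", "lon2")).show()
```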
CodeCamp Iasi - Creating serverless data analytics system on GCP using BigQueryMárton Kodok
Teaser: provide developers with a new way of understanding advanced analytics and choosing the right cloud architecture.
The new buzzword is #serverless, as there are many great services that help us abstract away the complexity associated with managing servers. In this session we will see how serverless helps on large data analytics backends.
We will see how to architect for the cloud and add to an existing project the components that take us to a #serverless architecture, one that ingests our streaming data and runs advanced analytics on petabytes of data using BigQuery on Google Cloud Platform - all this next to an existing stack, without being forced to re-engineer our app.
BigQuery enables super-fast SQL queries against petabytes of data using the processing power of Google's infrastructure. We will cover its core features, the SQL 2011 standard, working with streaming inserts (a minimal streaming-insert sketch follows below), User Defined Functions written in JavaScript, referencing external JS libraries, and several use cases for the everyday backend developer: funnel analytics, email heatmaps, custom data processing, building dashboards, extracting data using JS functions, and emitting rows based on business logic.
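For a concrete flavor of the streaming-insert path mentioned above (a minimal sketch, not code from the talk; the project, dataset, table, and row fields are placeholders), the official Python client can append rows to a BigQuery table like this:

```python
# Hedged sketch: streaming rows into BigQuery with the official Python client.
# Project, dataset, table, and row fields are placeholders.
from google.cloud import bigquery

client = bigquery.Client()  # uses application-default credentials
table_id = "my-project.analytics.events"  # assumed table

rows = [
    {"user_id": "u123", "event": "page_view", "ts": "2015-09-01T12:00:00Z"},
    {"user_id": "u456", "event": "purchase", "ts": "2015-09-01T12:00:05Z"},
]

errors = client.insert_rows_json(table_id, rows)  # streaming insert
if errors:
    raise RuntimeError(f"Streaming insert failed: {errors}")
print(f"Inserted {len(rows)} rows into {table_id}")
```

Querying the same table afterwards works through `client.query("SELECT ...")`, which is where the funnel-analytics and dashboard use cases come in.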
SF Big Analytics_20190612: Scaling Apache Spark on Kubernetes at LyftChester Chen
Talk 1. Scaling Apache Spark on Kubernetes at Lyft
Lyft invests heavily in open source infrastructure and tooling as part of its mission. At Lyft, Kubernetes has emerged as the next generation of cloud-native infrastructure to support a wide variety of distributed workloads, and Apache Spark has evolved to serve both machine learning and large-scale ETL workloads. By combining the flexibility of Kubernetes with the data processing power of Apache Spark, Lyft is able to take ETL data processing to a different level. In this talk, we will cover the challenges the Lyft team faced and the solutions they developed to support Apache Spark on Kubernetes in production and at scale. Topics include:
- Key traits of Apache Spark on Kubernetes
- A deep dive into Lyft's multi-cluster setup and operations for handling petabytes of production data
- How Lyft extends and enhances Apache Spark to support capabilities such as Spark pod lifecycle metrics and state management, resource prioritization, and queuing and throttling
- Dynamic job scale estimation and runtime dynamic job configuration
- How Lyft powers internal Data Scientists, Business Analysts, and Data Engineers via a multi-cluster setup
Speaker: Li Gao
Li Gao is the tech lead of the cloud-native Spark compute initiative at Lyft. Prior to Lyft, Li held technical leadership positions at Salesforce, Fitbit, Marin Software, and a few startups, working on cloud-native and hybrid-cloud data platforms at scale. Besides Spark, Li has scaled and productionized other open source projects, such as Presto, Apache HBase, Apache Phoenix, Apache Kafka, Apache Airflow, Apache Hive, and Apache Cassandra.
1. The document discusses Project Geode, an open source distributed in-memory database for big data applications. It provides scale-out performance, consistent operations across nodes, high availability, powerful developer features, and easy administration of distributed nodes.
2. The document outlines Geode's architecture and roadmap. It also discusses why the project is being open sourced under Apache and describes some key use cases and customers of Geode.
3. The presentation includes a demo of Geode's capabilities including partitioning, queries, indexing, colocation, and transactions.
Strata Singapore 2017 business use case section
"Big Telco Real-Time Network Analytics"
https://conferences.oreilly.com/strata/strata-sg/public/schedule/detail/62797
Managing Apache Spark Workload and Automatic OptimizingDatabricks
eBay relies heavily on Spark as one of its most significant data engines. In the data warehouse domain, millions of batch queries run every day against 6000+ key DW tables, which hold over 22 PB of compressed data and keep growing every year. In the machine learning domain, Spark is playing an increasingly significant role. We presented our migration from an MPP database to Apache Spark at last year's Summit in Europe. Looking across the entire infrastructure, however, managing workload and efficiency for all Spark jobs in our data center remains a big challenge. Our team owns the big data platform infrastructure and the management tools on top of it, helping our customers -- not only DW engineers and data scientists, but also AI engineers -- to work from the same page. In this session, we will introduce how all of them benefit from a self-service workload management portal/system. First, we will share the basic architecture of this system to illustrate how it collects metrics from multiple data centers and how it detects abnormal workloads in real time; we developed a component called Profiler that extends Spark core to support customized metric collection. Next, we will walk through real user stories at eBay showing how the self-service system reduces effort on both the customer side and the infra-team side - the highlight being Spark job analysis and diagnosis. Finally, we will introduce upcoming features that move from alerting toward an automatic optimization workflow.
Speaker: Lantao Jin
At Opsio, we specialize in delivering advanced cloud services that enable businesses to scale, transform, and modernize with confidence. Our core offerings focus on cloud management, digital transformation, and cloud modernization, all designed to help organizations unlock the full potential of their technology infrastructure. We take a client-first approach, blending industry-leading hosted technologies with strategic expertise to create tailored, future-ready solutions. Leveraging AI, automation, and emerging technologies, our services simplify IT operations, enhance agility, and accelerate business outcomes. Whether you're migrating to the cloud or optimizing existing cloud environments, Opsio is your partner in achieving sustainable, measurable success.
apidays New York 2025 - Two tales of API Change Management by Eric Koleda (Coda)apidays
Two tales of API Change Management from my time at Google
Eric Koleda, Developer Advocate at Coda
apidays New York 2025
API Management for Surfing the Next Innovation Waves: GenAI and Open Banking
Convene 360 Madison, New York
May 14 & 15, 2025
------
Check out our conferences at https://www.apidays.global/
Do you want to sponsor or talk at one of our conferences?
https://apidays.typeform.com/to/ILJeAaV8
Learn more on APIscene, the global media made by the community for the community:
https://www.apiscene.io
Explore the API ecosystem with the API Landscape:
https://apilandscape.apiscene.io/
SAP Extended Warehouse Management (EWM) is part of SAP S/4HANA, offering advanced warehouse and logistics capabilities. It enables efficient handling of goods movement, storage, and inventory in real time.
apidays Singapore 2025 - Building Finance Innovation Ecosystems by Umang Moon...apidays
Building Finance Innovation Ecosystems
Umang Moondra, CEO at APIX
apidays Singapore 2025
Where APIs Meet AI: Building Tomorrow's Intelligent Ecosystems
April 15 & 16, 2025
------
Check out our conferences at https://www.apidays.global/
Do you want to sponsor or talk at one of our conferences?
https://apidays.typeform.com/to/ILJeAaV8
Learn more on APIscene, the global media made by the community for the community:
https://www.apiscene.io
Explore the API ecosystem with the API Landscape:
https://apilandscape.apiscene.io/
apidays New York 2025 - Lessons From Two Technical Transformations by Leah Hu...apidays
You Can't Outrun Complexity - But You Can Orchestrate It: Lessons From Two Technical Transformations
Leah Hurwich Adler, Senior Staff Product Manager at Apollo GraphQL
apidays New York 2025
API Management for Surfing the Next Innovation Waves: GenAI and Open Banking
May 14 & 15, 2025
------
Check out our conferences at https://www.apidays.global/
Do you want to sponsor or talk at one of our conferences?
https://apidays.typeform.com/to/ILJeAaV8
Learn more on APIscene, the global media made by the community for the community:
https://www.apiscene.io
Explore the API ecosystem with the API Landscape:
https://apilandscape.apiscene.io/
apidays New York 2025 - The Challenge is Not the Pattern, But the Best Integr...apidays
The Challenge is Not the Pattern, But the Best Integration
Yisrael Gross, CEO at Ammune.ai
apidays New York 2025
API Management for Surfing the Next Innovation Waves: GenAI and Open Banking
May 14 & 15, 2025
------
Check out our conferences at https://www.apidays.global/
Do you want to sponsor or talk at one of our conferences?
https://apidays.typeform.com/to/ILJeAaV8
Learn more on APIscene, the global media made by the community for the community:
https://www.apiscene.io
Explore the API ecosystem with the API Landscape:
https://apilandscape.apiscene.io/
AG-FIRMA FINCOME ARTICLE AI AGENT RAG.pdfAnass Nabil
AI chatbot: design of a multilingual AI assistant to optimize agricultural practices in Morocco, including delivery-service status checking.
Architecture: mobile app + orchestrator LLM + expert agents (RAG, weather, sensors).
apidays New York 2025 - Building Agentic Workflows with FDC3 Intents by Nick ...apidays
Building Agentic Workflows with FDC3 Intents
Nick Kolba, Co-founder & CEO at Connectifi
apidays New York 2025
API Management for Surfing the Next Innovation Waves: GenAI and Open Banking
Convene 360 Madison, New York
May 14 & 15, 2025
------
Check out our conferences at https://www.apidays.global/
Do you want to sponsor or talk at one of our conferences?
https://apidays.typeform.com/to/ILJeAaV8
Learn more on APIscene, the global media made by the community for the community:
https://www.apiscene.io
Explore the API ecosystem with the API Landscape:
https://apilandscape.apiscene.io/
apidays New York 2025 - Why an SDK is Needed to Protect APIs from Mobile Apps...apidays
Why an SDK is Needed to Protect APIs from Mobile Apps
Pearce Erensel, Global VP of Sales at Approov Mobile Security
apidays New York 2025
API Management for Surfing the Next Innovation Waves: GenAI and Open Banking
May 14 & 15, 2025
------
Check out our conferences at https://www.apidays.global/
Do you want to sponsor or talk at one of our conferences?
https://apidays.typeform.com/to/ILJeAaV8
Learn more on APIscene, the global media made by the community for the community:
https://www.apiscene.io
Explore the API ecosystem with the API Landscape:
https://apilandscape.apiscene.io/
apidays New York 2025 - Using GraphQL SDL files as executable API Contracts b...apidays
Using GraphQL SDL files as executable API Contracts
Hari Krishnan, Co-founder & CTO at Specmatic
apidays New York 2025
API Management for Surfing the Next Innovation Waves: GenAI and Open Banking
May 14 & 15, 2025
------
Check out our conferences at https://www.apidays.global/
Do you want to sponsor or talk at one of our conferences?
https://apidays.typeform.com/to/ILJeAaV8
Learn more on APIscene, the global media made by the community for the community:
https://www.apiscene.io
Explore the API ecosystem with the API Landscape:
https://apilandscape.apiscene.io/
apidays New York 2025 - Building Scalable AI Systems by Sai Prasad Veluru (Ap...apidays
Building Scalable AI Systems: Cloud Architecture for Performance
Sai Prasad Veluru, Software Engineer at Apple Inc
apidays New York 2025
API Management for Surfing the Next Innovation Waves: GenAI and Open Banking
May 14 & 15, 2025
------
Check out our conferences at https://www.apidays.global/
Do you want to sponsor or talk at one of our conferences?
https://apidays.typeform.com/to/ILJeAaV8
Learn more on APIscene, the global media made by the community for the community:
https://www.apiscene.io
Explore the API ecosystem with the API Landscape:
https://apilandscape.apiscene.io/
10. Case Study – Filtering Sensitive Data
[Slide diagram] The Source creates WorkUnits; each work unit runs an Extractor, then a Converter and Quality Checker, then Fork and Branching, and finally Writers feeding the DataPublisher. The fork asks "Has sensitive data?": records on the "yes" branch pass through a Sensitive Data Filtering Converter before their Writer, while "no" records go straight to a Writer; both branches are then published by the DataPublisher.
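Gobblin implements this with Java Converter and ForkOperator classes; purely as an illustration of the branching logic in this case study (not Gobblin's actual API), the routing decision can be pictured with a small Python sketch, where the record shape and the sensitivity test are assumptions:

```python
# Illustrative sketch only -- Gobblin implements this with Java Converter and
# ForkOperator classes; this just mirrors the slide's branching logic.
# The record shape and sensitivity test are assumptions.
SENSITIVE_FIELDS = {"ssn", "credit_card"}


def has_sensitive_data(record: dict) -> bool:
    return any(field in record for field in SENSITIVE_FIELDS)


def filter_sensitive(record: dict) -> dict:
    # "Sensitive Data Filtering Converter": drop the sensitive fields.
    return {k: v for k, v in record.items() if k not in SENSITIVE_FIELDS}


def fork_and_write(records, writer_filtered, writer_passthrough):
    """Route each record down one of two branches, as in the slide."""
    for record in records:
        if has_sensitive_data(record):          # "yes" branch
            writer_filtered(filter_sensitive(record))
        else:                                   # "no" branch
            writer_passthrough(record)


if __name__ == "__main__":
    out_a, out_b = [], []
    fork_and_write(
        [{"id": 1, "ssn": "000-00-0000"}, {"id": 2}],
        writer_filtered=out_a.append,
        writer_passthrough=out_b.append,
    )
    print(out_a, out_b)  # [{'id': 1}] [{'id': 2}]
```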
12. State and Metadata Mgmt.
State Store
- Stores runtime metadata, e.g., checkpoints (a.k.a. watermarks), which are carried over between job runs.
- Default impl: serializes job/task states into files, one per run.
- Allows other implementations that conform to the interface to be plugged in.
[Slide example] Successive job runs (#1, #2, #3) read the previous run's watermark (e.g., Sep 2, then Sep 3) from the State Store and commit an updated one.
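As a rough illustration of that carry-over behavior (not Gobblin's actual Java implementation), a minimal file-backed watermark store, with hypothetical job names and watermark values, might look like this:

```python
# Illustrative sketch only -- not Gobblin's Java state store. Shows the
# carry-over pattern: each run reads the previous run's watermark and
# persists a new one, with one state file per run.
import json
import os
from glob import glob


class FileStateStore:
    def __init__(self, root: str):
        self.root = root
        os.makedirs(root, exist_ok=True)

    def latest_watermark(self, job_name: str):
        runs = sorted(glob(os.path.join(self.root, f"{job_name}.run-*.json")))
        if not runs:
            return None  # first run: no previous state
        with open(runs[-1]) as f:
            return json.load(f)["watermark"]

    def commit(self, job_name: str, run_id: int, watermark) -> None:
        path = os.path.join(self.root, f"{job_name}.run-{run_id:06d}.json")
        with open(path, "w") as f:
            json.dump({"watermark": watermark}, f)


if __name__ == "__main__":
    store = FileStateStore("/tmp/state-store")
    prev = store.latest_watermark("kafka-ingest")   # None on the first run
    new_watermark = "2015-09-03"                    # computed by this job run
    store.commit("kafka-ingest", run_id=2, watermark=new_watermark)
```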
14. Running Modes
- Standalone: runs in a single JVM; tasks run in a thread pool.
- Scale-out with MapReduce: each job run launches a MR job, using mappers as containers to run tasks.
- Scale-out with a general distributed resource manager (YARN, *in progress): supports long-running continuous ingestion, with better resource utilization and SLA guarantees.
15. Gobblin in Production @ LinkedIn
• In production since 2014
• Usages
– Internal sources → HDFS: Kafka, MySQL, Dropbox, etc.
– External sources → HDFS: Salesforce, Google Analytics, S3, etc.
– HDFS → HDFS: closed member data purging
– Egress from HDFS (future work)
• Data volume: over a dozen data sources, thousands of datasets, tens of TBs ... daily