Building a Data Warehouse for Business Analytics using Spark SQL (Blagoy Kaloferov, Spark Summit)
Blagoy Kaloferov presented on building a data warehouse at Edmunds.com using Spark SQL. He discussed how Spark SQL simplified ETL and enabled business analysts to build data marts more quickly. He showed how Spark SQL was used to optimize a dealer leads dataset in Platfora, reducing build time from hours to minutes. Finally, he proposed an approach using Spark SQL to automate OEM ad revenue billing by modeling complex rules through collaboration between analysts and developers.
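As a rough illustration of the kind of Spark SQL ETL step such a data mart build involves, the sketch below creates an aggregated dealer-leads table directly from SQL; the database, table and column names are hypothetical, not Edmunds' actual schema or code.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dealer-leads-mart-sketch").getOrCreate()

# Hypothetical schema: raw.leads and raw.dealers stand in for the source tables.
# Analysts can express the whole data-mart build as a single SQL statement.
spark.sql("""
    CREATE TABLE IF NOT EXISTS marts.dealer_leads_daily
    USING parquet
    AS
    SELECT d.dealer_id,
           d.region,
           date(l.created_at) AS lead_date,
           count(*)           AS lead_count
    FROM   raw.leads l
    JOIN   raw.dealers d ON l.dealer_id = d.dealer_id
    GROUP  BY d.dealer_id, d.region, date(l.created_at)
""")
```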
Rajat Venkatesh from Qubole presented on Quark, a virtualization engine for analytics. Quark uses a multi-store architecture to optimize queries using materialized views, predicate injection, and denormalized/sorted tables. It supports multiple SQL and storage engines. The roadmap includes improvements to the cost-based optimizer, support for OLAP cubes, and developing Quark as a service. Coordinates for the Quark GitHub and mailing list were provided.
Designing and Implementing a Real-time Data Lake with Dynamically Changing Schemas (Databricks)
Building a curated data lake on real-time data is an emerging data warehouse pattern with Delta. In the real world, however, we often face dynamically changing schemas, which are a big challenge to incorporate without downtime.
Stream data processing is increasingly required to support business needs for faster actionable insight with growing volume of information from more sources. Apache Apex is a true stream processing framework for low-latency, high-throughput and reliable processing of complex analytics pipelines on clusters. Apex is designed for quick time-to-production, and is used in production by large companies for real-time and batch processing at scale.
This session will use an Apex production use case to walk through the incremental transition from a batch pipeline with hours of latency to an end-to-end streaming architecture with billions of events per day which are processed to deliver real-time analytical reports. The example is representative for many similar extract-transform-load (ETL) use cases with other data sets that can use a common library of building blocks. The transform (or analytics) piece of such pipelines varies in complexity and often involves business logic specific, custom components.
Topics include:
* Pipeline functionality from event source through queryable state for real-time insights.
* API for application development and development process.
* Library of building blocks including connectors for sources and sinks such as Kafka, JMS, Cassandra, HBase, JDBC and how they enable end-to-end exactly-once results.
* Stateful processing with event time windowing.
* Fault tolerance with exactly-once result semantics, checkpointing, incremental recovery
* Scalability and low-latency, high-throughput processing with advanced engine features for auto-scaling, dynamic changes, compute locality.
* Who is using Apex in production, and roadmap.
Following the session, attendees will have a high-level understanding of Apex and how it can be applied to use cases at their own organizations.
Scalable And Incremental Data Profiling With Spark (Jen Aman)
This document discusses how Trifacta uses Spark to enable scalable and incremental data profiling. It describes challenges in profiling large datasets, such as performance and generating flexible jobs. Trifacta addresses these by building a Spark profiling job server that takes profiling specifications as JSON, runs jobs on Spark, and outputs results to HDFS. This pay-as-you-go approach allows profiling to scale to large datasets and different user needs in a flexible manner.
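The job-server protocol itself is Trifacta-specific, but the underlying idea, turning a small profiling specification into a single pass of Spark aggregations and writing the results out, can be sketched roughly as follows; the spec contents, input path and output path are made-up placeholders.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("profiling-sketch").getOrCreate()

# A tiny stand-in for the JSON profiling spec described in the talk.
profile_spec = {"columns": ["price", "mileage"], "metrics": ["count", "nulls", "min", "max"]}

df = spark.read.parquet("hdfs:///data/listings")   # hypothetical input path

exprs = []
for c in profile_spec["columns"]:
    exprs += [
        F.count(c).alias(f"{c}_count"),
        F.sum(F.col(c).isNull().cast("int")).alias(f"{c}_nulls"),
        F.min(c).alias(f"{c}_min"),
        F.max(c).alias(f"{c}_max"),
    ]

# One pass over the data produces all requested statistics.
df.agg(*exprs).write.mode("overwrite").json("hdfs:///profiles/listings")
```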
When OLAP Meets Real-Time, What Happens in eBay? (DataWorks Summit)
OLAP cubes are about pre-aggregation: they reduce query latency by spending more time and resources on data preparation. But for real-time analytics, data preparation and visibility latency are critical. What happens when the OLAP cube meets real-time use cases?
Can we pre-build the cubes in real time in a quicker and more cost-effective way? This is hard, but still doable.
At eBay, we built our own real-time OLAP solution based on Apache Kylin and Apache Kafka. We read unbounded events from a Kafka cluster, then divide the streaming data into three stages: an In-Memory Stage (continuous in-memory aggregation), an On-Disk Stage (flush to disk, with columnar storage and indexes) and a Full Cubing Stage (built with MapReduce or Spark and saved to HBase). Data is aggregated into different layers at each stage, but remains queryable throughout; it moves from one stage to the next automatically and transparently to the user.
This solution was built to support quite a few real-time analytics use cases at eBay; in this session we will also share some of them, such as site speed monitoring and eBay site deal performance.
Speaker:
Qiaoneng Qian, Senior Product Manager, eBay
AWS Data Pipeline is a web service that allows users to design data driven workflows to move and transform data between different AWS services reliably and in a cost effective manner. It allows users to schedule, run, and manage recurring data processing workloads. Data Pipeline includes components like pipeline definitions, schedules, task runners, and objects like shell command activities and S3 data nodes to design extract, transform, load (ETL) processes. It works with services like DynamoDB, RDS, Redshift, S3, and EC2. Pipelines are created by composing definition objects in a file and can be accessed through the AWS Management Console, CLI, SDKs, and APIs.
Realtime streaming architecture in INFINARIO (Jozo Kovac)
About our experience with real-time analytics on a never-ending stream of user events. We discuss the Lambda architecture, Kappa, Apache Kafka and our own approach.
This presentation will describe how to go beyond a "Hello world" stream application and build a real-time data-driven product. We will present architectural patterns, go through tradeoffs and considerations when deciding on technology and implementation strategy, and describe how to put the pieces together. We will also cover necessary practical pieces for building real products: testing streaming applications, and how to evolve products over time.
Presented at highloadstrategy.com 2016 by Øyvind Løkling (Schibsted Products & Technology), joint work with Lars Albertsson (independent, www.mapflat.com).
Building a Data Pipeline from Scratch - Joe Crobak (Hakka Labs)
A data pipeline is a unified system for capturing events for analysis and building products. It involves capturing user events from various sources, storing them in a centralized data warehouse, and performing analysis and building products using tools like Hadoop. Key components of a data pipeline include an event framework, message bus, data serialization, data persistence, workflow management, and batch processing. A Lambda architecture allows for both batch and real-time processing of data captured by the pipeline.
End-to-End Spark/TensorFlow/PyTorch Pipelines with Databricks Delta (Databricks)
Hopsworks is an open-source data platform that can be used to both develop and operate horizontally scalable machine learning pipelines. A key part of our pipelines is the world’s first open-source Feature Store, based on Apache Hive, that acts as a data warehouse for features, providing a natural API between data engineers – who write feature engineering code in Spark (in Scala or Python) – and Data Scientists, who select features from the feature store to generate training/test data for models. In this talk, we will discuss how Databricks Delta solves several of the key challenges in building both feature engineering pipelines that feed our Feature Store and in managing the feature data itself.
Firstly, we will show how expectations and schema enforcement in Databricks Delta can be used to provide data validation, ensuring that feature data does not have missing or invalid values that could negatively affect model training. Secondly, time-travel in Databricks Delta can be used to provide version management and experiment reproducibility for training/test datasets. That is, given a model, you can re-run the training experiment for that model using the same version of the data that was used to train the model.
We will also discuss the next steps needed to take this work to the next level. Finally, we will perform a live demo, showing how Delta can be used in end-to-end ML pipelines using Spark on Hopsworks.
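As a minimal sketch of the two Delta features highlighted above (paths and version numbers are placeholders, not Hopsworks' actual layout): time travel is exposed through a read option, and schema enforcement rejects appends whose schema does not match the table unless schema evolution is explicitly enabled.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("delta-sketch").getOrCreate()

# Re-read the exact snapshot of the feature data a model was trained on.
train_features_v3 = (
    spark.read.format("delta")
         .option("versionAsOf", 3)                 # or .option("timestampAsOf", "2019-06-01")
         .load("/feature_store/offline/trips")     # hypothetical feature group path
)

# Schema enforcement: appending a DataFrame with an unexpected column fails
# unless schema evolution is explicitly allowed.
new_batch = spark.read.parquet("/staging/trips_batch")
(new_batch.write.format("delta")
          .mode("append")
          # .option("mergeSchema", "true")         # opt-in schema evolution
          .save("/feature_store/offline/trips"))
```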
The modern data customer wants data now. Batch workloads are not going anywhere, but at Scribd the future of our data platform requires more and more streaming data sets.
This document discusses various techniques for building recommendation systems using Apache Spark. It begins with an overview of scaling techniques using parallelism and composability. Various similarity measures are then covered, including Euclidean, cosine, Jaccard, and word embeddings. Recommendation approaches like item-to-item graphs and personalized PageRank are demonstrated. The document also discusses feature engineering, modeling techniques, and evaluating recommendations. Live demos are provided of word similarity, movie recommendations, sentiment analysis and more.
Efficiently Building Machine Learning Models for Predictive Maintenance in th... (Databricks)
At each drilling site, thousands of different pieces of equipment operate simultaneously 24/7. In the oil & gas industry, downtime can cost millions of dollars daily. As current standard practice, most of the equipment is on scheduled maintenance, with standby units to reduce downtime.
This document discusses building a feature store using Apache Spark and dataframes. It provides examples of major feature store concepts like feature groups, training/test datasets, and joins. Feature store implementations from companies like Uber, Airbnb and Netflix are also mentioned. The document outlines the architecture of storing both online and offline feature groups and describes the evolution of the feature store API to better support concepts like feature versioning, multiple stores, complex joins and time travel. Use cases demonstrated include fraud detection in banking and modeling crop yields using joined weather and agricultural data.
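A minimal sketch of the join-centric usage the document describes, assuming two hypothetical offline feature groups registered as Spark tables rather than any particular feature store API:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("feature-store-sketch").getOrCreate()

# Hypothetical offline feature groups registered as tables.
weather = spark.table("fs.weather_features")      # keys: region, date
yields  = spark.table("fs.crop_yield_features")   # keys: region, date; label: yield_tons

# Join feature groups on their entity keys to assemble a training dataset.
training_df = (
    yields.join(weather, on=["region", "date"], how="left")
          .select("region", "date", "rainfall_mm", "avg_temp_c", "yield_tons")
)

training_df.write.mode("overwrite").parquet("/training_datasets/crop_yield_v1")
```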
Interactive Visualization of Streaming Data Powered by Spark by Ruhollah Farc... (Spark Summit)
This document discusses how to visualize streaming data using Spark. It describes how Spark Streaming can be used to process streaming data in real-time and integrate it with visualization tools. Key points include:
- Spark Streaming receives streaming data from sources like Kafka and processes it using in-memory computations in a single JVM cluster.
- The processed data can be stored in buffers like MongoDB or output to systems like MemSQL and Solr to enable interactive visualizations that update in real time.
- A demo is shown of Twitter data being streamed and analyzed using Spark Streaming with results stored in MemSQL and Solr for visualization.
- Benefits of this approach include being able to work with streaming data
Dynamic Partition Pruning in Apache Spark (Databricks)
In data analytics frameworks such as Spark it is important to detect and avoid scanning data that is irrelevant to the executed query, an optimization which is known as partition pruning. Dynamic partition pruning occurs when the optimizer is unable to identify at parse time the partitions it has to eliminate. In particular, we consider a star schema which consists of one or multiple fact tables referencing any number of dimension tables. In such join operations, we can prune the partitions the join reads from a fact table by identifying those partitions that result from filtering the dimension tables. In this talk we present a mechanism for performing dynamic partition pruning at runtime by reusing the dimension table broadcast results in hash joins and we show significant improvements for most TPC-DS queries.
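A small example of the star-schema pattern where dynamic partition pruning applies, with illustrative table names: the fact table is partitioned on the join key, the selective filter sits on the dimension table, so the partitions to scan are only known once the dimension side has been filtered at runtime.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dpp-sketch").getOrCreate()

# Available since Spark 3.0 and enabled by default.
spark.conf.set("spark.sql.optimizer.dynamicPartitionPruning.enabled", "true")

# Assume sales is a fact table partitioned by date_id and dim_date is a dimension table.
# The filter is on dim_date, so the qualifying date_id partitions of sales are only
# known after the dimension table has been filtered at runtime.
pruned = spark.sql("""
    SELECT d.year, SUM(s.amount) AS revenue
    FROM   sales s
    JOIN   dim_date d ON s.date_id = d.date_id
    WHERE  d.year = 2019
    GROUP  BY d.year
""")
pruned.explain()   # the physical plan should show a dynamic pruning expression on the fact-table scan
```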
Healthcare Claim Reimbursement using Apache Spark (Databricks)
The document discusses rewriting a claims reimbursement system using Spark. It describes how Spark provides better performance, scalability and cost savings compared to the previous Oracle-based system. Key points include using Spark for ETL to load data into a Delta Lake data lake, implementing the business logic in a reusable Java library, and seeing significant increases in processing volumes and speeds compared to the prior system. Challenges and tips for adoption are also provided.
Conquering the Lambda architecture in LinkedIn metrics platform with Apache C... (Khai Tran)
This document discusses LinkedIn's transition from an offline metrics platform to a near real-time "nearline" architecture using Apache Calcite and Apache Samza. It overviews LinkedIn's metrics platform and needs, and then details how the new nearline architecture works by translating Pig jobs into optimized Samza jobs using Calcite's relational algebra and query planning. An example production use case for analyzing storylines on the LinkedIn platform is also presented. The nearline architecture allows metrics to be computed with latencies of 5-30 minutes rather than 3-6 hours previously.
Spark Streaming and IoT by Mike Freedman (Spark Summit)
This document discusses using Spark Streaming for IoT applications and the challenges involved. It notes that while Spark simplifies programming across different processing intervals from batch to stream, programming models alone are not sufficient as IoT data streams can have varying rates and delays. It proposes a unified data infrastructure with abstractions like data series that support joining real-time and historical data while handling delays transparently. It also suggests approaches for Spark Streaming to better support processing many independent low-volume IoT streams concurrently and improving resource utilization for such applications. Finally, it introduces the Device-Model-Infra framework for addressing these IoT analytics challenges through combined programming models and data abstractions.
Proud to be Polyglot - Riviera Dev 2015 (Tugdual Grall)
The document discusses the benefits of using multiple programming languages and data stores, or a "polyglot" approach, for modern applications. A polyglot approach allows using the right tool for each task, rather than trying to force a single technology to fit all needs. This improves performance, scalability, and the ability to adapt applications to changing requirements compared to traditional monolithic architectures. The document provides examples of when to use different languages and data stores and concludes that a polyglot approach makes applications easier to maintain over time.
Keeping Identity Graphs In Sync With Apache Spark (Databricks)
The online advertising industry is based on identifying users with cookies, and showing relevant ads to interested users. But there are many data providers, many places to target ads and many people browsing online. How can we identify users across data providers? The first step in solving this is by cookie mapping: a chain of server calls that pass identifiers across providers. Sadly, chains break, servers break, providers can be flaky or use caching and you may never see the whole of the chain. The solution to this problem is constructing an identity graph with the data we see: in our case, cookie ids are nodes, edges are relations and connected components of the graph are users.
In this talk I will explain how Hybrid Theory leverages Spark and GraphFrames to construct and maintain a two-billion-node identity graph with minimal computational cost.
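The connected-components step can be sketched with GraphFrames roughly as follows; the cookie ids are made up and this is not Hybrid Theory's production code. GraphFrames expects vertex `id` and edge `src`/`dst` columns, and connectedComponents requires a checkpoint directory.

```python
from pyspark.sql import SparkSession
from graphframes import GraphFrame

spark = SparkSession.builder.appName("identity-graph-sketch").getOrCreate()
spark.sparkContext.setCheckpointDir("/tmp/graphframes-checkpoints")

# Cookie ids as vertices, observed mapping calls as edges (illustrative data).
vertices = spark.createDataFrame(
    [("cookieA",), ("cookieB",), ("cookieC",)], ["id"])
edges = spark.createDataFrame(
    [("cookieA", "cookieB"), ("cookieB", "cookieC")], ["src", "dst"])

g = GraphFrame(vertices, edges)

# Each connected component corresponds to one inferred user.
users = g.connectedComponents()
users.groupBy("component").count().show()
```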
Modern ETL Pipelines with Change Data Capture (Databricks)
In this talk we'll present how at GetYourGuide we've built from scratch a completely new ETL pipeline using Debezium, Kafka, Spark and Airflow, which can automatically handle schema changes. Our starting point was an error-prone legacy system that ran daily and was vulnerable to breaking schema changes, which caused many sleepless on-call nights. Like most companies, we also have traditional SQL databases that we need to connect to in order to extract relevant data.
This is usually done through either full or partial copies of the data with tools such as Sqoop. However, another approach that has become quite popular lately is to use Debezium as the Change Data Capture layer, which reads database binlogs and streams these changes directly to Kafka. As having data once a day is no longer enough for our business, and we wanted our pipelines to be resilient to upstream schema changes, we decided to rebuild our ETL around Debezium.
We'll walk the audience through the steps we followed to architect and develop such a solution using Databricks to reduce operation time. By building this new pipeline we are now able to refresh our data lake multiple times a day, giving our users fresh data and protecting our nights of sleep.
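As a rough sketch of the Spark side of such a pipeline, a Debezium topic can be consumed with Structured Streaming and the change events appended to a Delta table. The topic name, envelope schema and paths below are placeholders, and the real Debezium envelope carries more fields than shown here.

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StructType, StructField, StringType, LongType

spark = SparkSession.builder.appName("cdc-sketch").getOrCreate()

# Simplified stand-in for the Debezium change envelope of one source table.
after_schema = StructType([
    StructField("order_id", LongType()),     # hypothetical source columns
    StructField("status", StringType()),
])
payload_schema = StructType([
    StructField("op", StringType()),         # c = create, u = update, d = delete
    StructField("ts_ms", LongType()),
    StructField("after", after_schema),      # row state after the change
])

changes = (
    spark.readStream.format("kafka")
         .option("kafka.bootstrap.servers", "broker:9092")
         .option("subscribe", "dbserver1.inventory.orders")   # hypothetical topic
         .load()
         .select(F.from_json(F.col("value").cast("string"), payload_schema).alias("e"))
         .select("e.op", "e.ts_ms", "e.after")
)

(changes.writeStream
        .format("delta")
        .option("checkpointLocation", "/checkpoints/orders_cdc")
        .start("/lake/raw/orders_cdc"))
```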
Spark Summit EU talk by Stephan Kessler (Spark Summit)
This document summarizes a talk given by Stephan Kessler at the Spark Summit Europe 2016 about integrating business functionality and specialized engines into Apache Spark using SAP HANA Vora. Key points discussed include using currency conversion and time series query capabilities directly in Spark by pushing computations to the relevant data sources via Spark extensions. SAP HANA Vora allows moving parts of the Spark logical query plan to various data sources like HANA, graph and document stores to perform analysis close to the data.
Presentation given at the Hadoop User Group France meetup on January 14, 2016.
Real-time analytics with Riak and Spark, by Michael Carney (Basho) and Olivier Girardot of Lateral Thoughts.
According to a Salesforce report, the number of data sources analyzed by companies will grow by 83% over the next five years, so organizations now want to deliver insights in real time, even on mobile devices. Real-time processing is therefore the future of big data analytics.
This talk will present what's new in real-time analytics around the Riak database family and Spark.
Michael Carney is Basho's Sales Director for Southern Europe. A founder of MySQL France and of MariaDB, Michael joined Basho in January 2015 to explore the world of data without tables!
Olivier Girardot is the CTO of Lateral Thoughts; he is a developer and trainer on Spark, and a Java/Python specialist in the capital markets domain.
A Big Data Lake Based on Spark for BBVA Bank (Oscar Mendez, STRATIO; Spark Summit)
This document describes BBVA's implementation of a Big Data Lake using Apache Spark for log collection, storage, and analytics. It discusses:
1) Using Syslog-ng for log collection from over 2,000 applications and devices, distributing logs to Kafka.
2) Storing normalized logs in HDFS and performing analytics using Spark, with outputs to analytics, compliance, and indexing systems.
3) Choosing Spark because it allows interactive, batch, and stream processing with one system using RDDs, SQL, streaming, and machine learning.
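Point 3 can be illustrated with a small present-day sketch using the DataFrame API (the original talk predates much of it and used RDDs): the same transformation runs over logs already landed in HDFS and over logs arriving from Kafka. Paths, topic and log fields are hypothetical.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("one-engine-sketch").getOrCreate()

def errors_per_app(logs_df):
    # The same transformation works for batch and streaming DataFrames.
    return (logs_df.filter(F.col("level") == "ERROR")
                   .groupBy("app")
                   .count())

# Batch: analyze normalized logs already landed in HDFS (hypothetical path/schema).
batch_logs = spark.read.json("hdfs:///datalake/normalized_logs/")
errors_per_app(batch_logs).show()

# Streaming: the same logic over logs arriving from Kafka (hypothetical topic).
stream_logs = (
    spark.readStream.format("kafka")
         .option("kafka.bootstrap.servers", "broker:9092")
         .option("subscribe", "normalized-logs")
         .load()
         .select(F.get_json_object(F.col("value").cast("string"), "$.level").alias("level"),
                 F.get_json_object(F.col("value").cast("string"), "$.app").alias("app"))
)
(errors_per_app(stream_logs)
    .writeStream.outputMode("complete").format("console").start())
# Call awaitTermination() on the returned query to keep the stream running.
```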
Writing Continuous Applications with Structured Streaming PySpark API (Databricks)
"We're amidst the Big Data Zeitgeist era in which data comes at us fast, in myriad forms and formats at intermittent intervals or in a continuous stream, and we need to respond to streaming data immediately. This need has created a notion of writing a streaming application that’s continuous, reacts and interacts with data in real-time. We call this continuous application.
In this tutorial we'll explore the concepts and motivations behind the continuous application, how Structured Streaming Python APIs in Apache Spark™ enable writing continuous applications, examine the programming model behind Structured Streaming, and look at the APIs that support them.
Through presentation, code examples, and notebooks, I will demonstrate how to write an end-to-end Structured Streaming application that reacts and interacts with both real-time and historical data to perform advanced analytics using Spark SQL, DataFrames and Datasets APIs.
You’ll walk away with an understanding of what’s a continuous application, appreciate the easy-to-use Structured Streaming APIs, and why Structured Streaming in Apache Spark is a step forward in developing new kinds of streaming applications.
This tutorial will be both instructor-led and hands-on interactive session. Instructions in how to get tutorial materials will be covered in class.
WHAT YOU’LL LEARN:
– Understand the concepts and motivations behind Structured Streaming
– How to use DataFrame APIs
– How to use Spark SQL and create tables on streaming data
– How to write a simple end-to-end continuous application
PREREQUISITES
– A fully-charged laptop (8-16GB memory) with Chrome or Firefox
– Pre-register for Databricks Community Edition"
Speaker: Jules Damji
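As a flavor of what the tutorial builds, a minimal continuous application in PySpark can look like the sketch below: a streaming word count whose results are registered as an in-memory table and queried with Spark SQL. The socket source, port and query name are placeholders, not the tutorial's actual notebook code.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("continuous-app-sketch").getOrCreate()

# Read lines from a socket (placeholder source; Kafka works the same way).
lines = (spark.readStream.format("socket")
              .option("host", "localhost")
              .option("port", 9999)
              .load())

# Split lines into words and count them as new data arrives.
words = lines.select(F.explode(F.split(lines.value, " ")).alias("word"))
counts = words.groupBy("word").count()

# Register the continuously updated result as an in-memory table.
query = (counts.writeStream
               .outputMode("complete")
               .format("memory")
               .queryName("word_counts")
               .start())

# The streaming state is now queryable with plain Spark SQL.
spark.sql("SELECT * FROM word_counts ORDER BY count DESC").show()

query.awaitTermination()
```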
Introduction to Apache Kafka, Confluent and why they matter (Paolo Castagna)
This is a short and introductory presentation on Apache Kafka (including the Kafka Connect APIs and Kafka Streams APIs, both part of Apache Kafka) and other open source components that are part of the Confluent platform (such as KSQL).
This was the first Kafka Meetup in South Africa.
Unify Analytics: Combine Strengths of Data Lake and Data Warehouse (Paige Roberts)
ODSC West presentation, October 2020: technical and spiritual unification of BI and data science teams will benefit businesses powerfully. The evolution of data architectures is making that possible.
Data Con LA 2020
Description
The data warehouse has been an analytics workhorse for decades. Unprecedented volumes of data, new types of data, and the need for advanced analyses like machine learning brought on the age of the data lake. But Hadoop by itself doesn't really live up to the hype. Now, many companies have a data lake, a data warehouse, or a mishmash of both, possibly combined with a mandate to go to the cloud. The end result can be a sprawling mess, a lot of duplicated effort, a lot of missed opportunities, a lot of projects that never made it into production, and a lot of financial investment without return. Technical and spiritual unification of the two opposed camps can make a powerful impact on the effectiveness of analytics for the business overall. Over time, different organizations with massive IoT workloads have found practical ways to bridge the artificial gap between these two data management strategies. Look under the hood at how companies have gotten IoT ML projects working, and how their data architectures have changed over time. Learn about new architectures that successfully supply the needs of both business analysts and data scientists. Get a peek at the future. In this area, no one likes surprises.
* Look at successful data architectures from companies like Philips, Anritsu, and Uber
* Learn to eliminate duplication of effort between data science and BI data engineering teams
* Avoid some of the traps that have caused so many big data analytics implementations to fail
* Get AI and ML projects into production where they have real impact, without bogging down essential BI
* Study analytics architectures that work, why and how they work, and where they're going from here
Speaker
Paige Roberts, Vertica, Open Source Relations Manager
This document discusses Infobip's journey towards enabling real-time querying of aggregated data. Initially, Infobip had a monolithic architecture with a single database that became a bottleneck. They introduced multiple databases and microservices but querying spanned databases and results had to be joined. A data warehouse (GREEN) provided reporting but was not real-time. To enable real-time queries, Infobip implemented a lambda architecture using Kafka as the real-time data pipeline and Druid for real-time querying and aggregations, achieving sub-second responses and less than 2 seconds of data delay. This allows real-time insights from ingested messaging data while GREEN remains the batch/serving layer.
This document describes Hopsworks, an end-to-end data platform for analytics and machine learning built by KTH and RISE SICS. It provides data ingestion, preparation, experimentation, model training, and deployment capabilities. The platform is built on Apache technologies like Apache Beam, Spark, Flink, Kafka, and uses Kubernetes for orchestration. It also includes a feature store for ML features. The document then discusses Apache Flink and its use for stream processing applications. It provides examples of using Flink's APIs like SQL, CEP, and machine learning. Finally, it introduces the concept of continuous deep analytics and the Arcon framework for unified analytics across streams, tensors, graphs and more through an intermediate
A Hudi Live Event: Shaping a Database Experience within the Data Lake with Apache Hudi (nadine39280)
Discover the evolution of Apache Hudi within the open-source realm - a community and project pushing the boundaries of data lake possibilities. This presentation delves into Apache Hudi 1.0, a pivotal release reimagining its transactional database layer while honoring its foundational principles. Join us in this transformative journey!
Join the Apache Hudi Community
https://join.slack.com/t/apache-hudi/shared_invite/zt-20r833rxh-627NWYDUyR8jRtMa2mZ~gg.
Follow us on LinkedIn and Twitter
https://www.linkedin.com/company/apache-hudi/
https://twitter.com/apachehudi
The document outlines the roadmap for SQL Server, including enhancements to performance, security, availability, development tools, and big data capabilities. Key updates include improved intelligent query processing, confidential computing with secure enclaves, high availability options on Kubernetes, machine learning services, and tools in Azure Data Studio. The roadmap aims to make SQL Server the most secure, high performing, and intelligent data platform across on-premises, private cloud and public cloud environments.
Along with the arrival of BigData, a parallel yet less well known but significant change to the way we process data has occurred. Data is getting faster! Business models are changing radically based on the ability to be first to know insights and act appropriately to keep the customer, prevent the breakdown or save the patient. In essence, knowing something now is overriding knowing everything later. Stream processing engines allow us to blend event streams from different internal and external sources to gain insights in real time. This talk will discuss the need for streaming, business models it can change, new applications it allows and why Apache Flink enables these applications. Apache Flink is a top Level Apache Project for real time stream processing at scale. It is a high throughput, low latency, fault tolerant, distributed, state based stream processing engine. Flink has associated Polyglot APIs (Scala, Python, Java) for manipulating streams, a Complex Event Processor for monitoring and alerting on the streams and integration points with other big data ecosystem tooling.
SnappyData is a new open source project started by Pivotal GemFire founders to build a unified cluster capable of OLTP, OLAP, and streaming analytics using Spark. SnappyData fuses an elastic, highly available in-memory store for OLTP with Spark's memory manager and query engine to provide a single system for mixed workloads with fast ingestion, high concurrency and the ability to work with live, mutable data.
- Project Lightspeed is the next generation of Apache Spark Structured Streaming that aims to provide faster and simpler stream processing with predictable low latency.
- It targets reducing tail latency by up to 2x through faster bookkeeping and offset management. It also enhances functionality with advanced capabilities like new operators and easy to use APIs.
- Project Lightspeed also aims to simplify deployment, operations, monitoring and troubleshooting of streaming applications. It seeks to improve ecosystem support for connectors, authentication and authorization.
- Some specific improvements include faster micro-batch processing, enhancing Python as a first class citizen, and making debugging of streaming jobs easier through visualizations.
An XStreams adapter allows Oracle CEP to ingest and process events from an XStreams data stream. The adapter plugs into an event processing network to enable CEP queries over the streaming data. It can be used for applications like algorithmic trading, telecommunications monitoring, and RFID tracking where low latency event processing is important. A demo uses the adapter to run the Linear Road benchmark over sensor data from a simulated variable toll expressway system.
An Architect's guide to real time big data systems (Raja SP)
Introduction to real-time big data and stream computing using InfoSphere Streams and Apache Storm. Presented at a Big Data conference in Singapore, July 2014.
MongoDB World 2019: Streaming ETL on the Shoulders of Giants (MongoDB)
This document discusses streaming ETL using Apache Kafka and MongoDB as a modern data platform. It provides an overview of streaming data and how it can help with speed and agility compared to traditional batch ETL processes. It then discusses how Apache Kafka acts as a streaming platform and messaging system that can be used to build streaming data applications and integrate data from various sources using Kafka Connect. The document announces the availability of the MongoDB connector for Kafka Connect, which allows streaming data between Kafka and MongoDB collections. It concludes with a demo scenario showing how this could work in practice.
The advantages of Arista/OVH configurations, and the technologies behind buil... (OVHcloud)
Arista will put an emphasis on the technologies behind building and operating datacentres, and the reasons they deliver the expected results (handling varied traffic spikes, increasing bandwidth, endpoints and security), including very large-scale production environments.
Lightbend Fast Data Platform - A Technical Overview
Dean Wampler, O’Reilly author and Big Data Strategist in the office of the CTO at Lightbend discusses practical tips for architecting stream-processing applications and explains how you can tame some of the complexity in moving from data at rest to data in motion.
Keynote of HadoopCon 2014 Taiwan:
* Data analytics platform architecture & designs
* Lambda architecture overview
* Using SQL as DSL for stream processing
* Lambda architecture using SQL
Spark Seattle meetup - Breaking ETL barrier with Spark Streaming
1. Breaking ETL barrier with real-time reporting using Kafka and Spark Streaming. Santosh Sahoo, Architect at Concur.
2. About us: Concur (now part of SAP) provides travel and expense management services to businesses. The Data Insights team is building solutions to provide customer access to data, visualization and reporting.
8. Spark Streaming. What? A data processing framework to build scalable, fault-tolerant streaming applications. Why? It lets you reuse the same code for batch processing, join streams against historical data, or run ad-hoc queries on stream state.
10. Kafka - Flow Management:
* No-nonsense logging
* 100K/s throughput vs 20K/s for RabbitMQ
* Log compaction
* Durable persistence
* Partition tolerance
* Replication
* Best-in-class integration with Spark
12. Optimized Direct Kafka API
https://databricks.com/blog/2015/03/30/improvements-to-kafka-integration-of-spark-streaming.html
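For reference, the direct (receiver-less) Kafka integration discussed in that post looked roughly like this in the DStream-era PySpark API (Spark 1.x/2.x, with the spark-streaming-kafka package on the classpath; it has since been superseded by Structured Streaming's Kafka source). The broker address and topic name are placeholders.

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

sc = SparkContext(appName="direct-kafka-sketch")
ssc = StreamingContext(sc, batchDuration=5)   # 5-second micro-batches

# Direct approach: no receivers, offsets tracked by Spark itself,
# one Kafka partition maps to one RDD partition.
stream = KafkaUtils.createDirectStream(
    ssc,
    topics=["expense_reports"],                          # hypothetical topic
    kafkaParams={"metadata.broker.list": "broker:9092"},
)

stream.map(lambda kv: kv[1]).count().pprint()            # message values only

ssc.start()
ssc.awaitTermination()
```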