Building a Data Warehouse for Business Analytics using Spark SQL (Blagoy Kaloferov, Spark Summit)
Blagoy Kaloferov presented on building a data warehouse at Edmunds.com using Spark SQL. He discussed how Spark SQL simplified ETL and enabled business analysts to build data marts more quickly. He showed how Spark SQL was used to optimize a dealer leads dataset in Platfora, reducing build time from hours to minutes. Finally, he proposed an approach using Spark SQL to automate OEM ad revenue billing by modeling complex rules through collaboration between analysts and developers.
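As a rough illustration of the kind of Spark SQL ETL step such a data mart build involves, the sketch below creates an aggregated dealer-leads table directly from SQL; the database, table and column names are hypothetical, not Edmunds' actual schema or code.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dealer-leads-mart-sketch").getOrCreate()

# Hypothetical schema: raw.leads and raw.dealers stand in for the source tables.
# Analysts can express the whole data-mart build as a single SQL statement.
spark.sql("""
    CREATE TABLE IF NOT EXISTS marts.dealer_leads_daily
    USING parquet
    AS
    SELECT d.dealer_id,
           d.region,
           date(l.created_at) AS lead_date,
           count(*)           AS lead_count
    FROM   raw.leads l
    JOIN   raw.dealers d ON l.dealer_id = d.dealer_id
    GROUP  BY d.dealer_id, d.region, date(l.created_at)
""")
```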
Rajat Venkatesh from Qubole presented on Quark, a virtualization engine for analytics. Quark uses a multi-store architecture to optimize queries using materialized views, predicate injection, and denormalized/sorted tables. It supports multiple SQL and storage engines. The roadmap includes improvements to the cost-based optimizer, support for OLAP cubes, and developing Quark as a service. Coordinates for the Quark GitHub and mailing list were provided.
Designing and Implementing a Real-time Data Lake with Dynamically Changing Schemas (Databricks)
Building a curated data lake on real-time data is an emerging data warehouse pattern with Delta. In the real world, however, we often face dynamically changing schemas, which are a big challenge to incorporate without downtime.
Stream data processing is increasingly required to support business needs for faster actionable insight with growing volume of information from more sources. Apache Apex is a true stream processing framework for low-latency, high-throughput and reliable processing of complex analytics pipelines on clusters. Apex is designed for quick time-to-production, and is used in production by large companies for real-time and batch processing at scale.
This session will use an Apex production use case to walk through the incremental transition from a batch pipeline with hours of latency to an end-to-end streaming architecture with billions of events per day which are processed to deliver real-time analytical reports. The example is representative for many similar extract-transform-load (ETL) use cases with other data sets that can use a common library of building blocks. The transform (or analytics) piece of such pipelines varies in complexity and often involves business logic specific, custom components.
Topics include:
* Pipeline functionality from event source through queryable state for real-time insights.
* API for application development and development process.
* Library of building blocks including connectors for sources and sinks such as Kafka, JMS, Cassandra, HBase, JDBC and how they enable end-to-end exactly-once results.
* Stateful processing with event time windowing.
* Fault tolerance with exactly-once result semantics, checkpointing, incremental recovery
* Scalability and low-latency, high-throughput processing with advanced engine features for auto-scaling, dynamic changes, compute locality.
* Who is using Apex in production, and roadmap.
Following the session, attendees will have a high-level understanding of Apex and how it can be applied to use cases at their own organizations.
Scalable And Incremental Data Profiling With Spark (Jen Aman)
This document discusses how Trifacta uses Spark to enable scalable and incremental data profiling. It describes challenges in profiling large datasets, such as performance and generating flexible jobs. Trifacta addresses these by building a Spark profiling job server that takes profiling specifications as JSON, runs jobs on Spark, and outputs results to HDFS. This pay-as-you-go approach allows profiling to scale to large datasets and different user needs in a flexible manner.
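The job-server protocol itself is Trifacta-specific, but the underlying idea, turning a small profiling specification into a single pass of Spark aggregations and writing the results out, can be sketched roughly as follows; the spec contents, input path and output path are made-up placeholders.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("profiling-sketch").getOrCreate()

# A tiny stand-in for the JSON profiling spec described in the talk.
profile_spec = {"columns": ["price", "mileage"], "metrics": ["count", "nulls", "min", "max"]}

df = spark.read.parquet("hdfs:///data/listings")   # hypothetical input path

exprs = []
for c in profile_spec["columns"]:
    exprs += [
        F.count(c).alias(f"{c}_count"),
        F.sum(F.col(c).isNull().cast("int")).alias(f"{c}_nulls"),
        F.min(c).alias(f"{c}_min"),
        F.max(c).alias(f"{c}_max"),
    ]

# One pass over the data produces all requested statistics.
df.agg(*exprs).write.mode("overwrite").json("hdfs:///profiles/listings")
```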
When OLAP Meets Real-Time, What Happens in eBay? (DataWorks Summit)
OLAP cubes are about pre-aggregation: they reduce query latency by spending more time and resources on data preparation. But for real-time analytics, data preparation and visibility latency are critical. What happens when the OLAP cube meets real-time use cases?
Can we pre-build the cubes in real time in a quicker and more cost-effective way? This is hard, but still doable.
At eBay, we built our own real-time OLAP solution based on Apache Kylin and Apache Kafka. We read unbounded events from a Kafka cluster, then divide the streaming data into three stages: an In-Memory Stage (continuous in-memory aggregation), an On-Disk Stage (flush to disk, with columnar storage and indexes) and a Full Cubing Stage (built with MapReduce or Spark and saved to HBase). Data is aggregated into different layers at each stage, but remains queryable throughout; it moves from one stage to the next automatically and transparently to the user.
This solution was built to support quite a few real-time analytics use cases at eBay; in this session we will also share some of them, such as site speed monitoring and eBay site deal performance.
Speaker:
Qiaoneng Qian, Senior Product Manager, eBay
AWS Data Pipeline is a web service that allows users to design data driven workflows to move and transform data between different AWS services reliably and in a cost effective manner. It allows users to schedule, run, and manage recurring data processing workloads. Data Pipeline includes components like pipeline definitions, schedules, task runners, and objects like shell command activities and S3 data nodes to design extract, transform, load (ETL) processes. It works with services like DynamoDB, RDS, Redshift, S3, and EC2. Pipelines are created by composing definition objects in a file and can be accessed through the AWS Management Console, CLI, SDKs, and APIs.
Realtime streaming architecture in INFINARIO (Jozo Kovac)
About our experience with real-time analytics on a never-ending stream of user events. We discuss the Lambda architecture, Kappa, Apache Kafka and our own approach.
This presentation will describe how to go beyond a "Hello world" stream application and build a real-time data-driven product. We will present architectural patterns, go through tradeoffs and considerations when deciding on technology and implementation strategy, and describe how to put the pieces together. We will also cover necessary practical pieces for building real products: testing streaming applications, and how to evolve products over time.
Presented at highloadstrategy.com 2016 by Øyvind Løkling (Schibsted Products & Technology), joint work with Lars Albertsson (independent, www.mapflat.com).
Building a Data Pipeline from Scratch - Joe Crobak (Hakka Labs)
A data pipeline is a unified system for capturing events for analysis and building products. It involves capturing user events from various sources, storing them in a centralized data warehouse, and performing analysis and building products using tools like Hadoop. Key components of a data pipeline include an event framework, message bus, data serialization, data persistence, workflow management, and batch processing. A Lambda architecture allows for both batch and real-time processing of data captured by the pipeline.
End-to-End Spark/TensorFlow/PyTorch Pipelines with Databricks Delta (Databricks)
Hopsworks is an open-source data platform that can be used to both develop and operate horizontally scalable machine learning pipelines. A key part of our pipelines is the world’s first open-source Feature Store, based on Apache Hive, that acts as a data warehouse for features, providing a natural API between data engineers – who write feature engineering code in Spark (in Scala or Python) – and Data Scientists, who select features from the feature store to generate training/test data for models. In this talk, we will discuss how Databricks Delta solves several of the key challenges in building both feature engineering pipelines that feed our Feature Store and in managing the feature data itself.
Firstly, we will show how expectations and schema enforcement in Databricks Delta can be used to provide data validation, ensuring that feature data does not have missing or invalid values that could negatively affect model training. Secondly, time-travel in Databricks Delta can be used to provide version management and experiment reproducibility for training/test datasets. That is, given a model, you can re-run the training experiment for that model using the same version of the data that was used to train the model.
We will also discuss the next steps needed to take this work to the next level. Finally, we will perform a live demo, showing how Delta can be used in end-to-end ML pipelines using Spark on Hopsworks.
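As a minimal sketch of the two Delta features highlighted above (paths and version numbers are placeholders, not Hopsworks' actual layout): time travel is exposed through a read option, and schema enforcement rejects appends whose schema does not match the table unless schema evolution is explicitly enabled.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("delta-sketch").getOrCreate()

# Re-read the exact snapshot of the feature data a model was trained on.
train_features_v3 = (
    spark.read.format("delta")
         .option("versionAsOf", 3)                 # or .option("timestampAsOf", "2019-06-01")
         .load("/feature_store/offline/trips")     # hypothetical feature group path
)

# Schema enforcement: appending a DataFrame with an unexpected column fails
# unless schema evolution is explicitly allowed.
new_batch = spark.read.parquet("/staging/trips_batch")
(new_batch.write.format("delta")
          .mode("append")
          # .option("mergeSchema", "true")         # opt-in schema evolution
          .save("/feature_store/offline/trips"))
```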
The modern data customer wants data now. Batch workloads are not going anywhere, but at Scribd the future of our data platform requires more and more streaming data sets.
This document discusses various techniques for building recommendation systems using Apache Spark. It begins with an overview of scaling techniques using parallelism and composability. Various similarity measures are then covered, including Euclidean, cosine, Jaccard, and word embeddings. Recommendation approaches like item-to-item graphs and personalized PageRank are demonstrated. The document also discusses feature engineering, modeling techniques, and evaluating recommendations. Live demos are provided of word similarity, movie recommendations, sentiment analysis and more.
Efficiently Building Machine Learning Models for Predictive Maintenance in th... (Databricks)
At each drilling site, thousands of different pieces of equipment operate simultaneously 24/7. In the oil & gas industry, downtime can cost millions of dollars daily. As current standard practice, most of the equipment is on scheduled maintenance, with standby units to reduce downtime.
This document discusses building a feature store using Apache Spark and dataframes. It provides examples of major feature store concepts like feature groups, training/test datasets, and joins. Feature store implementations from companies like Uber, Airbnb and Netflix are also mentioned. The document outlines the architecture of storing both online and offline feature groups and describes the evolution of the feature store API to better support concepts like feature versioning, multiple stores, complex joins and time travel. Use cases demonstrated include fraud detection in banking and modeling crop yields using joined weather and agricultural data.
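A minimal sketch of the join-centric usage the document describes, assuming two hypothetical offline feature groups registered as Spark tables rather than any particular feature store API:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("feature-store-sketch").getOrCreate()

# Hypothetical offline feature groups registered as tables.
weather = spark.table("fs.weather_features")      # keys: region, date
yields  = spark.table("fs.crop_yield_features")   # keys: region, date; label: yield_tons

# Join feature groups on their entity keys to assemble a training dataset.
training_df = (
    yields.join(weather, on=["region", "date"], how="left")
          .select("region", "date", "rainfall_mm", "avg_temp_c", "yield_tons")
)

training_df.write.mode("overwrite").parquet("/training_datasets/crop_yield_v1")
```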
Interactive Visualization of Streaming Data Powered by Spark by Ruhollah Farc... (Spark Summit)
This document discusses how to visualize streaming data using Spark. It describes how Spark Streaming can be used to process streaming data in real-time and integrate it with visualization tools. Key points include:
- Spark Streaming receives streaming data from sources like Kafka and processes it using in-memory computations in a single JVM cluster.
- The processed data can be stored in buffers like MongoDB or output to systems like MemSQL and Solr to enable interactive visualizations that update in real time.
- A demo is shown of Twitter data being streamed and analyzed using Spark Streaming with results stored in MemSQL and Solr for visualization.
- Benefits of this approach include being able to work with streaming data
Dynamic Partition Pruning in Apache Spark (Databricks)
In data analytics frameworks such as Spark it is important to detect and avoid scanning data that is irrelevant to the executed query, an optimization which is known as partition pruning. Dynamic partition pruning occurs when the optimizer is unable to identify at parse time the partitions it has to eliminate. In particular, we consider a star schema which consists of one or multiple fact tables referencing any number of dimension tables. In such join operations, we can prune the partitions the join reads from a fact table by identifying those partitions that result from filtering the dimension tables. In this talk we present a mechanism for performing dynamic partition pruning at runtime by reusing the dimension table broadcast results in hash joins and we show significant improvements for most TPC-DS queries.
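A small example of the star-schema pattern where dynamic partition pruning applies, with illustrative table names: the fact table is partitioned on the join key, the selective filter sits on the dimension table, so the partitions to scan are only known once the dimension side has been filtered at runtime.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dpp-sketch").getOrCreate()

# Available since Spark 3.0 and enabled by default.
spark.conf.set("spark.sql.optimizer.dynamicPartitionPruning.enabled", "true")

# Assume sales is a fact table partitioned by date_id and dim_date is a dimension table.
# The filter is on dim_date, so the qualifying date_id partitions of sales are only
# known after the dimension table has been filtered at runtime.
pruned = spark.sql("""
    SELECT d.year, SUM(s.amount) AS revenue
    FROM   sales s
    JOIN   dim_date d ON s.date_id = d.date_id
    WHERE  d.year = 2019
    GROUP  BY d.year
""")
pruned.explain()   # the physical plan should show a dynamic pruning expression on the fact-table scan
```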
Healthcare Claim Reimbursement using Apache Spark (Databricks)
The document discusses rewriting a claims reimbursement system using Spark. It describes how Spark provides better performance, scalability and cost savings compared to the previous Oracle-based system. Key points include using Spark for ETL to load data into a Delta Lake data lake, implementing the business logic in a reusable Java library, and seeing significant increases in processing volumes and speeds compared to the prior system. Challenges and tips for adoption are also provided.
Conquering the Lambda architecture in LinkedIn metrics platform with Apache C... (Khai Tran)
This document discusses LinkedIn's transition from an offline metrics platform to a near real-time "nearline" architecture using Apache Calcite and Apache Samza. It overviews LinkedIn's metrics platform and needs, and then details how the new nearline architecture works by translating Pig jobs into optimized Samza jobs using Calcite's relational algebra and query planning. An example production use case for analyzing storylines on the LinkedIn platform is also presented. The nearline architecture allows metrics to be computed with latencies of 5-30 minutes rather than 3-6 hours previously.
Spark Streaming and IoT by Mike Freedman (Spark Summit)
This document discusses using Spark Streaming for IoT applications and the challenges involved. It notes that while Spark simplifies programming across different processing intervals from batch to stream, programming models alone are not sufficient as IoT data streams can have varying rates and delays. It proposes a unified data infrastructure with abstractions like data series that support joining real-time and historical data while handling delays transparently. It also suggests approaches for Spark Streaming to better support processing many independent low-volume IoT streams concurrently and improving resource utilization for such applications. Finally, it introduces the Device-Model-Infra framework for addressing these IoT analytics challenges through combined programming models and data abstractions.
Proud to be Polyglot - Riviera Dev 2015 (Tugdual Grall)
The document discusses the benefits of using multiple programming languages and data stores, or a "polyglot" approach, for modern applications. A polyglot approach allows using the right tool for each task, rather than trying to force a single technology to fit all needs. This improves performance, scalability, and the ability to adapt applications to changing requirements compared to traditional monolithic architectures. The document provides examples of when to use different languages and data stores and concludes that a polyglot approach makes applications easier to maintain over time.
Keeping Identity Graphs In Sync With Apache Spark (Databricks)
The online advertising industry is based on identifying users with cookies, and showing relevant ads to interested users. But there are many data providers, many places to target ads and many people browsing online. How can we identify users across data providers? The first step in solving this is by cookie mapping: a chain of server calls that pass identifiers across providers. Sadly, chains break, servers break, providers can be flaky or use caching and you may never see the whole of the chain. The solution to this problem is constructing an identity graph with the data we see: in our case, cookie ids are nodes, edges are relations and connected components of the graph are users.
In this talk I will explain how Hybrid Theory leverages Spark and GraphFrames to construct and maintain a two-billion-node identity graph with minimal computational cost.
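The connected-components step can be sketched with GraphFrames roughly as follows; the cookie ids are made up and this is not Hybrid Theory's production code. GraphFrames expects vertex `id` and edge `src`/`dst` columns, and connectedComponents requires a checkpoint directory.

```python
from pyspark.sql import SparkSession
from graphframes import GraphFrame

spark = SparkSession.builder.appName("identity-graph-sketch").getOrCreate()
spark.sparkContext.setCheckpointDir("/tmp/graphframes-checkpoints")

# Cookie ids as vertices, observed mapping calls as edges (illustrative data).
vertices = spark.createDataFrame(
    [("cookieA",), ("cookieB",), ("cookieC",)], ["id"])
edges = spark.createDataFrame(
    [("cookieA", "cookieB"), ("cookieB", "cookieC")], ["src", "dst"])

g = GraphFrame(vertices, edges)

# Each connected component corresponds to one inferred user.
users = g.connectedComponents()
users.groupBy("component").count().show()
```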
Modern ETL Pipelines with Change Data Capture (Databricks)
In this talk we'll present how at GetYourGuide we've built from scratch a completely new ETL pipeline using Debezium, Kafka, Spark and Airflow, which can automatically handle schema changes. Our starting point was an error-prone legacy system that ran daily and was vulnerable to breaking schema changes, which caused many sleepless on-call nights. Like most companies, we also have traditional SQL databases that we need to connect to in order to extract relevant data.
This is usually done through either full or partial copies of the data with tools such as Sqoop. However, another approach that has become quite popular lately is to use Debezium as the Change Data Capture layer, which reads database binlogs and streams these changes directly to Kafka. As having data once a day is no longer enough for our business, and we wanted our pipelines to be resilient to upstream schema changes, we decided to rebuild our ETL around Debezium.
We'll walk the audience through the steps we followed to architect and develop such a solution using Databricks to reduce operation time. By building this new pipeline we are now able to refresh our data lake multiple times a day, giving our users fresh data and protecting our nights of sleep.
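As a rough sketch of the Spark side of such a pipeline, a Debezium topic can be consumed with Structured Streaming and the change events appended to a Delta table. The topic name, envelope schema and paths below are placeholders, and the real Debezium envelope carries more fields than shown here.

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StructType, StructField, StringType, LongType

spark = SparkSession.builder.appName("cdc-sketch").getOrCreate()

# Simplified stand-in for the Debezium change envelope of one source table.
after_schema = StructType([
    StructField("order_id", LongType()),     # hypothetical source columns
    StructField("status", StringType()),
])
payload_schema = StructType([
    StructField("op", StringType()),         # c = create, u = update, d = delete
    StructField("ts_ms", LongType()),
    StructField("after", after_schema),      # row state after the change
])

changes = (
    spark.readStream.format("kafka")
         .option("kafka.bootstrap.servers", "broker:9092")
         .option("subscribe", "dbserver1.inventory.orders")   # hypothetical topic
         .load()
         .select(F.from_json(F.col("value").cast("string"), payload_schema).alias("e"))
         .select("e.op", "e.ts_ms", "e.after")
)

(changes.writeStream
        .format("delta")
        .option("checkpointLocation", "/checkpoints/orders_cdc")
        .start("/lake/raw/orders_cdc"))
```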
Spark Summit EU talk by Stephan Kessler (Spark Summit)
This document summarizes a talk given by Stephan Kessler at the Spark Summit Europe 2016 about integrating business functionality and specialized engines into Apache Spark using SAP HANA Vora. Key points discussed include using currency conversion and time series query capabilities directly in Spark by pushing computations to the relevant data sources via Spark extensions. SAP HANA Vora allows moving parts of the Spark logical query plan to various data sources like HANA, graph and document stores to perform analysis close to the data.
Presentation given at the Hadoop User Group France meetup on January 14, 2016.
Real-time analytics with Riak and Spark, by Michael Carney (Basho) and Olivier Girardot of Lateral Thoughts.
According to a Salesforce report, the number of data sources analyzed by companies will grow by 83% over the next five years, so organizations now want to deliver insights in real time, even on mobile devices. Real-time processing is therefore the future of big data analytics.
This talk will present what's new in real-time analytics around the Riak database family and Spark.
Michael Carney is Basho's Sales Director for Southern Europe. A founder of MySQL France and of MariaDB, Michael joined Basho in January 2015 to explore the world of data without tables!
Olivier Girardot is the CTO of Lateral Thoughts; he is a developer and trainer on Spark, and a Java/Python specialist in the capital markets domain.
A Big Data Lake Based on Spark for BBVA Bank (Oscar Mendez, STRATIO; Spark Summit)
This document describes BBVA's implementation of a Big Data Lake using Apache Spark for log collection, storage, and analytics. It discusses:
1) Using Syslog-ng for log collection from over 2,000 applications and devices, distributing logs to Kafka.
2) Storing normalized logs in HDFS and performing analytics using Spark, with outputs to analytics, compliance, and indexing systems.
3) Choosing Spark because it allows interactive, batch, and stream processing with one system using RDDs, SQL, streaming, and machine learning.
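Point 3 can be illustrated with a small present-day sketch using the DataFrame API (the original talk predates much of it and used RDDs): the same transformation runs over logs already landed in HDFS and over logs arriving from Kafka. Paths, topic and log fields are hypothetical.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("one-engine-sketch").getOrCreate()

def errors_per_app(logs_df):
    # The same transformation works for batch and streaming DataFrames.
    return (logs_df.filter(F.col("level") == "ERROR")
                   .groupBy("app")
                   .count())

# Batch: analyze normalized logs already landed in HDFS (hypothetical path/schema).
batch_logs = spark.read.json("hdfs:///datalake/normalized_logs/")
errors_per_app(batch_logs).show()

# Streaming: the same logic over logs arriving from Kafka (hypothetical topic).
stream_logs = (
    spark.readStream.format("kafka")
         .option("kafka.bootstrap.servers", "broker:9092")
         .option("subscribe", "normalized-logs")
         .load()
         .select(F.get_json_object(F.col("value").cast("string"), "$.level").alias("level"),
                 F.get_json_object(F.col("value").cast("string"), "$.app").alias("app"))
)
(errors_per_app(stream_logs)
    .writeStream.outputMode("complete").format("console").start())
# Call awaitTermination() on the returned query to keep the stream running.
```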
Writing Continuous Applications with Structured Streaming PySpark API (Databricks)
"We're amidst the Big Data Zeitgeist era in which data comes at us fast, in myriad forms and formats at intermittent intervals or in a continuous stream, and we need to respond to streaming data immediately. This need has created a notion of writing a streaming application that’s continuous, reacts and interacts with data in real-time. We call this continuous application.
In this tutorial we'll explore the concepts and motivations behind the continuous application, how Structured Streaming Python APIs in Apache Spark™ enable writing continuous applications, examine the programming model behind Structured Streaming, and look at the APIs that support them.
Through presentation, code examples, and notebooks, I will demonstrate how to write an end-to-end Structured Streaming application that reacts and interacts with both real-time and historical data to perform advanced analytics using Spark SQL, DataFrames and Datasets APIs.
You’ll walk away with an understanding of what’s a continuous application, appreciate the easy-to-use Structured Streaming APIs, and why Structured Streaming in Apache Spark is a step forward in developing new kinds of streaming applications.
This tutorial will be both instructor-led and hands-on interactive session. Instructions in how to get tutorial materials will be covered in class.
WHAT YOU’LL LEARN:
– Understand the concepts and motivations behind Structured Streaming
– How to use DataFrame APIs
– How to use Spark SQL and create tables on streaming data
– How to write a simple end-to-end continuous application
PREREQUISITES
– A fully-charged laptop (8-16GB memory) with Chrome or Firefox
– Pre-register for Databricks Community Edition"
Speaker: Jules Damji
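As a flavor of what the tutorial builds, a minimal continuous application in PySpark can look like the sketch below: a streaming word count whose results are registered as an in-memory table and queried with Spark SQL. The socket source, port and query name are placeholders, not the tutorial's actual notebook code.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("continuous-app-sketch").getOrCreate()

# Read lines from a socket (placeholder source; Kafka works the same way).
lines = (spark.readStream.format("socket")
              .option("host", "localhost")
              .option("port", 9999)
              .load())

# Split lines into words and count them as new data arrives.
words = lines.select(F.explode(F.split(lines.value, " ")).alias("word"))
counts = words.groupBy("word").count()

# Register the continuously updated result as an in-memory table.
query = (counts.writeStream
               .outputMode("complete")
               .format("memory")
               .queryName("word_counts")
               .start())

# The streaming state is now queryable with plain Spark SQL.
spark.sql("SELECT * FROM word_counts ORDER BY count DESC").show()

query.awaitTermination()
```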
Introduction to Apache Kafka, Confluent and why they matter (Paolo Castagna)
This is a short and introductory presentation on Apache Kafka (including the Kafka Connect APIs and Kafka Streams APIs, both part of Apache Kafka) and other open source components that are part of the Confluent platform (such as KSQL).
This was the first Kafka Meetup in South Africa.
Unify Analytics: Combine Strengths of Data Lake and Data Warehouse (Paige Roberts)
ODSC West presentation, October 2020: technical and spiritual unification of BI and data science teams will benefit businesses powerfully. The evolution of data architectures is making that possible.
Data Con LA 2020
Description
The data warehouse has been an analytics workhorse for decades. Unprecedented volumes of data, new types of data, and the need for advanced analyses like machine learning brought on the age of the data lake. But Hadoop by itself doesn't really live up to the hype. Now, many companies have a data lake, a data warehouse, or a mishmash of both, possibly combined with a mandate to go to the cloud. The end result can be a sprawling mess, a lot of duplicated effort, a lot of missed opportunities, a lot of projects that never made it into production, and a lot of financial investment without return. Technical and spiritual unification of the two opposed camps can make a powerful impact on the effectiveness of analytics for the business overall. Over time, different organizations with massive IoT workloads have found practical ways to bridge the artificial gap between these two data management strategies. Look under the hood at how companies have gotten IoT ML projects working, and how their data architectures have changed over time. Learn about new architectures that successfully supply the needs of both business analysts and data scientists. Get a peek at the future. In this area, no one likes surprises.
* Look at successful data architectures from companies like Philips, Anritsu, and Uber
* Learn to eliminate duplication of effort between data science and BI data engineering teams
* Avoid some of the traps that have caused so many big data analytics implementations to fail
* Get AI and ML projects into production where they have real impact, without bogging down essential BI
* Study analytics architectures that work, why and how they work, and where they're going from here
Speaker
Paige Roberts, Vertica, Open Source Relations Manager
This document discusses Infobip's journey towards enabling real-time querying of aggregated data. Initially, Infobip had a monolithic architecture with a single database that became a bottleneck. They introduced multiple databases and microservices but querying spanned databases and results had to be joined. A data warehouse (GREEN) provided reporting but was not real-time. To enable real-time queries, Infobip implemented a lambda architecture using Kafka as the real-time data pipeline and Druid for real-time querying and aggregations, achieving sub-second responses and less than 2 seconds of data delay. This allows real-time insights from ingested messaging data while GREEN remains the batch/serving layer.
This document describes Hopsworks, an end-to-end data platform for analytics and machine learning built by KTH and RISE SICS. It provides data ingestion, preparation, experimentation, model training, and deployment capabilities. The platform is built on Apache technologies like Apache Beam, Spark, Flink, Kafka, and uses Kubernetes for orchestration. It also includes a feature store for ML features. The document then discusses Apache Flink and its use for stream processing applications. It provides examples of using Flink's APIs like SQL, CEP, and machine learning. Finally, it introduces the concept of continuous deep analytics and the Arcon framework for unified analytics across streams, tensors, graphs and more through an intermediate
A Hudi Live Event: Shaping a Database Experience within the Data Lake with Apache Hudi (nadine39280)
Discover the evolution of Apache Hudi within the open-source realm - a community and project pushing the boundaries of data lake possibilities. This presentation delves into Apache Hudi 1.0, a pivotal release reimagining its transactional database layer while honoring its foundational principles. Join us in this transformative journey!
Join the Apache Hudi Community
https://join.slack.com/t/apache-hudi/shared_invite/zt-20r833rxh-627NWYDUyR8jRtMa2mZ~gg.
Follow us on LinkedIn and Twitter
https://www.linkedin.com/company/apache-hudi/
https://twitter.com/apachehudi
The document outlines the roadmap for SQL Server, including enhancements to performance, security, availability, development tools, and big data capabilities. Key updates include improved intelligent query processing, confidential computing with secure enclaves, high availability options on Kubernetes, machine learning services, and tools in Azure Data Studio. The roadmap aims to make SQL Server the most secure, high performing, and intelligent data platform across on-premises, private cloud and public cloud environments.
Along with the arrival of BigData, a parallel yet less well known but significant change to the way we process data has occurred. Data is getting faster! Business models are changing radically based on the ability to be first to know insights and act appropriately to keep the customer, prevent the breakdown or save the patient. In essence, knowing something now is overriding knowing everything later. Stream processing engines allow us to blend event streams from different internal and external sources to gain insights in real time. This talk will discuss the need for streaming, business models it can change, new applications it allows and why Apache Flink enables these applications. Apache Flink is a top Level Apache Project for real time stream processing at scale. It is a high throughput, low latency, fault tolerant, distributed, state based stream processing engine. Flink has associated Polyglot APIs (Scala, Python, Java) for manipulating streams, a Complex Event Processor for monitoring and alerting on the streams and integration points with other big data ecosystem tooling.
SnappyData is a new open source project started by Pivotal GemFire founders to build a unified cluster capable of OLTP, OLAP, and streaming analytics using Spark. SnappyData fuses an elastic, highly available in-memory store for OLTP with Spark's memory manager and query engine to provide a single system for mixed workloads with fast ingestion, high concurrency and the ability to work with live, mutable data.
- Project Lightspeed is the next generation of Apache Spark Structured Streaming that aims to provide faster and simpler stream processing with predictable low latency.
- It targets reducing tail latency by up to 2x through faster bookkeeping and offset management. It also enhances functionality with advanced capabilities like new operators and easy to use APIs.
- Project Lightspeed also aims to simplify deployment, operations, monitoring and troubleshooting of streaming applications. It seeks to improve ecosystem support for connectors, authentication and authorization.
- Some specific improvements include faster micro-batch processing, enhancing Python as a first class citizen, and making debugging of streaming jobs easier through visualizations.
An XStreams adapter allows Oracle CEP to ingest and process events from an XStreams data stream. The adapter plugs into an event processing network to enable CEP queries over the streaming data. It can be used for applications like algorithmic trading, telecommunications monitoring, and RFID tracking where low latency event processing is important. A demo uses the adapter to run the Linear Road benchmark over sensor data from a simulated variable toll expressway system.
An Architect's guide to real time big data systems (Raja SP)
Introduction to real-time big data and stream computing using InfoSphere Streams and Apache Storm. Presented at a Big Data conference in Singapore, July 2014.
MongoDB World 2019: Streaming ETL on the Shoulders of Giants (MongoDB)
This document discusses streaming ETL using Apache Kafka and MongoDB as a modern data platform. It provides an overview of streaming data and how it can help with speed and agility compared to traditional batch ETL processes. It then discusses how Apache Kafka acts as a streaming platform and messaging system that can be used to build streaming data applications and integrate data from various sources using Kafka Connect. The document announces the availability of the MongoDB connector for Kafka Connect, which allows streaming data between Kafka and MongoDB collections. It concludes with a demo scenario showing how this could work in practice.
The advantages of Arista/OVH configurations, and the technologies behind buil... (OVHcloud)
Arista will put an emphasis on the technologies behind building and operating datacentres, and the reasons they deliver the expected results (handling varied traffic spikes, increasing bandwidth, endpoints and security), including very large-scale production environments.
Lightbend Fast Data Platform - A Technical Overview
Dean Wampler, O’Reilly author and Big Data Strategist in the office of the CTO at Lightbend discusses practical tips for architecting stream-processing applications and explains how you can tame some of the complexity in moving from data at rest to data in motion.
Keynote of HadoopCon 2014 Taiwan:
* Data analytics platform architecture & designs
* Lambda architecture overview
* Using SQL as DSL for stream processing
* Lambda architecture using SQL
Spark Seattle meetup - Breaking ETL barrier with Spark Streaming
1. Breaking ETL barrier with real-time reporting using Kafka and Spark Streaming. Santosh Sahoo, Architect at Concur.
2. About us: Concur (now part of SAP) provides travel and expense management services to businesses. The Data Insights team is building solutions to provide customer access to data, visualization and reporting.
8. Spark Streaming. What? A data processing framework to build scalable, fault-tolerant streaming applications. Why? It lets you reuse the same code for batch processing, join streams against historical data, or run ad-hoc queries on stream state.
10. Kafka - Flow Management:
* No-nonsense logging
* 100K/s throughput vs 20K/s for RabbitMQ
* Log compaction
* Durable persistence
* Partition tolerance
* Replication
* Best-in-class integration with Spark
12. Optimized Direct Kafka API
https://databricks.com/blog/2015/03/30/improvements-to-kafka-integration-of-spark-streaming.html
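For reference, the direct (receiver-less) Kafka integration discussed in that post looked roughly like this in the DStream-era PySpark API (Spark 1.x/2.x, with the spark-streaming-kafka package on the classpath; it has since been superseded by Structured Streaming's Kafka source). The broker address and topic name are placeholders.

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

sc = SparkContext(appName="direct-kafka-sketch")
ssc = StreamingContext(sc, batchDuration=5)   # 5-second micro-batches

# Direct approach: no receivers, offsets tracked by Spark itself,
# one Kafka partition maps to one RDD partition.
stream = KafkaUtils.createDirectStream(
    ssc,
    topics=["expense_reports"],                          # hypothetical topic
    kafkaParams={"metadata.broker.list": "broker:9092"},
)

stream.map(lambda kv: kv[1]).count().pprint()            # message values only

ssc.start()
ssc.awaitTermination()
```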