Here is my talk at Scala by the Bay 2016, Building a High-Performance Database with Scala, Akka, and Spark. Covers integration of Akka and Spark, when to use actors and futures, back pressure, reactive monitoring with Kamon, and more.
This document discusses Spark Job Server, an open source project that allows Spark jobs to be submitted and run via a REST API. It provides features like job monitoring, context sharing between jobs to reuse cached data, and asynchronous APIs. The document outlines motivations for the project, how to use it including submitting and monitoring jobs, and future plans like high availability and hot failover support.
Spark Streaming can be used to process streaming data from Kafka in real-time. There are two main approaches - the receiver-based approach where Spark receives data from Kafka receivers, and the direct approach where Spark directly reads data from Kafka. The document discusses using Spark Streaming to process tens of millions of transactions per minute from Kafka for an ad exchange system. It describes architectures where Spark Streaming is used to perform real-time aggregations and update databases, as well as save raw data to object storage for analytics and recovery. Stateful processing with mapWithState transformations is also demonstrated to update Cassandra in real-time.
An over-ambitious introduction to Spark programming, test and deployment. This slide tries to cover most core technologies and design patterns used in SpookyStuff, the fastest query engine for data collection/mashup from the deep web.
For more information please follow: https://github.com/tribbloid/spookystuff
A bug in PowerPoint used to prevent the transparent background color from rendering properly; this has been fixed in a recent upload.
Solving low latency query over big data with Spark SQL (Julien Pierre)
This document provides an overview of client data, capabilities, and architecture for a data analytics platform. It discusses data size and query latency, processing and storage using Cosmos, SparkSQL and HDFS, a Mesos cluster architecture with Zookeeper, and interactive analytics using Zeppelin and Avocado notebooks. The platform aims to provide a unified environment for data ingestion, transformation, storage, processing and analytics to enable intelligent data products and experiences.
700 Updatable Queries Per Second: Spark as a Real-Time Web Service (Evan Chan)
700 Updatable Queries Per Second: Spark as a Real-Time Web Service. Find out how to use Apache Spark with FiloDb for low-latency queries - something you never thought possible with Spark. Scale it down, not just scale it up!
Apache Spark Streaming: Architecture and Fault Tolerance (Sachin Aggarwal)
Agenda:
• Spark Streaming Architecture
• How different is Spark Streaming from other streaming applications
• Fault Tolerance
• Code Walk through & demo
• We will supplement theory concepts with sufficient examples
Speakers :
Paranth Thiruvengadam (Architect (STSM), Analytics Platform at IBM Labs)
Profile : https://in.linkedin.com/in/paranth-thiruvengadam-2567719
Sachin Aggarwal (Developer, Analytics Platform at IBM Labs)
Profile : https://in.linkedin.com/in/nitksachinaggarwal
Github Link: https://github.com/agsachin/spark-meetup
This document outlines a project to capture user location data and send it to a database for real-time analysis using Kafka and Spark streaming. It describes starting Zookeeper and Kafka servers, creating Kafka topics, producing and consuming messages with Java producers and consumers, using the Spark CLI, integrating Kafka and Spark for streaming, creating DataFrames and SQL queries, and saving data to PostgreSQL tables for further processing and analysis. The goal is to demonstrate real-time data streaming and analytics on user location data.
Top 5 Mistakes to Avoid When Writing Apache Spark Applications (Cloudera, Inc.)
The document discusses 5 common mistakes people make when writing Spark applications:
1) Not properly sizing executors for memory and cores.
2) Having shuffle blocks larger than 2GB which can cause jobs to fail.
3) Not addressing data skew which can cause joins and shuffles to be very slow.
4) Not properly managing the DAG to minimize shuffles and stages.
5) Classpath conflicts from mismatched dependencies causing errors.
Your data is getting bigger while your boss is getting anxious to have insights! This tutorial covers Apache Spark that makes data analytics fast to write and fast to run. Tackle big datasets quickly through a simple API in Python, and learn one programming paradigm in order to deploy interactive, batch, and streaming applications while connecting to data sources incl. HDFS, Hive, JSON, and S3.
Spark supports four cluster managers: Local, Standalone, YARN, and Mesos. YARN is highly recommended for production use. When running Spark on YARN, careful tuning of configuration settings like the number of executors, executor memory and cores, and dynamic allocation is important to optimize performance and resource utilization. Configuring queues also allows separating different applications by priority and resource needs.
The document discusses Akka 2.4 and commercial features available through the Reactive Platform. Key points include: Akka 2.4 requires Java 8 but provides backwards compatibility; Cluster Tools, Persistence, and Distributed PubSub are now stable features; Persistence allows cross-Scala version snapshot compatibility; a Split Brain Resolver is available in beta for cluster failure scenarios; and extended Java 6 support is provided through the Reactive Platform.
User Defined Functions is an important feature of Spark SQL which helps extend the language by adding custom constructs. UDFs are very useful for extending spark vocabulary but come with significant performance overhead. These are black boxes for Spark optimizer, blocking several helpful optimizations like WholeStageCodegen, Null optimization etc. They also come with a heavy processing cost associated with String functions requiring UTF-8 to UTF-16 conversions which slows down spark jobs and increases memory requirements. In this talk, we will go over how at Informatica we optimized UDFs to be as performant as Spark native functions both in terms of time and memory and allow these functions to participate in spark optimization steps.
Robust and Scalable ETL over Cloud Storage with Apache Spark (Databricks)
The majority of reported Spark deployments are now in the cloud. In such an environment, it is preferable for Spark to access data directly from services such as Amazon S3, thereby decoupling storage and compute. However, there are limitations to object stores such as S3. Chained or concurrent ETL jobs often run into issues on S3 due to inconsistent file listings and the lack of atomic rename support. Metadata performance also becomes an issue when running jobs over many thousands to millions of files.
Speaker: Eric Liang
This talk was originally presented at Spark Summit East 2017.
Top 5 mistakes when writing Spark applications (hadooparchbook)
This document discusses common mistakes people make when writing Spark applications and provides recommendations to address them. It covers issues related to executor configuration, application failures due to shuffle block sizes exceeding limits, slow jobs caused by data skew, and managing the DAG to avoid excessive shuffles and stages. Recommendations include using smaller executors, increasing the number of partitions, addressing skew through techniques like salting, and preferring ReduceByKey over GroupByKey and TreeReduce over Reduce to improve performance and resource usage.
FiloDB - Breakthrough OLAP Performance with Cassandra and Spark (Evan Chan)
You want to ingest event, time-series, streaming data easily, yet have flexible, fast ad-hoc queries. Is this even possible? Yes! Find out how in this talk about combining Apache Cassandra and Apache Spark using a new open-source database, FiloDB.
Beneath RDD in Apache Spark by Jacek Laskowski (Spark Summit)
This document provides an overview of SparkContext and Resilient Distributed Datasets (RDDs) in Apache Spark. It discusses how to create RDDs using SparkContext functions like parallelize(), range(), and textFile(). It also covers DataFrames and converting between RDDs and DataFrames. The document discusses partitions and the level of parallelism in Spark, as well as the execution environment involving DAGScheduler, TaskScheduler, and SchedulerBackend. It provides examples of RDD lineage and describes Spark clusters like Spark Standalone and the Spark web UI.
Introduction to Spark Streaming & Apache Kafka | Big Data Hadoop Spark Tutori... (CloudxLab)
Big Data with Hadoop & Spark Training: http://bit.ly/2L6bZbn
This CloudxLab Introduction to Spark Streaming & Apache Kafka tutorial helps you to understand Spark Streaming and Kafka in detail. Below are the topics covered in this tutorial:
1) Spark Streaming - Workflow
2) Use Cases - E-commerce, Real-time Sentiment Analysis & Real-time Fraud Detection
3) Spark Streaming - DStream
4) Word Count Hands-on using Spark Streaming
5) Spark Streaming - Running Locally Vs Running on Cluster
6) Introduction to Apache Kafka
7) Apache Kafka Hands-on on CloudxLab
8) Integrating Spark Streaming & Kafka
9) Spark Streaming & Kafka Hands-on
Spark Streaming & Kafka - The Future of Stream Processing (Jack Gudenkauf)
Hari Shreedharan/Cloudera @Playtika. With its easy-to-use interfaces and native integration with some of the most popular ingest tools, such as Kafka, Flume, Kinesis, etc., Spark Streaming has become the go-to tool for stream processing. Code sharing with Spark also makes it attractive. In this talk, we will discuss the latest features in Spark Streaming and how it integrates with Kafka natively with no data loss, and even do exactly-once processing!
Building production spark streaming applications (Joey Echeverria)
Designing, implementing, and testing an Apache Spark Streaming application is necessary to deploy to production but is not sufficient for long term management and monitoring. Simply learning the Spark Streaming APIs only gets you part of the way there. In this talk, I’ll be focusing on everything that happens after you’ve implemented your application in the context of a real-time alerting system for IT operational data.
Cassandra Day 2014: Interactive Analytics with Cassandra and Spark (Evan Chan)
This document discusses using Spark and Cassandra together for interactive analytics. It describes how Evan Chan uses both technologies at Ooyala to solve the problem of generating analytics from raw data in Cassandra in a flexible and fast way. It outlines their architecture of using Spark to generate materialized views from Cassandra data and then powering queries with those cached views for low latency queries.
Tracing the Breadcrumbs: Apache Spark Workload Diagnostics (Databricks)
Have you ever hit mysterious random process hangs, performance regressions, or OOM errors that leave barely any useful traces, yet hard or expensive to reproduce? No matter how tricky the bugs are, they always leave some breadcrumbs along the way.
This document provides an overview and deep dive into Robinhood's RDS Data Lake architecture for ingesting data from their RDS databases into an S3 data lake. It discusses their prior daily snapshotting approach, and how they implemented a faster change data capture pipeline using Debezium to capture database changes and ingest them incrementally into a Hudi data lake. It also covers lessons learned around change data capture setup and configuration, initial table bootstrapping, data serialization formats, and scaling the ingestion process. Future work areas discussed include orchestrating thousands of pipelines and improving downstream query performance.
This presentation will be useful to those who would like to get acquainted with Apache Spark's architecture and top features, and see some of them in action, e.g. RDD transformations and actions, Spark SQL, etc. It also covers real-life use cases related to one of our commercial projects and recalls the roadmap of how we've integrated Apache Spark into it.
Was presented on Morning@Lohika tech talks in Lviv.
Design by Yarko Filevych: http://www.filevych.com/
Our product uses third generation Big Data technologies and Spark Structured Streaming to enable comprehensive Digital Transformation. It provides a unified streaming API that allows for continuous processing, interactive queries, joins with static data, continuous aggregations, stateful operations, and low latency. The presentation introduces Spark Structured Streaming's basic concepts including loading from stream sources like Kafka, writing to sinks, triggers, SQL integration, and mixing streaming with batch processing. It also covers continuous aggregations with windows, stateful operations with checkpointing, reading from and writing to Kafka, and benchmarks compared to other streaming frameworks.
Spark Streaming with Kafka allows processing streaming data from Kafka in real-time. There are two main approaches - receiver-based and direct. The receiver-based approach uses Spark receivers to read data from Kafka and write to write-ahead logs for fault tolerance. The direct approach reads Kafka offsets directly without a receiver for better performance but less fault tolerance. The document discusses using Spark Streaming to aggregate streaming data from Kafka in real-time, persisting aggregates to Cassandra and raw data to S3 for analysis. It also covers using stateful transformations to update Cassandra in real-time.
Apache Kafka® is the technology behind event streaming which is fast becoming the central nervous system of flexible, scalable, modern data architectures. Customers want to connect their databases, data warehouses, applications, microservices and more, to power the event streaming platform. To connect to Apache Kafka, you need a connector!
This online talk dives into the new Verified Integrations Program and the integration requirements, the Connect API and sources and sinks that use Kafka Connect. We cover the verification steps and provide code samples created by popular application and database companies. We will discuss the resources available to support you through the connector development process.
This is Part 2 of 2 in Building Kafka Connectors - The Why and How
This document summarizes a presentation about productionizing streaming jobs with Spark Streaming. It discusses:
1. The lifecycle of a Spark streaming application including how data is received in batches and processed through transformations.
2. Best practices for aggregations including reducing over windows, incremental aggregation, and checkpointing.
3. How to achieve high throughput by increasing parallelism through more receivers and partitions.
4. Tips for debugging streaming jobs using the Spark UI and ensuring processing time is less than the batch interval.
Spring Day | Spring and Scala | Eberhard Wolff (JAX London)
2011-10-31 | 09:45 AM - 10:30 AM
Spring is widely used in the Java world - but does it make any sense to combine it with Scala? This talk gives an answer and shows how and why Spring is useful in the Scala world. All areas of Spring such as Dependency Injection, Aspect-Oriented Programming and the Portable Service Abstraction as well as Spring MVC are covered.
This document summarizes a presentation about using Scala with the Spring framework. It discusses how Spring's core features like dependency injection, aspect oriented programming, and service abstraction can be used with Scala. It provides examples of implementing dependency injection with both XML configuration and annotations. It also discusses how to handle callbacks when using Spring's service abstraction in Scala. Some potential issues and areas for improvement are identified, such as better support for Scala collections and implicit conversions in Spring configuration.
Streams Don't Fail Me Now - Robustness Features in Kafka Streams (HostedbyConfluent)
"Stream processing applications can experience downtime due to a variety of reasons, such as a Kafka broker or another part of the infrastructure breaking down, an unexpected record (known as a poison pill) that causes the processing logic to get stuck, or a poorly performed upgrade of the application that yields unintended consequences.
Apache Kafka's native stream processing solution, Kafka Streams, has been successfully used with little or no downtime in many companies. This has been made possible by several robustness features built into Streams over the years and best practices that have evolved from many years of experience with production-level workloads.
In this talk, I will present the unique solutions the community has found for making Streams robust, explain how to apply them to your workloads and discuss the remaining challenges. Specifically, I will talk about standby tasks and rack-aware assignments that can help with losing a single node or a whole data center. I will also demonstrate how custom exception handlers and dead letter queues can make a pipeline more resistant to bad data. Finally, I will discuss options to evolve stream topologies safely."
Tapad's data pipeline is an elastic combination of technologies (Kafka, Hadoop, Avro, Scalding) that forms a reliable system for analytics, realtime and batch graph-building, and logging. In this talk, I will speak about the creation and evolution of the pipeline, and a concrete example – a day in the life of an event tracking pixel. We'll also talk about common challenges that we've overcome such as integrating different pieces of the system, schema evolution, queuing, and data retention policies.
Lessons Learned From PayPal: Implementing Back-Pressure With Akka Streams And... (Lightbend)
Akka Streams and its amazing handling of streaming with back-pressure should be no surprise to anyone. But it takes a couple of use cases to really see it in action - especially in use cases where the amount of work continues to increase as you’re processing it. This is where back-pressure really shines.
In this talk for Architects and Dev Managers by Akara Sucharitakul, Principal MTS for Global Platform Frameworks at PayPal, Inc., we look at how back-pressure based on Akka Streams and Kafka is being used at PayPal to handle very bursty workloads.
In addition, Akara will also share experiences in creating a platform based on Akka and Akka Streams that currently processes over 1 billion transactions per day (on just 8 VMs), with the aim of helping teams adopt these technologies. In this webinar, you will:
*Start with a sample web crawler use case to examine what happens when each processing pass expands to a larger and larger workload to process.
*Review how we use the buffering capabilities in Kafka and the back-pressure with asynchronous processing in Akka Streams to handle such bursts.
*Look at lessons learned, plus some constructive “rants” about the architectural components, the maturity, or immaturity you’ll expect, and tidbits and open source goodies like memory-mapped stream buffers that can be helpful in other Akka Streams and/or Kafka use cases.
From Query Plan to Query Performance: Supercharging your Apache Spark Queries... (Databricks)
The SQL tab in the Spark UI provides a lot of information for analysing your spark queries, ranging from the query plan, to all associated statistics. However, many new Spark practitioners get overwhelmed by the information presented, and have trouble using it to their benefit. In this talk we want to give a gentle introduction to how to read this SQL tab. We will first go over all the common spark operations, such as scans, projects, filter, aggregations and joins; and how they relate to the Spark code written. In the second part of the talk we will show how to read the associated statistics to pinpoint performance bottlenecks.
This document discusses the process of rebalancing in Voldemort. It begins by outlining the high-level steps taken, including getting the current and target cluster states, planning partition movements in batches, changing cluster metadata and rebalancing states, migrating data with redundancy checks, and rolling back changes if failures occur. Key aspects like maintaining consistency through proxying requests and handling failure scenarios are also summarized.
Scaling web applications with cassandra presentation (Murat Çakal)
This document provides an introduction and overview of Cassandra, including:
- Cassandra is a distributed database modeled after Amazon Dynamo and Google Bigtable that is highly scalable and fault tolerant.
- It is used by many large companies for applications that require fast writes, high availability, and elastic scalability.
- Cassandra's data model uses a column-oriented design organized into keyspaces, column families, rows, and columns. It also supports super columns.
- The document discusses Cassandra's features like tunable consistency levels, replication, and its data distribution using consistent hashing.
- An overview of Cassandra's Thrift API and basic operations like get, batch mutate, and
The Pushdown of Everything by Stephan Kessler and Santiago Mola (Spark Summit)
Stephan Kessler and Santiago Mola presented SAP HANA Vora, which extends Spark SQL's data sources API to allow "pushing down" more of a SQL query's logical plan to the data source for execution. This "Pushdown of Everything" approach leverages data sources' capabilities to process less data and optimize query execution. They described how data sources can implement interfaces like TableScan, PrunedScan, and the new CatalystSource interface to support pushing down projections, filters, and more complex queries respectively. While this approach has advantages in performance, challenges include the complexity of implementing CatalystSource and ensuring compatibility across Spark versions. Future work aims to improve the API and provide utilities to simplify implementation.
Manchester Hadoop Meetup: Cassandra Spark internals (Christopher Batey)
This document summarizes how the Spark Cassandra Connector works to read and write data between Spark and Cassandra in a distributed manner. It discusses how the connector partitions Spark RDDs based on Cassandra token ranges and nodes, retrieves data from Cassandra in batches using CQL, and writes data back to Cassandra in batches grouped by partition key. Key classes and configuration parameters that control this distributed processing are also outlined.
Kafka Connect is used to build data pipelines by integrating Kafka with other data systems. It uses plugins called connectors and transformations. Transformations allow modifying data going from Kafka to Elasticsearch. Single message transformations apply to individual messages while Kafka Streams is better for more complex transformations involving multiple messages. When using Kafka Connect to sink data to Elasticsearch, best practices include managing indices by day, removing unnecessary fields, and not overwriting the _id field. Custom transformations can be implemented if needed. The ordering of transformations matters as they are chained.
Deep Dive of ADBMS Migration to Apache Spark—Use Cases Sharing (Databricks)
eBay has been using an enterprise ADBMS for over a decade, and our team worked on migrating batch workloads from ADBMS to Spark in 2018. We gathered many experiences and lessons during the migration journey (85% automated + 15% manual migration), during which we exposed many unexpected issues and gaps between ADBMS and Spark SQL, made a lot of decisions to fill those gaps in practice, and contributed many fixes to Spark core in order to unblock ourselves. This should be an interesting and helpful session for many folks, especially data and software engineers planning and executing their own migration work. During this session we will share many of the specific issues we encountered and how we resolved or worked around them with the team during the real migration process.
Staying Ahead of the Curve with Spring and Cassandra 4 (SpringOne 2020) (Alexandre Dutra)
Spring and Cassandra are two of the leading technologies for building cloud native applications. In this talk by the project leads for Spring Data and the Cassandra Java Driver, we’ll cover the recent improvements in the latest and greatest versions of Spring Boot, Spring Data Cassandra, Cassandra 4.0 and the Cassandra Java driver. Whether you’re a novice, intermediate, or expert developer, this content will help you get started or migrate your existing application to the latest innovations. We’ll illustrate these new concepts with code samples and snippets that you can find on GitHub to help you get things done faster with these tools.
Staying Ahead of the Curve with Spring and Cassandra 4 (VMware Tanzu)
This document discusses updates to Apache Cassandra, the Cassandra Java driver, Spring Data Cassandra, and Spring Boot for Cassandra integration. Some key highlights include Apache Cassandra 4.0 adding features like zero copy streaming and improved repair, Cassandra driver 4.0 being asynchronous and non-blocking, Spring Data Cassandra 3.0 upgrading dependencies and adding support for embedded objects, and Spring Boot 2.3 requiring configuration under the spring.data.cassandra prefix. The document provides guidance on upgrading dependencies and configurations between the different versions.
KSQL is a stream processing SQL engine, which allows stream processing on top of Apache Kafka. KSQL is based on Kafka Stream and provides capabilities for consuming messages from Kafka, analysing these messages in near-realtime with a SQL like language and produce results again to a Kafka topic. By that, no single line of Java code has to be written and you can reuse your SQL knowhow. This lowers the bar for starting with stream processing significantly.
KSQL offers powerful capabilities of stream processing, such as joins, aggregations, time windows and support for event time. In this talk I will present how KSQL integrates with the Kafka ecosystem and demonstrate how easy it is to implement a solution using KSQL for most part. This will be done in a live demo on a fictitious IoT sample.
Building Kafka Connectors with Kotlin: A Step-by-Step Guide to Creation and D... (HostedbyConfluent)
"Kafka Connect, the framework for building scalable and reliable data pipelines, has gained immense popularity in the data engineering landscape. This session will provide a comprehensive guide to creating Kafka connectors using Kotlin, a language known for its conciseness and expressiveness.
In this session, we will explore a step-by-step approach to crafting Kafka connectors with Kotlin, from inception to deployment using an simple use case. The process includes the following key aspects:
Understanding Kafka Connect: We'll start with an overview of Kafka Connect and its architecture, emphasizing its importance in real-time data integration and streaming.
Connector Design: Delve into the design principles that govern connector creation. Learn how to choose between source and sink connectors and identify the data format that suits your use case.
Building a Source Connector: We'll start with building a Kafka source connector, exploring key considerations, such as data transformations, serialization, deserialization, error handling and delivery guarantees. You will see how Kotlin's concise syntax and type safety can simplify the implementation.
Testing: Learn how to rigorously test your connector to ensure its reliability and robustness, utilizing best practices for testing in Kotlin.
Connector Deployment: go through the process of deploying your connector in a Kafka Connect cluster, and discuss strategies for monitoring and scaling.
Real-World Use Cases: Explore real-world examples of Kafka connectors built with Kotlin.
By the end of this session, you will have a solid foundation for creating and deploying Kafka connectors using Kotlin, equipped with practical knowledge and insights to make your data integration processes more efficient and reliable. Whether you are a seasoned developer or new to Kafka Connect, this guide will help you harness the power of Kafka and Kotlin for seamless data flow in your applications."
Slides from my talk at MinneAnalytics 2024 - June 7, 2024
https://datatech2024.sched.com/event/1eO0m/time-state-analytics-a-new-paradigm
Across many domains, we see a growing need for complex analytics to track precise metrics at Internet scale to detect issues, identify mitigations, and analyze patterns. Think about delays in airlines (Logistics), food delivery tracking (Apps), detect fraudulent transactions (Fintech), flagging computers for intrusion (Cybersecurity), device health (IoT), and many more.
For instance, at Conviva, our customers want to analyze the buffering that users on some types of devices suffer, when using a specific CDN.
We refer to such problems as Multidimensional Time-State Analytics. Time-State here refers to the stateful context-sensitive analysis over event streams needed to capture metrics of interest, in contrast to simple aggregations. Multidimensional refers to the need to run ad hoc queries to drill down into subpopulations of interest. Furthermore, we need both real-time streaming and offline retrospective analysis capabilities.
In this talk, we will share our experiences to explain why state-of-art systems offer poor abstractions to tackle such workloads and why they suffer from poor cost-performance tradeoffs and significant complexity.
We will also describe Conviva’s architectural and algorithmic efforts to tackle these challenges. We present early evidence on how raising the level of abstraction can reduce developer effort, bugs, and cloud costs by (up to) an order of magnitude, and offer a unified framework to support both streaming and retrospective analysis. We will also discuss how our ideas can be plugged into existing pipelines and how our new "visual" abstraction can democratize analytics across many domains and to non-programmers.
Porting a Streaming Pipeline from Scala to Rust (Evan Chan)
How we at Conviva ported a streaming data pipeline in months from Scala to Rust. What are the important human and technical factors in our port, and what did we learn?
Designing Stateful Apps for Cloud and Kubernetes (Evan Chan)
Almost all applications have some kind of state. Some data processing apps and databases have huge amounts of state. How do we navigate a cloud-based world of containers where stateless and functions-as-a-service is all the rage? As a long-time architect, designer, and developer of very stateful apps (databases and data processing apps), I’d like to take you on a journey through the modern cloud world and Kubernetes, offering helpful design patterns, considerations, tips, and where things are going. How is Kubernetes shaking up stateful app design?
Slides for my talk at Monitorama PDX 2019. Histograms have the potential to give us tools to meet SLO/SLAs, quantile measurements, and very rich heatmap displays for debugging. Their promise has not been fulfilled by TSDB backends however. This talk talks about the concept of histograms as first class citizens in storage. What does accuracy mean for histograms? How can we store and compress rich histograms for evaluation and querying at massive scale? How can we fix some of the issues with histograms in Prometheus, such as proper aggregation, bucketing, avoiding clipping, etc.?
FiloDB: Reactive, Real-Time, In-Memory Time Series at Scale (Evan Chan)
My keynote presentation about how we developed FiloDB, a distributed, Prometheus-compatible time series database, productionized it at Apple and scaled it out to handle a huge amount of operational data, based on the stack of Kafka, Cassandra, Scala/Akka.
Breakthrough OLAP performance with Cassandra and Spark (Evan Chan)
Find out about breakthrough architectures for fast OLAP performance querying Cassandra data with Apache Spark, including a new open source project, FiloDB.
Productionizing Spark and the Spark Job Server (Evan Chan)
You won't find this in many places - an overview of deploying, configuring, and running Apache Spark, including Mesos vs YARN vs Standalone clustering modes, useful config tuning parameters, and other tips from years of using Spark in production. Also, learn about the Spark Job Server and how it can help your organization deploy Spark as a RESTful service, track Spark jobs, and enable fast queries (including SQL!) of cached RDDs.
Everyone in the Scala world is using or looking into using Akka for low-latency, scalable, distributed or concurrent systems. I'd like to share my story of developing and productionizing multiple Akka apps, including low-latency ingestion and real-time processing systems, and Spark-based applications.
When does one use actors vs futures?
Can we use Akka with, or in place of, Storm?
How did we set up instrumentation and monitoring in production?
How does one use VisualVM to debug Akka apps in production?
What happens if the mailbox gets full?
What is our Akka stack like?
I will share best practices for building Akka and Scala apps, pitfalls and things we'd like to avoid, and a vision of where we would like to go for ideal Akka monitoring, instrumentation, and debugging facilities. Plus backpressure and at-least-once processing.
MIT lecture - Socrata Open Data Architecture (Evan Chan)
Socrata is a software company that provides an open data platform to enable governments to publish and share data with the public and developers in order to spur innovation; their platform allows users to find, explore, and analyze datasets through tools for visualization, analysis, and application building. The document discusses Socrata's architecture and technologies that power their open data platform and allow it to handle large volumes of data and queries in a scalable way.
How do you rapidly derive complex insights on top of really big data sets in Cassandra? This session draws upon Evan's experience building a distributed, interactive, columnar query engine on top of Cassandra and Spark. We will start by surveying the existing query landscape of Cassandra and discuss ways to integrate Cassandra and Spark. We will dive into the design and architecture of a fast, column-oriented query architecture for Spark, and why columnar stores are so advantageous for OLAP workloads. I will present a schema for Parquet-like storage of analytical datasets on Cassandra. Find out why Cassandra and Spark are the perfect match for enabling fast, scalable, complex querying and storage of big analytical data.
Spark Job Server and Spark as a Query Engine (Spark Meetup 5/14) (Evan Chan)
This was a talk that Kelvin Chu and I just gave at the SF Bay Area Spark Meetup 5/14 at Palantir Technologies.
We discussed the Spark Job Server (http://github.com/ooyala/spark-jobserver), its history, example workflows, architecture, and exciting future plans to provide HA spark job contexts.
We also discussed the use case of the job server at Ooyala to facilitate fast query jobs using shared RDD and a shared job context, and how we integrate with Apache Cassandra.
2. Who am I
User and contributor to Spark since 0.9, Cassandra since 0.6
Created Spark Job Server and FiloDB
Talks at Spark Summit, Cassandra Summit, Strata, Scala Days, etc.
http://velvia.github.io/
5. Why are Updates Important?
Appends
Streaming workloads. Add new data continuously.
Real data is *always* changing. Queries on live real-time data have business benefits.
Updates
Idempotency = really simple ingestion pipelines
Simpler streaming: can later update late events (see Spark 2.0 Structured Streaming)
6. Introducing FiloDB
A distributed, versioned, columnar analytics
database. With updates. Built for streaming.
http://www.github.com/filodb/FiloDB
7. Fast Analytics Storage
• Scan speeds competitive with Apache Parquet
• In-memory version significantly faster
• Flexible filtering along two dimensions
• Much more efficient and flexible partition key filtering
• Efficient columnar storage using dictionary encoding and other techniques
• Updatable
• Spark SQL for easy BI integration
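As a quick illustration of the Spark SQL / BI integration, reading a FiloDB dataset as a DataFrame looks roughly like this. This is a sketch from memory of the FiloDB Spark data source: the "filodb.spark" format string and the "dataset" option reflect how it worked around this time but may differ between versions, and "gdelt" is just an example dataset name.

import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)   // sc: SparkContext, e.g. from the Spark shell (Spark 1.x era)
// Load a FiloDB dataset through the Spark data source API
val df = sqlContext.read.format("filodb.spark").option("dataset", "gdelt").load()
df.registerTempTable("gdelt")
sqlContext.sql("SELECT count(*) FROM gdelt").show()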
9. 100% Reactive
• Scala
• Akka Cluster
• Spark
• Typesafe Config for all configuration
• Scodec, Ficus, Enumeratum, Scalactic, etc.
• Even most of the performance critical parts are written in Scala :)
10. Scala, Akka, and Spark
• Akka - eliminate shared mutable state
• Remote and cluster make building distributed client-server architectures easy
• Backpressure and at-least-once delivery are easy to build
• Failure handling and supervision are critical for databases
• Spark for SQL, DataFrames, ML, interfacing
13. Akka vs Futures
• Akka Actors:
• External FiloDB node API (remote + cluster)
• Async messaging with clients
• State management and scheduling (flushing)
• Futures:
• Core I/O
• Columnar data processing / ingestion
• Type-safe processing stages
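A minimal sketch of that split: the actor owns mutable state and client messaging, while core I/O returns Futures whose results are piped back to the actor as messages. The names here (WriteSegment, SegmentWritten, the columnStore function) are illustrative, not FiloDB's actual protocol.

import akka.actor.Actor
import akka.pattern.pipe
import scala.concurrent.Future

case class WriteSegment(segment: String)
case class SegmentWritten(segment: String)

class DatasetCoordinator(columnStore: String => Future[Unit]) extends Actor {
  import context.dispatcher              // ExecutionContext for map/pipeTo
  private var outstandingFlushes = 0     // mutable state lives only inside the actor

  def receive: Receive = {
    case WriteSegment(seg) =>
      outstandingFlushes += 1
      // Core I/O is Future-based; its completion comes back to the actor as a message
      columnStore(seg).map(_ => SegmentWritten(seg)).pipeTo(self)
    case SegmentWritten(_) =>
      outstandingFlushes -= 1
  }
}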
14. Akka for Control Flow
(Architecture diagram: a Client on the Spark Driver sends Flush() through the SingletonClusterProxy and NodeClusterActor, which fan out to the NCA and DsCA1/DsCA2 actors running on each Executor.)
15. Yes, Akka in Spark
• Columnar ingestion is stateful - need stickiness of state. This is inherently difficult in Spark.
• Akka (cluster) gives us a separate, asynchronous control channel to talk to FiloDB ingestors
• Spark only gives data flow primitives, not async messaging
• We need to route incoming records to the correct ingestion node. Sorting data is inefficient and forces all nodes to wait for sorting to be done.
• On failure, can control state recovery and moving state
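A rough sketch of the routing idea: look up the ingestion coordinator for each record's partition key and send batches there directly, instead of sorting or shuffling in Spark. The PartitionMap class and the hashing scheme below are illustrative, not FiloDB's actual implementation.

import akka.actor.ActorRef

// Hypothetical partition map: one ingestion coordinator actor per shard
case class PartitionMap(shards: Vector[ActorRef]) {
  def coordinatorFor(partitionKey: String): ActorRef =
    shards(math.abs(partitionKey.hashCode) % shards.size)
}

// Group rows by their target ingestion node and send each group as one batch message
def routeRows(map: PartitionMap, rows: Seq[(String, String)]): Unit =
  rows.groupBy { case (key, _) => map.coordinatorFor(key) }
      .foreach { case (coordinator, batch) => coordinator ! batch }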
16. Data Ingestion Setup
(Diagram: each Executor hosts an NCA with DsCA1/DsCA2 underneath; Spark tasks task0 and task1 each run a Row Source Actor that feeds rows in, with the Node Cluster Actor holding the Partition Map that coordinates the setup.)
17. FiloDB separate nodes
(Diagram: the same ingestion setup, now with separate FiloDB Node processes alongside the two Executors; each Executor shows an NCA, DsCA1/DsCA2, and Row Source Actors in tasks task0 and task1, coordinated by the Node Cluster Actor and its Partition Map.)
19. Backpressure
• Assumes receiver is OK, starts sending rows
• Allows a configurable number of unacked messages before it stops sending
• Acking is the receiver’s way of rate-limiting
• Automatic retries for at-least-once
• NACK for when the receiver must stop (out of memory or MemTable full)
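A stripped-down sketch of that ack-based flow control: the sender keeps at most maxUnacked batches in flight, each Ack frees a slot, and a Nack causes the batch to be retried later for at-least-once delivery. Message names and the retry policy are illustrative, not FiloDB's actual row-source protocol.

import akka.actor.{ Actor, ActorRef }
import scala.concurrent.duration._

case class RowBatch(seqNo: Long, rows: Seq[String])
case class Ack(seqNo: Long)
case class Nack(seqNo: Long)

class RowSender(receiver: ActorRef, maxUnacked: Int, batches: Iterator[Seq[String]]) extends Actor {
  import context.dispatcher
  private var nextSeq = 0L
  private var inFlight = Map.empty[Long, RowBatch]   // unacked batches, bounded by maxUnacked

  override def preStart(): Unit = fillWindow()

  private def fillWindow(): Unit =
    while (inFlight.size < maxUnacked && batches.hasNext) {
      val batch = RowBatch(nextSeq, batches.next())
      inFlight += nextSeq -> batch
      receiver ! batch
      nextSeq += 1
    }

  def receive: Receive = {
    case Ack(seq) =>
      inFlight -= seq       // receiver kept up: free a slot and keep sending
      fillWindow()
    case Nack(seq) =>
      // receiver must stop (e.g. MemTable full): resend the batch later, giving at-least-once
      inFlight.get(seq).foreach { batch =>
        context.system.scheduler.scheduleOnce(1.second, receiver, batch)
      }
  }
}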
20. Testing Akka Cluster
• MultiNodeSpec / sbt-multi-jvm
• AWESOME
• Test multi-node message routing
• Test cluster membership and subscription
• Inject network failures
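To give a flavor of MultiNodeSpec, here is a minimal sketch modeled on the Akka documentation's multi-node sample. It assumes the usual sbt-multi-jvm setup and the ScalaTest-based STMultiNodeSpec helper trait from the Akka docs; the role and actor names are illustrative.

import akka.actor.{ Actor, Props }
import akka.remote.testkit.{ MultiNodeConfig, MultiNodeSpec }
import akka.testkit.ImplicitSender

object RoutingConfig extends MultiNodeConfig {
  val node1 = role("node1")
  val node2 = role("node2")
}

// One concrete class per JVM that sbt-multi-jvm launches
class RoutingSpecMultiJvmNode1 extends RoutingSpec
class RoutingSpecMultiJvmNode2 extends RoutingSpec

class Ponger extends Actor {
  def receive = { case "ping" => sender() ! "pong" }
}

class RoutingSpec extends MultiNodeSpec(RoutingConfig)
    with STMultiNodeSpec with ImplicitSender {
  import RoutingConfig._

  def initialParticipants = roles.size

  "Multi-node message routing" must {
    "send to and receive from a remote node" in {
      runOn(node2) {
        system.actorOf(Props[Ponger], "ponger")
        enterBarrier("deployed")
      }
      runOn(node1) {
        enterBarrier("deployed")
        system.actorSelection(node(node2) / "user" / "ponger") ! "ping"
        expectMsg("pong")
      }
      enterBarrier("finished")
    }
  }
}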
21. Core: All Futures
/**
* Clears all data from the column store for that given projection, for all versions.
* More like a truncation, not a drop.
* NOTE: please make sure there are no reprojections or writes going on before calling this
*/
def clearProjectionData(projection: Projection): Future[Response]
/**
* Completely and permanently drops the dataset from the column store.
* @param dataset the DatasetRef for the dataset to drop.
*/
def dropDataset(dataset: DatasetRef): Future[Response]
/**
* Appends the ChunkSets and incremental indices in the segment to the column store.
* @param segment the ChunkSetSegment to write / merge to the columnar store
* @param version the version # to write the segment to
* @return Success. Future.failure(exception) otherwise.
*/
def appendSegment(projection: RichProjection,
                  segment: ChunkSetSegment,
                  version: Int): Future[Response]
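Because everything returns Future[Response], calls compose directly with for-comprehensions. A hypothetical usage sketch (ColumnStore is assumed to be the trait carrying the methods above; the helper itself is made up):

import scala.concurrent.{ ExecutionContext, Future }

// Clear old data for a projection, then write one segment
def rewriteSegment(store: ColumnStore,
                   projection: Projection,
                   richProjection: RichProjection,
                   segment: ChunkSetSegment)
                  (implicit ec: ExecutionContext): Future[Response] =
  for {
    _    <- store.clearProjectionData(projection)
    resp <- store.appendSegment(richProjection, segment, version = 0)
  } yield resp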
23. Kamon Tracing
• http://kamon.io
• One trace can encapsulate multiple Future steps all executing on different threads
• Tunable tracing levels
• Summary stats and histograms for segments
• Super useful for production debugging of a reactive stack
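A rough sketch of how such a trace and its segments might be created with the Kamon 0.6-era API in use at the time (signatures may differ in later Kamon versions; the category/library tags and the function parameters are illustrative, and the trace and segment names simply mirror the stats on the next slide). The work is shown synchronously for brevity; with Futures the same trace context is propagated across threads and the segments are finished in the Future callbacks.

import kamon.trace.Tracer

def appendSegmentTraced(writeChunks: () => Unit, writeIndex: () => Unit): Unit =
  Tracer.withNewContext("append-segment", autoFinish = true) {
    val chunks = Tracer.currentContext.startSegment("write-chunks", "column-store", "filodb")
    writeChunks()
    chunks.finish()

    val index = Tracer.currentContext.startSegment("write-index", "column-store", "filodb")
    writeIndex()
    index.finish()
  }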
24. Kamon Metrics
• Uses HDRHistogram for much finer and more accurate buckets
• Built-in metrics for Akka actors, Spray, Akka-Http, Play, etc. etc.
KAMON trace name=append-segment n=2863 min=765952 p50=2113536 p90=3211264 p95=3981312 p99=9895936 p999=16121856 max=19529728
KAMON trace-segment name=write-chunks n=2864 min=436224 p50=1597440 p90=2637824 p95=3424256 p99=9109504 p999=15335424 max=18874368
KAMON trace-segment name=write-index n=2863 min=278528 p50=432128 p90=544768 p95=598016 p99=888832 p999=2260992 max=8355840
25. Validation: Scalactic
private def getColumnsFromNames(allColumns: Seq[Column],
                                columnNames: Seq[String]): Seq[Column] Or BadSchema = {
  if (columnNames.isEmpty) {
    Good(allColumns)
  } else {
    val columnMap = allColumns.map { c => c.name -> c }.toMap
    val missing = columnNames.toSet -- columnMap.keySet
    if (missing.nonEmpty) { Bad(MissingColumnNames(missing.toSeq, "projection")) }
    else { Good(columnNames.map(columnMap)) }
  }
}

for { computedColumns <- getComputedColumns(dataset.name, allColIds, columns)
      dataColumns <- getColumnsFromNames(columns, normProjection.columns)
      richColumns = dataColumns ++ computedColumns
      // scalac has problems dealing with (a, b, c) <- getColIndicesAndType... apparently
      segStuff <- getColIndicesAndType(richColumns, Seq(normProjection.segmentColId), "segment")
      keyStuff <- getColIndicesAndType(richColumns, normProjection.keyColIds, "row")
      partStuff <- getColIndicesAndType(richColumns, dataset.partitionColumns, "partition") }
yield {
• Notice how multiple validations compose!
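For completeness, a small sketch of how such an Or result might be consumed by pattern matching on Good versus Bad (this ignores that the method above is private; allColumns and the column names are placeholders):

import org.scalactic.{ Good, Bad }

getColumnsFromNames(allColumns, Seq("timestamp", "value")) match {
  case Good(cols)     => println(s"Resolved ${cols.size} columns")
  case Bad(badSchema) => println(s"Invalid schema: $badSchema")
}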
27. Filo: High Performance Binary Vectors
• Designed for NoSQL, not a file format
• random or linear access
• on or off heap
• missing value support
• Scala only, but cross-platform support possible
http://github.com/velvia/filo is a binary data vector library designed for extreme read performance with minimal deserialization costs.
28. Billions of Ops / Sec
• JMH benchmark: 0.5ns per FiloVector element access / add
• 2 Billion adds per second - single threaded
• Who said Scala cannot be fast?
• Spark API (row-based) limits performance significantly
val randomInts = (0 until numValues).map(i => util.Random.nextInt)
val randomIntsAray = randomInts.toArray
val filoBuffer = VectorBuilder(randomInts).toFiloBuffer
val sc = FiloVector[Int](filoBuffer)
@Benchmark
@BenchmarkMode(Array(Mode.AverageTime))
@OutputTimeUnit(TimeUnit.MICROSECONDS)
def sumAllIntsFiloApply(): Int = {
  var total = 0
  for { i <- 0 until numValues optimized } {
    total += sc(i)
  }
  total
}