My presentation at the Toronto Scala and Typesafe User Group: http://www.meetup.com/Toronto-Scala-Typesafe-User-Group/events/224034596/.
Data processing platforms architectures with Spark, Mesos, Akka, Cassandra an... by Anton Kirillov
This talk is about architecture designs for data processing platforms based on the SMACK stack, which stands for Spark, Mesos, Akka, Cassandra and Kafka. The main topics of the talk are:
- SMACK stack overview
- storage layer layout
- fixing NoSQL limitations (joins and group by; see the sketch after this list)
- cluster resource management and dynamic allocation
- reliable scheduling and execution at scale
- different options for getting the data into your system
- preparing for failures with proper backup and patching strategies
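As referenced in the list above, the "fixing NoSQL limitations" point is typically realized by letting Spark do the join and group-by that Cassandra itself lacks. A hedged sketch with the Spark Cassandra Connector; the shop keyspace and its tables are invented for the example:

    import com.datastax.spark.connector._
    import org.apache.spark.SparkContext

    // Cassandra (of that era) has no joins or GROUP BY, so Spark reads both
    // tables as RDDs and performs the relational work itself.
    def topSpendersPerCity(sc: SparkContext) = {
      val users = sc.cassandraTable[(String, String)]("shop", "users")   // (user_id, city)
        .select("user_id", "city")
      val orders = sc.cassandraTable[(String, Double)]("shop", "orders") // (user_id, total)
        .select("user_id", "total")
      users.join(orders)                        // the join Cassandra can't do
        .map { case (_, (city, total)) => (city, total) }
        .reduceByKey(_ + _)                     // the GROUP BY Cassandra can't do
    }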
Using Spark, Kafka, Cassandra and Akka on Mesos for Real-Time Personalization by Patrick Di Loreto
The gambling industry has arguably been one of the industries most comprehensively affected by the internet revolution, and if an organization such as William Hill hadn't adapted successfully it would have disappeared. We call this “Going Reactive.”
The company's latest innovations are cutting-edge platforms for personalization, recommendation, and big data, based on Akka, Scala, Play Framework, Kafka, Cassandra, Spark, and Mesos.
Everyone in the Scala world is using or looking into using Akka for low-latency, scalable, distributed or concurrent systems. I'd like to share my story of developing and productionizing multiple Akka apps, including low-latency ingestion and real-time processing systems, and Spark-based applications.
When does one use actors vs futures?
Can we use Akka with, or in place of, Storm?
How did we set up instrumentation and monitoring in production?
How does one use VisualVM to debug Akka apps in production?
What happens if the mailbox gets full?
What is our Akka stack like?
I will share best practices for building Akka and Scala apps, pitfalls and things we'd like to avoid, and a vision of where we would like to go for ideal Akka monitoring, instrumentation, and debugging facilities. Plus backpressure and at-least-once processing.
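On the "what happens if the mailbox gets full" question, one standard Akka answer is a bounded mailbox. A minimal sketch, assuming Akka classic (2.x) configuration keys; the capacity and timeout values are illustrative:

    import akka.actor.{Actor, ActorSystem, Props}
    import com.typesafe.config.ConfigFactory

    class Worker extends Actor {
      def receive = { case msg => () } // stand-in for real work
    }

    object BoundedMailboxDemo extends App {
      val config = ConfigFactory.parseString("""
        bounded-mailbox {
          mailbox-type = "akka.dispatch.BoundedMailbox"
          mailbox-capacity = 1000
          mailbox-push-timeout-time = 100ms
        }
      """).withFallback(ConfigFactory.load())

      val system = ActorSystem("demo", config)
      // Senders block for up to 100ms once 1000 messages queue up; after that
      // the message goes to dead letters instead of growing the heap.
      val worker = system.actorOf(Props[Worker].withMailbox("bounded-mailbox"), "worker")
      (1 to 5000).foreach(worker ! _)
    }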
Typesafe & William Hill: Cassandra, Spark, and Kafka - The New Streaming Data... by DataStax Academy
Typesafe did a survey of Spark usage last year and found that a large percentage of Spark users combine it with Cassandra and Kafka. This talk focuses on streaming data scenarios that demonstrate how these three tools complement each other for building robust, scalable, and flexible data applications. Cassandra provides resilient and scalable storage, with flexible data format and query options. Kafka provides durable, scalable collection of streaming data with message-queue semantics. Spark provides very flexible analytics, everything from classic SQL queries to machine learning and graph algorithms, running in a streaming model based on "mini-batches", offline batch jobs, or interactive queries. We'll consider best practices and areas where improvements are needed.
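A hedged sketch of the three tools composed exactly as described: Kafka as the durable source, Spark Streaming mini-batches in the middle, Cassandra as the sink. It assumes the Spark 1.x-era direct-stream API and the Spark Cassandra Connector; the keyspace, table, topic and host names are invented:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils
    import kafka.serializer.StringDecoder
    import com.datastax.spark.connector.streaming._
    import com.datastax.spark.connector.SomeColumns

    object PipelineSketch extends App {
      val conf = new SparkConf()
        .setAppName("kafka-spark-cassandra")
        .set("spark.cassandra.connection.host", "127.0.0.1")
      val ssc = new StreamingContext(conf, Seconds(5)) // 5s mini-batches

      val kafkaParams = Map("metadata.broker.list" -> "localhost:9092")
      val events = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
        ssc, kafkaParams, Set("events"))

      // Count events per key in each mini-batch and persist to Cassandra.
      events.map { case (k, _) => (k, 1L) }
        .reduceByKey(_ + _)
        .saveToCassandra("demo_ks", "event_counts", SomeColumns("key", "count"))

      ssc.start()
      ssc.awaitTermination()
    }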
NoLambda: Combining Streaming, Ad-Hoc, Machine Learning and Batch Analysis by Helena Edelson
Slides from my talk with Evan Chan at Strata San Jose: NoLambda: Combining Streaming, Ad-Hoc, Machine Learning and Batch Analysis. Streaming analytics architecture in big data for fast streaming, ad hoc and batch, with Kafka, Spark Streaming, Akka, Mesos, Cassandra and FiloDB. Simplifying to a unified architecture.
Lessons Learned From PayPal: Implementing Back-Pressure With Akka Streams And... by Lightbend
Akka Streams and its amazing handling of streaming with back-pressure should be no surprise to anyone. But it takes a couple of use cases to really see it in action - especially in use cases where the amount of work continues to increase as you’re processing it. This is where back-pressure really shines.
In this talk for Architects and Dev Managers by Akara Sucharitakul, Principal MTS for Global Platform Frameworks at PayPal, Inc., we look at how back-pressure based on Akka Streams and Kafka is being used at PayPal to handle very bursty workloads.
In addition, Akara will also share experiences in creating a platform based on Akka and Akka Streams that currently processes over 1 billion transactions per day (on just 8 VMs), with the aim of helping teams adopt these technologies. In this webinar, you will:
*Start with a sample web crawler use case to examine what happens when each processing pass expands to a larger and larger workload to process.
*Review how we use the buffering capabilities in Kafka and the back-pressure with asynchronous processing in Akka Streams to handle such bursts (sketched after this list).
*Look at lessons learned, plus some constructive “rants” about the architectural components, the maturity, or immaturity you’ll expect, and tidbits and open source goodies like memory-mapped stream buffers that can be helpful in other Akka Streams and/or Kafka use cases.
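As referenced in the list above, the buffer-then-backpressure shape looks roughly like this in Akka Streams (2.4/2.5-era API); the fetch function, sizes and URLs are placeholders:

    import akka.actor.ActorSystem
    import akka.stream.{ActorMaterializer, OverflowStrategy}
    import akka.stream.scaladsl.{Sink, Source}
    import scala.concurrent.Future

    object BackpressureSketch extends App {
      implicit val system = ActorSystem("crawl")
      implicit val mat = ActorMaterializer()
      import system.dispatcher

      // Hypothetical fetch: each page may expand into more downstream work.
      def fetch(url: String): Future[String] = Future(s"<html>$url</html>")

      Source(1 to 100000)
        .map(n => s"http://example.com/page/$n")
        .buffer(1024, OverflowStrategy.backpressure) // absorb bursts, then push back
        .mapAsync(parallelism = 8)(fetch)            // bounded concurrency
        .runWith(Sink.foreach(_ => ()))              // the slow consumer drives demand
    }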
Using the SDACK Architecture to Build a Big Data Product by Evans Ye
You have definitely heard about the SMACK architecture, which stands for Spark, Mesos, Akka, Cassandra, and Kafka; it's especially suitable for building a lambda architecture system. But what is SDACK? It's very similar to SMACK, except that the “D” stands for Docker. While SMACK is an enterprise-scale, multi-tenant solution, the SDACK architecture is particularly suitable for building a data product. In this talk, I'll cover the advantages of the SDACK architecture and how TrendMicro uses it to build an anomaly detection data product. The talk will cover:
1) The architecture we designed based on SDACK to support both batch and streaming workload.
2) The data pipeline built on Akka Streams, which is flexible, scalable, and self-healing.
3) The Cassandra data model designed to support time series data writes and reads.
Lambda Architecture with Spark Streaming, Kafka, Cassandra, Akka, Scala by Helena Edelson
Scala Days, Amsterdam, 2015: Lambda Architecture - Batch and Streaming with Spark, Cassandra, Kafka, Akka and Scala; Fault Tolerance, Data Pipelines, Data Flows, Data Locality, Akka Actors, Spark, Spark Cassandra Connector, Big Data, Asynchronous data flows. Time series data, KillrWeather, Scalable Infrastructure, Partition For Scale, Replicate For Resiliency, Parallelism, Isolation, Data Locality, Location Transparency
Lambda Architecture with Spark, Spark Streaming, Kafka, Cassandra, Akka and S... by Helena Edelson
Regardless of the meaning we are searching for in our vast amounts of data, whether we are in science, finance, technology, energy, or health care, we all share the same problems that must be solved: How do we achieve that? What technologies best support the requirements? This talk is about how to leverage fast access to historical data together with real-time streaming data for predictive modeling in a lambda architecture with Spark Streaming, Kafka, Cassandra, Akka and Scala. Efficient stream computation, composable data pipelines, data locality, the Cassandra data model and low latency, Kafka producers and HTTP endpoints as Akka actors...
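The last point, Kafka producers as Akka actors, is concrete enough to sketch. Below is a minimal, hedged example of wrapping a Kafka producer in an actor so ingestion endpoints can fire events at it; the RawEvent type, topic name and broker address are illustrative, not from the talk:

    import java.util.Properties
    import akka.actor.{Actor, ActorSystem, Props}
    import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

    // Hypothetical message type for raw events arriving over HTTP.
    case class RawEvent(key: String, payload: String)

    class KafkaProducerActor(topic: String) extends Actor {
      private val props = new Properties()
      props.put("bootstrap.servers", "localhost:9092")
      props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
      props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
      private val producer = new KafkaProducer[String, String](props)

      def receive = {
        case RawEvent(key, payload) =>
          producer.send(new ProducerRecord(topic, key, payload)) // async, fire-and-forget
      }

      override def postStop(): Unit = producer.close()
    }

    object Ingest extends App {
      val system = ActorSystem("ingest")
      val kafka = system.actorOf(Props(new KafkaProducerActor("raw_events")))
      kafka ! RawEvent("station-1", """{"temp": 21.4}""")
    }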
Reactive dashboards using Apache Spark by Rahul Kumar
A tutorial talk on Apache Spark: how to start working with Spark, its features, and how to compose a data platform with it. The talk also covers the reactive platform and tools and frameworks like Play and Akka.
Kafka Lambda architecture with mirroring by Anant Rustagi
This document outlines a master plan for a lambda architecture that mirrors data from multiple Kafka clusters into a Hadoop cluster for batch processing and analytics, alongside real-time processing with Storm/Spark on the mirrored data in the Kafka clusters; data from various sources is integrated into the Kafka clusters under the topic name "Data".
Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa... by Helena Edelson
O'Reilly Webcast with myself and Evan Chan on the new SNACK stack (a play on SMACK) with FiloDB: Scala, Spark Streaming, Akka, Cassandra, FiloDB and Kafka.
Feeding Cassandra with Spark-Streaming and Kafka by DataStax Academy
In this session we will examine a sample application that simulates an IoT stream flowing through Kafka and Spark Streaming into Cassandra. The session will discuss the implementation details, including the Kafka design considerations, Spark Streaming functionality (including working with windowing to achieve analytics), and finally Cassandra time series data model considerations. The example is based on OSS Kafka and integrated Spark and Cassandra in DSE.
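As a rough illustration of the windowing step (not code from the session), here is what a sliding-window aggregation looks like in Spark Streaming, assuming a readings DStream of (sensorId, value) pairs already consumed from Kafka:

    import org.apache.spark.streaming.Seconds
    import org.apache.spark.streaming.dstream.DStream

    // Per-sensor average over a 60-second window, recomputed every 10 seconds.
    def windowedAverages(readings: DStream[(String, Double)]): DStream[(String, Double)] =
      readings
        .groupByKeyAndWindow(Seconds(60), Seconds(10)) // window length, slide interval
        .mapValues(vs => vs.sum / vs.size)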
A Tale of Two APIs: Using Spark Streaming In Production by Lightbend
Fast Data architectures are the answer to the increasing need for the enterprise to process and analyze continuous streams of data to accelerate decision making and become reactive to the particular characteristics of their market.
Apache Spark is a popular framework for data analytics. Its capabilities include SQL-based analytics, dataflow processing, graph analytics and a rich library of built-in machine learning algorithms. These libraries can be combined to address a wide range of requirements for large-scale data analytics.
To address Fast Data flows, Spark offers two APIs: the mature Spark Streaming and its younger sibling, Structured Streaming. In this talk, we are going to introduce both APIs. Using practical examples, you will get a taste of each one and obtain guidance on how to choose the right one for your application.
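To make the contrast concrete, here is a hedged word-count sketch in both APIs; the socket source is just a stand-in for a real stream:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.streaming.StreamingContext

    object TwoApis {
      // Spark Streaming (DStreams): RDDs processed in fixed mini-batches.
      def dstreamWordCount(ssc: StreamingContext): Unit =
        ssc.socketTextStream("localhost", 9999)
          .flatMap(_.split(" "))
          .map((_, 1))
          .reduceByKey(_ + _)
          .print()

      // Structured Streaming: an unbounded DataFrame, incrementally executed.
      def structuredWordCount(spark: SparkSession): Unit = {
        import spark.implicits._
        val lines = spark.readStream
          .format("socket")
          .option("host", "localhost").option("port", 9999)
          .load()
        val counts = lines.as[String].flatMap(_.split(" ")).groupBy("value").count()
        counts.writeStream.outputMode("complete").format("console").start()
      }
    }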
The document discusses the SMACK stack 1.1, which includes tools for streaming, Mesos, analytics, Cassandra, and Kafka. It describes how SMACK stack 1.1 adds capabilities for dynamic compute, microservices, orchestration, and microsegmentation. It also provides examples of running Storm on Mesos and using Apache Kafka for decoupling data pipelines.
Real-Time Anomaly Detection with Spark MLlib, Akka and Cassandra by Natalino Busa
We present a solution for streaming anomaly detection, named “Coral”, based on Spark, Akka and Cassandra. In the system presented, we use Spark to run the data analytics pipeline for anomaly detection. By running Spark on the latest events and data, we make sure that the model is always up to date and that the number of false positives is kept low, even under changing trends and conditions. Our machine learning pipeline uses Spark decision tree ensembles and k-means clustering. Once the model is trained by Spark, the model's parameters are pushed to the Streaming Event Processing Layer, implemented in Akka. The Akka layer then scores thousands of events per second against the last model provided by Spark. Spark and Akka communicate with each other using Cassandra as a low-latency data store. By doing so, we make sure that every element of this solution is resilient and distributed. Spark performs micro-batches to keep the model up to date while Akka detects new anomalies using the latest Spark-generated data model. The project is currently hosted on Github. Have a look at: http://coral-streaming.github.io
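A minimal sketch of the scoring side of such a design (not the actual Coral code): the Akka layer holds the latest Spark-generated centroids and flags events whose distance to every centroid exceeds a threshold. The Event and ModelUpdate types, and the Cassandra hand-off they stand in for, are assumptions:

    import akka.actor.Actor
    import org.apache.spark.mllib.linalg.{Vector, Vectors}

    case class Event(features: Array[Double])
    case class ModelUpdate(centers: Array[Vector]) // published by the Spark side

    class ScoringActor(threshold: Double) extends Actor {
      private var centers: Array[Vector] = Array.empty

      def receive = {
        case ModelUpdate(cs) =>
          centers = cs // swap in the latest Spark-generated model
        case Event(fs) if centers.nonEmpty => // events before the first model are dropped
          val v = Vectors.dense(fs)
          val dist = centers.map(c => Vectors.sqdist(c, v)).min
          if (dist > threshold)
            context.system.log.info(s"anomaly: $v (sqdist=$dist)")
      }
    }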
Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar) by Helena Edelson
This document provides an overview of streaming big data with Spark, Kafka, Cassandra, Akka, and Scala. It discusses delivering meaning in near-real time at high velocity and gives an overview of Spark Streaming, Kafka and Akka. It also covers Cassandra and the Spark Cassandra Connector, as well as integration in big data applications. The presentation is given by Helena Edelson, a Spark Cassandra Connector committer and Akka contributor who is a Scala and big data conference speaker working as a senior software engineer at DataStax.
Are you tired of struggling with your existing data analytic applications?
When MapReduce first emerged it was a great boon to the big data world, but modern big data processing demands have outgrown this framework.
That’s where Apache Spark steps in, boasting speeds 10-100x faster than Hadoop and setting the world record in large-scale sorting. Spark’s general abstraction means it can expand beyond simple batch processing, making it capable of such things as blazing-fast iterative algorithms and exactly-once streaming semantics. This, combined with its interactive shell, makes it a powerful tool useful for everybody, from data tinkerers to data scientists to data developers.
Recipes for Running Spark Streaming Applications in Production (Tathagata Das... by Spark Summit
This document summarizes key aspects of running Spark Streaming applications in production, including fault tolerance, performance, and monitoring. It discusses how Spark Streaming receives data streams in batches and processes them across executors. It describes how driver and executor failures can be handled by checkpointing the saved DAG information and by write-ahead logs that replicate received data blocks. Restarting the driver from checkpoints then allows the application state to be recovered.
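The driver-recovery recipe described above boils down to a small amount of code. A hedged sketch, with a placeholder checkpoint path:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val checkpointDir = "hdfs:///checkpoints/my-app" // placeholder path

    def createContext(): StreamingContext = {
      val conf = new SparkConf()
        .setAppName("recoverable-app")
        .set("spark.streaming.receiver.writeAheadLog.enable", "true") // WAL for received blocks
      val ssc = new StreamingContext(conf, Seconds(10))
      ssc.checkpoint(checkpointDir) // periodically saves the DAG and metadata
      // ... define the streaming DAG here ...
      ssc
    }

    // On a clean start this builds a fresh context; after a driver failure it
    // rebuilds the context (and state) from the checkpoint instead.
    val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
    ssc.start()
    ssc.awaitTermination()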
This document provides an introduction and overview of Kafka, Spark and Cassandra. It begins with introductions to each technology - Cassandra as a distributed database, Spark as a fast and general engine for large-scale data processing, and Kafka as a platform for building real-time data pipelines and streaming apps. It then discusses how these three technologies can be used together to build a complete data pipeline for ingesting, processing and analyzing large volumes of streaming data in real-time while storing the results in Cassandra for fast querying.
Since 2014, Typesafe has been actively contributing to the Apache Spark project, and has become a certified development support partner of Databricks, the company started by the creators of Spark. Typesafe and Mesosphere have forged a partnership in which Typesafe is the official commercial support provider of Spark on Apache Mesos, along with Mesosphere’s Datacenter Operating Systems (DCOS).
In this webinar with Iulian Dragos, Spark team lead at Typesafe Inc., we reveal how Typesafe supports running Spark in various deployment modes, along with the improvements we made to Spark to help integrate backpressure signals into the underlying technologies, making it a better fit for Reactive Streams. He also shows you the functionality at work, and how simple it is to deploy Spark on Mesos with Typesafe.
We will introduce:
Various deployment modes for Spark: Standalone, Spark on Mesos, and Spark with Mesosphere DCOS
Overview of Mesos and how it relates to Mesosphere DCOS
Deeper look at how Spark runs on Mesos
How to manage coarse-grained and fine-grained scheduling modes on Mesos (see the configuration sketch after this list)
What to know about a client vs. cluster deployment
A demo running Spark on Mesos
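As referenced in the list above, a minimal sketch of pointing a Spark driver at Mesos and choosing coarse-grained mode; the master URL and executor URI are placeholders:

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("spark-on-mesos")
      .setMaster("mesos://zk://zk1:2181/mesos")            // find the leading master via ZooKeeper
      .set("spark.mesos.coarse", "true")                   // coarse-grained mode: long-lived executors
      .set("spark.executor.uri", "hdfs:///dist/spark.tgz") // where agents fetch the Spark binary
    val sc = new SparkContext(conf)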
Many people promise fast data as the next step after big data. The idea of creating a complete end-to-end data pipeline that combines Spark, Akka, Cassandra, Kafka, and Apache Mesos came up two years ago, sometimes called the SMACK stack. The SMACK stack is an ideal environment for handling all sorts of data-processing needs, whether nightly batch-processing tasks, real-time ingestion of sensor data, or business intelligence questions. The SMACK stack includes a lot of components which have to be deployed somewhere. Let's see how we can create a distributed environment in the cloud with Terraform and how we can provision a Mesos cluster with Mesosphere Datacenter Operating System (DC/OS) to create a powerful fast data platform.
Apache Cassandra & Apache Spark for time series data by Patrick McFadin
Apache Cassandra is a distributed database that stores time series data in a partitioned and ordered format. Apache Spark can efficiently query this Cassandra data using Resilient Distributed Datasets (RDDs) and perform analytics like aggregations. For example, weather station data stored sequentially in Cassandra by time can be aggregated into daily high and low temperatures with Spark and written back to a roll-up Cassandra table.
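A hedged sketch of that pattern, with illustrative keyspace, table and column names (not from the talk):

    import com.datastax.spark.connector._
    import org.apache.spark.SparkContext

    // The table layout sketched above, as CQL: one partition per station,
    // rows ordered by time within the partition.
    //   CREATE TABLE weather.raw (
    //     station_id text, ts timestamp, temperature double,
    //     PRIMARY KEY ((station_id), ts)
    //   ) WITH CLUSTERING ORDER BY (ts DESC);

    def dailyRollup(sc: SparkContext): Unit =
      sc.cassandraTable[(String, java.util.Date, Double)]("weather", "raw")
        .select("station_id", "ts", "temperature")
        .map { case (station, ts, temp) =>
          val day = new java.text.SimpleDateFormat("yyyy-MM-dd").format(ts)
          ((station, day), temp)
        }
        .groupByKey()
        .map { case ((station, day), temps) =>
          (station, day, temps.max, temps.min) // daily high and low
        }
        .saveToCassandra("weather", "daily_rollup",
          SomeColumns("station_id", "day", "high", "low"))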
Getting Started Running Apache Spark on Apache Mesos by Paco Nathan
This document provides an overview of Apache Mesos and how to run Apache Spark on a Mesos cluster. It describes Mesos as a distributed systems kernel that allows sharing compute resources across applications. It then gives step-by-step instructions for launching a Mesos cluster in AWS, configuring and running Spark jobs on the cluster, and where to find example Spark jobs and further Mesos resources.
Building large-scale analytics platform with Storm, Kafka and Cassandra - NYC... by Alexey Kharlamov
At Integral, we process heavy volumes of click-stream traffic: 50K QPS of ad impressions at peak and close to 200K QPS of all browser calls. We build analytics on these streams of data. There are two applications which require quite significant computational effort: 'sessionization' and fraud detection.
Sessionization implies linking a series of requests from the same browser into a single record. There can be 5 or more requests in total, spread over 15-30 minutes, which we need to link to each other.
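For illustration only, here is an in-memory version of that linking rule in plain Scala; the production system does this in a streaming topology, but the logic is the same:

    // Requests from the same browser are linked into one session as long as
    // the gap between consecutive requests stays under 30 minutes.
    case class Request(browserId: String, ts: Long) // ts in millis
    case class Session(browserId: String, requests: Vector[Request])

    def sessionize(requests: Seq[Request], gapMs: Long = 30 * 60 * 1000L): Seq[Session] =
      requests.groupBy(_.browserId).toSeq.flatMap { case (browser, rs) =>
        rs.sortBy(_.ts).foldLeft(Vector.empty[Vector[Request]]) { (sessions, r) =>
          sessions.lastOption match {
            case Some(cur) if r.ts - cur.last.ts < gapMs =>
              sessions.init :+ (cur :+ r) // extend the current session
            case _ =>
              sessions :+ Vector(r)       // start a new session
          }
        }.map(Session(browser, _))
      }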
Fraud detection is a process that looks at various signals in browser requests, and at substantial historical evidence data, to classify an ad impression as either legitimate or fraudulent.
We've been doing both (as well as all other analytics) in batch mode, once an hour at best. Both processes, and fraud detection in particular, are time sensitive and much more meaningful if done in near-real-time.
This talk is about our experience migrating once-per-day offline batch processing of impression data using Hadoop to in-memory stream processing using Kafka, Storm and Cassandra. We will touch upon our choices and our reasoning for selecting the products used for this solution.
Hadoop is no longer the only, or even always the preferred, option in the Big Data space. In-memory stream processing may be more effective for time series data preparation and aggregation. The ability to scale at a significantly lower cost means more customers, better accuracy and better business practices: since only in-stream processing allows for low-latency data and insight delivery, it opens entirely new opportunities. However, transitioning non-trivial data pipelines raises a number of questions previously hidden within the offline nature of batch processing. How will you join several data feeds? How will you implement failure recovery? In addition to handling terabytes of data per day, our streaming system has to be guided by the following considerations:
• Recovery time
• Time relativity and continuity
• Geographical distribution of data sources
• Limit on data loss
• Maintainability
The system produces complex cross-correlational analysis of several data feeds and aggregation for client analytics with input feed frequency of up to 100K msg/sec.
This presentation will benefit anyone interested in learning an alternative approach to big data analytics, especially the process of joining multiple streams in memory using Cassandra. The presentation will also highlight certain optimization patterns that can be useful in similar situations.
Most microservices are stateless: they delegate things like persistence and consistency to a database or external storage. But sometimes you benefit from keeping state inside the application. In this talk I'm going to discuss why you might want to build stateful microservices and the design choices to make. I'll use the Akka framework, explain tools like Akka Clustering and Akka Persistence in depth, and show a few practical examples.
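A minimal sketch of the Akka Persistence building block the talk centers on, a persistent actor that journals events before applying them; the counter domain is invented for the example:

    import akka.persistence.PersistentActor

    case object Increment
    case class Incremented(delta: Int)

    class CounterActor extends PersistentActor {
      override def persistenceId = "counter-1"
      private var count = 0

      def receiveCommand = {
        case Increment =>
          persist(Incremented(1)) { ev => // the event is journaled before state changes
            count += ev.delta
            sender() ! count
          }
      }

      def receiveRecover = {
        case Incremented(d) => count += d // replayed from the journal on restart
      }
    }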
This document summarizes a presentation about monitoring Cassandra systems. It discusses gathering metrics from Cassandra using JMX and nodetool, including thread pool statistics, latency histograms, and metric types. It also provides an overview of the Cassandra read/write process involving memtables and SSTables.
The document discusses the challenges of managing large volumes of data from various sources in a traditional divided approach. It argues that Hadoop provides a solution by allowing all data to be stored together in a single system and processed as needed. This addresses the problems caused by keeping data isolated in different silos and enables new types of analysis across all available data.
Moving to a data-centric architecture: Toronto Data Unconference 2015 by Adam Muise
Why use a data lake? Why use lambda? A conversation starter for Toronto Data Unconference 2015. We will discuss technologies such as Hadoop, Kafka, Spark Streaming, and Cassandra.
Creating a Data Science Team, from an architect's perspective: how to support a data science team with the right staff, including data engineers and devops.
2015 nov 27_thug_paytm_rt_ingest_brief_final by Adam Muise
The document discusses Paytm Labs' transition from batch data ingestion to real-time data ingestion using Apache Kafka and Confluent. It outlines their current batch-driven pipeline and some of its limitations. Their new approach, called DFAI (Direct-From-App-Ingest), will have applications directly write data to Kafka using provided SDKs. This data will then be streamed and aggregated in real-time using their Fabrica framework to generate views for different use cases. The benefits of real-time ingestion include having fresher data available and a more flexible schema.
Obey The Rules: Implementing a Rules Engine in Flex by RJ Owen
A presentation I gave with Drew McLean at 360|Flex 2010 in San Jose. The presentation covers how to develop a client-side rules engine using Adobe Flex. We discuss rules engine theory and give three sample implementations. I apologize that I cannot upload source files here - please contact us for more information.
Processing 50,000 Events Per Second with Cassandra and Spark (Ben Slater, Ins... by DataStax
Over the last 12 months, Instaclustr has developed a centralised system to capture and analyse monitoring metrics from its managed services fleet of well over 500 Cassandra and Spark nodes. While we entered into the exercise with plenty of experience provisioning and managing Cassandra and Spark, this was our first experience building a significant application. We'll walk through some of the missteps we made and lessons learned along the way, as well as sharing our current solution architecture.
About the Speaker
Ben Slater Chief Product Officer, Instaclustr
Instaclustr provides Cassandra and Spark as a managed service in the cloud. As Chief Product Officer, Ben is charged with steering Instaclustr's development roadmap, managing product engineering and overseeing the production support and consulting teams. Ben has over 20 years experience in systems development including previously as lead architect for the product that is now Oracle Policy Automation and over 10 years as a solution architect and project manager for Accenture.
The document provides an introduction to artificial intelligence, including:
- A brief history of AI from the 1980s "AI winter" period of failed projects through to recent advances enabled by improved hardware and new research areas like machine learning.
- Knowledge representation and reasoning, rule engines, hybrid reasoning systems, and expert systems are introduced as key concepts in AI.
- The advantages of using a rule engine are discussed, as well as when rule engines are appropriate versus other approaches like scripting engines. The Rete algorithm, which is commonly used in rule engines, is also introduced.
Cassandra as an event sourced journal for big data analytics Cassandra Summit... by Martin Zapletal
The document discusses event sourcing and CQRS architectures using technologies like Akka, Cassandra, and Spark. It provides an overview of how event sourcing avoids the limitations of traditional mutable databases by using an immutable write log. It describes how CQRS separates read and write concerns for better scalability. Example architectures show how Akka persistence can store events in Cassandra and provide views of data, while Spark can perform analytics on the full event stream.
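The glue between those pieces is mostly configuration. A hedged sketch using the plugin ids from older versions of the akka-persistence-cassandra plugin; the contact point is a placeholder:

    import akka.actor.ActorSystem
    import com.typesafe.config.ConfigFactory

    val config = ConfigFactory.parseString("""
      akka.persistence.journal.plugin = "cassandra-journal"
      akka.persistence.snapshot-store.plugin = "cassandra-snapshot-store"
      cassandra-journal.contact-points = ["127.0.0.1"]
    """).withFallback(ConfigFactory.load())

    val system = ActorSystem("es-app", config)
    // Persistent actors created in this system now journal their events to
    // Cassandra, and Spark can read that immutable event log for analytics.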
Your enterprise can become truly intelligent
-------------------------------------------
Get there with Red Hat’s JBoss® Enterprise Business Rules Management System (BRMS), a key component of our vision for the intelligent, integrated enterprise. It delivers the power of business rules, complex event processing, and business process management in a single open source distribution—all accessible from a common set of authoring tools.
JBoss Enterprise BRMS supports a broad range of decision-management and process-driven applications with a unique combination of open source technologies:
- jBPM5 business process management
- Drools business rules
- Drools Fusion complex event processing
Build rule, event, and process-driven applications that scale across the enterprise
-------------------------------------------
Discover best practices for constructing BRMS applications that support large numbers of rules operating on big data. We’ll illustrate common use cases with real-world case studies and give you practical tips for estimating computing resource requirements.
The business rules engine allows business users to easily create and manage business rules through intuitive HTML forms. It features built-in review and approval processes, version control of rules for reuse, and a secure audit trail of rule changes. The rules engine can automate workflows; make various types of decisions, including through decision tables and trees; compute values based on changes to related values; allocate tasks based on matrices; drive scoring; and perform credit checks, eligibility checks, and document checklists through picking rules. The rules engine architecture includes components for rule version control, editing, simulation, execution of rules for various functions, and auditing of rule execution history.
Building Reactive Systems with Akka (in Java 8 or Scala) by Jonas Bonér
Learn how to build Reactive Systems with Akka. Examples in both Java 8 and Scala.
Abstract:
The demands and expectations for applications have changed dramatically in recent years. Applications today are deployed on a wide range of infrastructure; from mobile devices up to thousands of nodes running in the cloud—all powered by multi-core processors. They need to be rich and collaborative, have a real-time feel with millisecond response time and should never stop running. Additionally, modern applications are a mashup of external services that need to be consumed and composed to provide the features at hand. We are seeing a new type of applications emerging to address these new challenges—these are being called Reactive Applications.
In this talk we will introduce you to Akka and discuss how it can help you deliver on the four key traits of Reactive; Responsive, Resilient, Elastic and Message-Driven. We will start with the basics of Akka and work our way towards some of its more advanced modules such as Akka Cluster and Akka Persistence—all driven through code and practical examples.
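For readers new to the basics the talk starts from, here is the smallest possible message-driven example; everything about it is generic Akka, not specific to the talk:

    import akka.actor.{Actor, ActorSystem, Props}

    // Message-driven basics: state is touched only while processing one
    // message at a time, so there are no locks in user code.
    class Greeter extends Actor {
      var greeted = 0
      def receive = {
        case name: String =>
          greeted += 1
          sender() ! s"Hello, $name! (#$greeted)"
      }
    }

    object Main extends App {
      val system = ActorSystem("reactive")
      val greeter = system.actorOf(Props[Greeter], "greeter")
      greeter ! "world" // fire-and-forget; the reply goes to deadLetters here
    }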
The document introduces business rules and the JBoss Drools rules engine. It discusses how business rules are used to model business logic and policies, and how Drools uses a Rete algorithm for efficient forward-chaining rule processing. Drools provides tools for authoring, managing, and executing rules through its Expert, Guvnor, Flow and Fusion components.
Akka persistence == event sourcing in 30 minutes by Konrad Malawski
Akka 2.3 introduces akka-persistence, a wonderful way of implementing event-sourced applications. Let's give it a shot and see how DDD and Akka are a match made in heaven :-)
In this talk Ben will walk you through running Cassandra in a docker environment to give you a flexible development environment that uses only a very small set of resources, both locally and with your favorite cloud provider. Lessons learned running Cassandra with a very small set of resources are applicable to both your local development environment and larger, less constrained production deployments.
This document provides an overview of rule engines and the Drools rule engine. It defines key concepts like rules, the ReteOO algorithm, and why use a rule engine. It then describes Drools Expert and the different Drools rule formats. It explains executing rules in Drools and the Drools Eclipse IDE. Finally, it summarizes the Drools Guvnor rule management system and Drools Flow for process automation.
Real-Time Analytics with Apache Cassandra and Apache Spark by Guido Schmutz
This document provides an overview of real-time analytics with Apache Cassandra and Apache Spark. It discusses how Spark can be used for stream processing over Cassandra for storage. Spark Streaming ingests real-time data from sources like Kafka and processes it using Spark transformations and actions. The processed data can be stored in Cassandra for querying. Cassandra is well suited for high write throughput and storing large amounts of data, while Spark enables fast in-memory processing and machine learning capabilities. Together, Spark and Cassandra provide a scalable solution for real-time analytics and querying of large datasets.
This document presents an agenda for a course on NoSQL databases. The course introduces concepts such as cloud computing and distributed databases, then covers specific types of NoSQL databases, such as key-value and document-oriented stores, along with their advantages and disadvantages. It also looks at examples such as Apache Cassandra, Apache CouchDB and MongoDB, and at integration with relational databases.
This document provides concise summaries of key points about Solr:
1. The search architecture, including the use of Thrift for service encapsulation and reduced network traffic. Only IDs are returned from searches to reduce index size and enable easy scaling of primary key lookups.
2. Load balancing, including an algorithm that hashes the query and the number of servers to provide server affinity while distributing load evenly.
3. Replication of the index, including challenges with multicast and an implementation that uses BitTorrent to replicate files efficiently.
During the talk, we will build a simple web app using Lift and then introduce Akka ( http://akkasource.org) to help scale it. Specifically, we will demonstrate Remote Actors, "Let it crash" fail over, and Dispatcher. Other Scala oriented tools we will use include sbt and ENSIME mode for emacs.
This document summarizes a presentation about scaling web applications with Akka. It discusses how Akka uses an actor model of computation with message passing between lightweight processes to enable safe concurrency. Key features of Akka that help with scaling include fault tolerance through supervision, flexible dispatch strategies to leverage multiple cores, and support for NoSQL databases through pluggable storage backends. The presentation provides code examples of implementing actors in Akka and other frameworks and concludes by taking questions about Akka.
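A hedged sketch of the supervision piece, using Akka's classic OneForOneStrategy; the worker and its failure mode are invented for the example:

    import akka.actor.{Actor, ActorSystem, OneForOneStrategy, Props, SupervisorStrategy}
    import akka.actor.SupervisorStrategy.{Escalate, Restart}
    import scala.concurrent.duration._

    // "Let it crash": the parent decides what a child failure means.
    class Worker extends Actor {
      def receive = {
        case n: Int => if (n < 0) throw new IllegalArgumentException else sender() ! n * 2
      }
    }

    class Supervisor extends Actor {
      override val supervisorStrategy: SupervisorStrategy =
        OneForOneStrategy(maxNrOfRetries = 3, withinTimeRange = 1.minute) {
          case _: IllegalArgumentException => Restart // wipe state, keep the mailbox
          case _: Exception                => Escalate
        }
      val worker = context.actorOf(Props[Worker], "worker")
      def receive = { case msg => worker forward msg }
    }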
1. AJAX uses a group of technologies including HTML, CSS, DOM, JavaScript, and XMLHttpRequest to asynchronously exchange data with a web server in the background without interfering with the display and behavior of the existing page.
2. The document discusses how AJAX works and the XMLHttpRequest object used to asynchronously exchange data with a web server. It provides examples of using AJAX for real-time validation and to retrieve up-to-date stock information from a database without reloading the page.
3. The key steps in an AJAX application are to create an XMLHttpRequest object, assign an onreadystatechange handler, open a request to the server, and send the request. The response is then
Struts 2 is an open source MVC framework based on Java EE standards. It uses a request-response pipeline where interceptors pre-process and post-process requests. The core components are interceptors, actions, and results. Interceptors provide functionality like validation while actions contain application logic and results define the response. Values are stored and accessed from a value stack using OGNL.
This document discusses the Ajaxian framework Prototype and its utilities for asynchronous JavaScript (Ajax). It provides an overview of Prototype's basic utilities for DOM manipulation and Ajax helpers. The Ajax helpers include an Ajax object that handles cross-browser XMLHttpRequests and an Ajax.Request method for making Ajax calls with configurable options and callbacks. An example is given showing how to make an Ajax request and specify a callback function using Ajax.Request.
Building Scalable Stateless Applications with RxJava by Rick Warren
RxJava is a lightweight open-source library, originally from Netflix, that makes it easy to compose asynchronous data sources and operations. This presentation is a high-level intro to this library and how it can fit into your application.
This document summarizes advanced Akka features presented by Martin Kanters and Johan Janssen. It covers local and remote actors, scheduling, clustering, routing, cluster singletons, sharding, persistence, Akka HTTP, and finite state machines. The presentation introduces these features and provides examples to illustrate how they can be used with Akka.
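Of those features, Akka HTTP is the quickest to sketch. A minimal endpoint, using the ActorMaterializer-era API that matches the presentation's timeframe; host and port are placeholders:

    import akka.actor.ActorSystem
    import akka.http.scaladsl.Http
    import akka.http.scaladsl.server.Directives._
    import akka.stream.ActorMaterializer

    object PingServer extends App {
      implicit val system = ActorSystem("http")
      implicit val mat = ActorMaterializer()

      // GET /ping answers "pong"; the route DSL composes directives.
      val route = path("ping") { get { complete("pong") } }
      Http().bindAndHandle(route, "localhost", 8080)
    }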
This document provides an overview of Lucene and Solr for developers. It discusses Lucene core components like IndexWriter and analyzers. It also covers Solr architecture and customizing Solr through extension points like query parsers, update processors, and search components. The document provides examples and recommendations for testing and exploring the Lucene/Solr code and documentation.
This document provides an agenda for an introduction to jQuery and jQuery UI. It begins with an overview of selectors, events, traversing, effects & animations, and Ajax in jQuery. It then discusses what jQuery is, its history, advantages over vanilla JavaScript, examples, and the noConflict() method. Next, it covers various selectors, traversing methods, filters, and events. The document concludes with discussions of manipulating HTML and CSS, animations, jQuery's Ajax methods, and integrating jQuery UI.
7.1 Identify which attribute scopes are thread-safe:
Local variables
Instance variables
Class variables
Request attributes
Session attributes
Context attributes
7.2 Identify correct statements about differences between the multithreaded and single-threaded servlet models.
7.3 Identify the interface used to declare that a servlet must use the single thread model.
apidays LIVE Australia 2020 - Building distributed systems on the shoulders o... by apidays
apidays LIVE Australia 2020 - Building Business Ecosystems
Building distributed systems on the shoulders of giants
Dasith Wijesiriwardena, Telstra Purple (Readify)
Over the past few years, web-applications have started to play an increasingly important role in our lives. We expect them to be always available and the data to be always fresh. This shift into the realm of real-time data processing is now transitioning to physical devices, and Gartner predicts that the Internet of Things will grow to an installed base of 26 billion units by 2020.
Reactive web-applications are an answer to the new requirements of high-availability and resource efficiency brought by this rapid evolution. On the JVM, a set of new languages and tools has emerged that enable the development of entirely asynchronous request and data handling pipelines. At the same time, container-less application frameworks are gaining increasing popularity over traditional deployment mechanisms.
This talk is going to give you an introduction to one of the most popular reactive web-application stacks on the JVM, involving the Scala programming language, the concurrency toolkit Akka and the web-application framework Play. It will show you how functional programming techniques enable asynchronous programming, and how those technologies help to build robust and resilient web-applications.
In this slideshare we introduce the basic concepts of simple REST applications with Python and present some examples (see our Github repository). In addition, we'll go under the hood to see how Hammock provides abstraction, and I'll also show simple benchmarks that measure the library overhead.
.NET Systems Programming Learned the Hard Way.pptx by petabridge
This document discusses .NET systems programming and garbage collection. It covers garbage collection generations, modes, and considerations for minimizing allocations. It also discusses eliminating delegates to reduce allocations, using value types appropriately, avoiding empty collections, and optimizing for thread locality to reduce context switching overhead. Data structures and synchronization techniques are discussed, emphasizing the importance of choosing lock-free data structures when possible to improve performance.
Scala Programming for Semantic Web Developers ESWC Semdev2015 by Jean-Paul Calbimonte
Scalable and Reactive Programming for Semantic Web Developers discusses using Scala for semantic web development. Key points include:
- Scala allows for more concise RDF code compared to Java through features like type inference and implicit parameters.
- The actor model and futures in Scala enable asynchronous and reactive programming for RDF streams and SPARQL queries (see the sketch after this list).
- OWL API reasoning with ontologies can be done more clearly in Scala through implicit classes that simplify common operations.
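As referenced above, a hedged sketch of the futures point; runSparql is a hypothetical stand-in for a real SPARQL client call:

    import scala.concurrent.{ExecutionContext, Future}

    // Hypothetical blocking query function standing in for an actual client.
    def runSparql(query: String): Seq[Map[String, String]] =
      Seq(Map("s" -> "ex:subject")) // canned result for the sketch

    // Wrap the blocking call so results compose asynchronously,
    // off the caller's thread.
    def countResults(query: String)(implicit ec: ExecutionContext): Future[Int] =
      Future(runSparql(query)).map(_.size)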
The document discusses the challenges and rewards of developing a large JavaScript application with over 120,000 lines of code. It describes the Xopus XML editor framework, including its object-oriented architecture, class loading system, and techniques for improving performance like asynchronous execution and lazy loading of components. The framework aims to provide an application development structure for JavaScript applications in a similar way that Java and C# frameworks work.
This document provides an overview of Durable Functions, which allow for implementing stateful workflows in Azure Functions. Durable Functions enable writing stateful workflows as code by using orchestrator functions to coordinate asynchronous activity functions. Key components include orchestrator functions that call and coordinate stateless activity functions, and an orchestration client to start and manage orchestrations. Patterns demonstrated include function chaining, fan-out/fan-in, and using an orchestration client to start and monitor orchestrations.
AJAX is an acronym standing for Asynchronous JavaScript and XML and this technology helps us to load data from the server without a browser page refresh.
If you are new with AJAX, I would recommend you go through our Ajax Tutorial before proceeding further.
JQuery is a great tool which provides a rich set of AJAX methods to develop next generation web application.
Discover how data annotation services are powering accuracy, safety, and efficiency in AI-driven manufacturing systems.
Precision in data labeling = Precision on the production floor.
Quantum Computing Quick Research Guide by Arthur MorganArthur Morgan
This is a Quick Research Guide (QRG).
QRGs include the following:
- A brief, high-level overview of the QRG topic.
- A milestone timeline for the QRG topic.
- Links to various free online resource materials to provide a deeper dive into the QRG topic.
- Conclusion and a recommendation for at least two books available in the SJPL system on the QRG topic.
QRGs planned for the series:
- Artificial Intelligence QRG
- Quantum Computing QRG
- Big Data Analytics QRG
- Spacecraft Guidance, Navigation & Control QRG (coming 2026)
- UK Home Computing & The Birth of ARM QRG (coming 2027)
Any questions or comments?
- Please contact Arthur Morgan at [email protected].
100% human made.
TrsLabs - Fintech Product & Business ConsultingTrs Labs
Hybrid Growth Mandate Model with TrsLabs
Strategic Investments, Inorganic Growth, Business Model Pivoting are critical activities that business don't do/change everyday. In cases like this, it may benefit your business to choose a temporary external consultant.
An unbiased plan driven by clearcut deliverables, market dynamics and without the influence of your internal office equations empower business leaders to make right choices.
Getting things done within a budget within a timeframe is key to Growing Business - No matter whether you are a start-up or a big company
Talk to us & Unlock the competitive advantage
2. What is Paytm Labs and Paytm?
• Paytm Labs is a data-driven lab tackling hard problems in fraud, recommendations, ratings, and platforms for Paytm.
• Paytm is the world's fastest-growing mobile-first marketplace and payment ecosystem, serving over 100 million people who make over 1.5 million transactions, representing $1.7 billion of goods and services exchanged annually.
3. What is Akka?
• Akka (http://akka.io/):
• “Akka is a toolkit and runtime for building highly concurrent, distributed, and resilient message-driven applications on the JVM.”
• Packages: “akka-actor”, “akka-remote”, “akka-cluster”, “akka-persistence”, “akka-http”, and “akka-stream”.
4. What is Cassandra?
• Cassandra (http://cassandra.apache.org/):
• “The Apache Cassandra database is the right choice when you need scalability and high availability without compromising performance.”
5. What is Spray?
• Spray (http://spray.io/):
• “Spray is an open-source toolkit for building REST/HTTP-based integration layers on top of Scala and Akka.”
• Packages: “spray-caching”, “spray-can”, “spray-http”, “spray-httpx”, “spray-io”, “spray-json”, “spray-routing”, and “spray-servlet”.
6. What is Maquette?
• A real-time fraud rule engine that lets core operational platforms evaluate fraud through synchronous calls.
• Its core technologies are Akka, Cassandra, and Spray.
7. Why Akka, Cassandra, and Spray?
• Akka, Cassandra, and Spray are highly performant and developer-friendly, treat failure as a first-class concept, and provide strong clustering support, giving the responsiveness, resilience, and elasticity needed to build Reactive Systems.
10. HTTP Layer
• Utilize Spray-Can for a fast HTTP endpoint.
• Utilize Jackson for JSON serialization/deserialization.
• Utilize a separate dispatcher for the Bulkhead Pattern.
• Expose a normalized yet flexible schema for integration.
• Request handling options, ordered from worst to best (the Ask Pattern is sketched below):
• Cameo Pattern (per-request actor),
• Ask Pattern (Future),
• RequestHandlerPool (Akka Router Pool).
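A minimal sketch of the Ask-Pattern option, combined with a bulkhead dispatcher. The HttpApiActor, the handler actor, and the "maquette-http-dispatcher" config path are hypothetical stand-ins, not Maquette's actual code; the RequestHandlerPool variant would route to a pool of workers instead of a single handler.

import akka.actor.ActorRef
import akka.pattern.ask
import akka.util.Timeout
import scala.concurrent.duration._
import spray.routing.HttpServiceActor

class HttpApiActor(handler: ActorRef) extends HttpServiceActor {

  // Bulkhead Pattern: route handling runs on its own dispatcher (which must
  // be defined in application.conf) so slow evaluations cannot starve the
  // Spray-Can I/O threads.
  implicit val ec = context.system.dispatchers.lookup("maquette-http-dispatcher")
  implicit val askTimeout = Timeout(500.millis)

  def receive = runRoute {
    path("evaluate") {
      post {
        extract(_.request.entity.asString) { body =>
          // Ask Pattern: the handler's reply completes the HTTP response.
          complete((handler ? body).mapTo[String])
        }
      }
    }
  }
}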
12. Environment Layer
• A tree of actors responsible for managing a cache or pool of the Contexts and Dependencies required to evaluate incoming requests.
• A Context is a Document Message which wraps configurations for evaluating requests.
• A Dependency is a Document Message which wraps optimized queries to Cassandra.
13. Environment Layer
• Map incoming requests to a Context by forking a template with .copy() (see the sketch below).
• Forward the forked Context to the Executor Layer in the same or a different JVM with an Akka Router.
• Consider implementing a custom router that favours locality of execution on the same JVM until responsiveness requires distribution.
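A minimal sketch of this fork-and-forward flow, with hypothetical FraudRequest and Context shapes standing in for Maquette's richer Document Messages:

import akka.actor.{Actor, ActorRef}

case class FraudRequest(id: String, payload: Map[String, String])
case class Context(name: String, rules: Seq[String], requestId: String = "")

class EnvironmentActor(executorRouter: ActorRef) extends Actor {
  // The cached template is immutable; copy() forks it cheaply per request.
  val contextTemplate = Context("default", Seq("velocity", "geolocation"))

  def receive = {
    case request: FraudRequest =>
      val forked = contextTemplate.copy(requestId = request.id)
      // forward preserves the original sender for the eventual response.
      executorRouter forward forked
  }
}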
14. Environment Layer
• Always pre-compute and pre-optimize the Environment Layer as a whole.
• Allow Contexts to be pre-computed and updated remotely.
• Ensure Contexts and Dependencies are designed for optimization by allowing arithmetic reduction or sorts.
• Prefer fronting an EnvironmentActor with a ProxyActor and a StateActor so the whole environment is cached and can be recovered from failures fast.
15. Environment Layer
type EnvironmentStateActorRefFactory =
  (EnvironmentProxyActorContext, EnvironmentProxyActorSelf) => ActorRef
type EnvironmentActorRefFactory =
  (EnvironmentProxyActorContext, EnvironmentProxyActorSelf) => ActorRef

// Factories are injected so tests can substitute probes for the children.
class EnvironmentProxyActor(
  environmentStateActorRefFactory: EnvironmentStateActorRefFactory,
  environmentActorRefFactory: EnvironmentActorRefFactory
) extends Actor with ActorLogging {

  val environmentStateActorRef = environmentStateActorRefFactory(context, self)
  val environmentActorRef = environmentActorRefFactory(context, self)

  // The handlers below are defined elsewhere; receive composes them as
  // partial functions, tried in order.
  override def receive: Receive =
    receiveEnvironmentState orElse
    receiveFraudRequest orElse
    receiveEnvironmentLocalCommand orElse
    receiveEnvironmentRemoteCommand
}
18. Executor Layer
• A pipeline of actors responsible for scheduling the Tasks defined within a Context with the specified Dependencies, executing those Tasks, and coordinating their results into a response.
• A Task is an optimized set of executable rules.
19. Executor Layer
• Ideally, the Executor Layer should be stateless to allow easy recovery from failures.
• Ideally, keep the Executor Layer available across the cluster.
20. Executor Layer
type ExecutorRouterActorRefFactory =
  (ExecutorActorContext, ExecutorActorSelf) => ActorRef
type ExecutorCoordinatorActorRefFactory =
  (ExecutorActorContext, ExecutorActorSender, ExecutorActorNext,
   MaquetteContext, Timeout) => ActorRef

class ExecutorActor(
  executorRouterActorRefFactory: ExecutorRouterActorRefFactory,
  executorCoordinatorActorRefFactory: ExecutorCoordinatorActorRefFactory,
  actionActorRef: ActorRef
) extends Actor with ActorLogging {

  import ExecutorActor._
  import ExecutorSchedulerStrategy._

  val executorRouterActorRef: ActorRef = executorRouterActorRefFactory(context, self)

  // Handlers are defined elsewhere and composed as partial functions.
  override def receive: Receive =
    receiveMaquetteContext orElse
    receiveMaquetteResult

  object ExecutorSchedulerStrategy {
    def scheduleExecution(maquetteContext: MaquetteContext): Unit = { ... }
  }
}
21. Executor Layer
• Design a Task as a functional and monadic data structure (see the sketch below).
• Functional programming lets a Task isolate side effects from pure functions.
• As a monad, a Task gains composition and reduction properties that make it easy to optimize and highly parallelizable.
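Since the actual Task type is proprietary, here is a generic sketch of what a monadic, side-effect-isolating Task can look like; the structure, not the names, is the point:

final case class Task[A](run: () => A) {
  // Composition stays pure: nothing executes until run() is invoked.
  def map[B](f: A => B): Task[B] = Task(() => f(run()))
  def flatMap[B](f: A => Task[B]): Task[B] = Task(() => f(run()).run())
}

object Task {
  def pure[A](a: A): Task[A] = Task(() => a)
}

val program = for {
  a <- Task.pure(2)
  b <- Task.pure(3)
} yield a + b

println(program.run()) // prints 5; effects happen only at the edge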
22. Executor Layer
case class Query(
  selectComponent: Select, fromComponent: From, whereComponent: Where
) {
  // Merge two queries into one that selects the union of their columns.
  def + (that: Query): Query =
    this.copy(selectComponent =
      Select(this.selectComponent.columnNames union that.selectComponent.columnNames))

  // Remove the overlap, keeping only the columns unique to this query.
  def - (that: Query): Query =
    this.copy(selectComponent =
      Select(this.selectComponent.columnNames diff that.selectComponent.columnNames))
}
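A hypothetical usage of the operators above, assuming Select wraps a Set[String] of column names and From wraps a table name (both assumptions): overlapping queries reduce to a single Cassandra read.

val q1 = Query(Select(Set("amount", "merchant")), From("txns"), Where.Empty)
val q2 = Query(Select(Set("amount", "country")), From("txns"), Where.Empty)

val merged = q1 + q2 // selects amount, merchant, and country in one query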
Note: An example of a Rule object is not shown as it is a trade secret.
23. Executor Layer
• For a Task object, consider an external DSL interpreted into executable, immutable graphs or even Java byte code.
• Scala Parser Combinators: https://github.com/scala/scala-parser-combinators
• Parboiled2: https://github.com/sirthias/parboiled2
• ANTLR: http://www.antlr.org/
24. Executor Layer
object QueryParser extends JavaTokenParsers {

  def parseQuery(queryString: String): Try[Query] = {
    parseAll(queryStatement, queryString) ...
  }

  object QueryGrammar {
    // select ... from ... [where ...] ;
    lazy val queryStatement: Parser[Query] =
      selectClause ~ fromClause ~ opt(whereClause) ~ ";" ^^ {
        case selectComponent ~ fromComponent ~ whereComponent ~ ";" =>
          Query(selectComponent, fromComponent, whereComponent.getOrElse(Where.Empty))
      }
  }

  object SelectGrammar { ... }
  object FromGrammar { ... }
  object WhereGrammar { ... }
  object StaticClauseGrammar { ... }
  object DynamicClauseGrammar { ... }
  object InterpolationTypeGrammar { ... }
  object DataTypeGrammar { ... }
  object LexicalGrammar { ... }
}
Note: An example of a Rule parser is not shown as it is a trade secret.
25. Abstracting Concurrency for High Parallelism Tasks
• Scala Futures.
• Scala Parallel Collections.
• Akka Router Pool.
• Akka Streams.
26. Scala Futures
• “A Future is an object holding a value which may become available at some point.”

import scala.concurrent.Future
import scala.concurrent.ExecutionContext.Implicits.global

// Futures compose monadically; the guard (if) fails the Future with a
// NoSuchElementException when the predicate does not hold.
val f = for {
  a <- Future(10 / 2)
  b <- Future(a + 1)
  c <- Future(a - 1)
  if c > 3
} yield b * c

f foreach println // prints 24
27. Scala Futures
• Advantages: Efficient, Highly Parallel, Simple Monadic Abstraction.
• Disadvantages: Lacks Communication, Lacks Low-Level Concurrency Control, JVM-Bound.
• Note: Monadic Futures enqueue all operations to the ExecutionContext ⇒ lack of control over context switching.
28. Scala Parallel Collections
• Scala Parallel Collections is a package in the Scala standard library which allows collections to execute operations in parallel.

// Find palindromic numbers below 100 000, filtering in parallel.
(0 until 100000).par
  .filter(x => x.toString == x.toString.reverse)
29. Scala Parallel Collections
• Advantages: Very Efficient, Highly Parallel, Control of Parallelism Level.
• Disadvantages: Lacks Communication; Order-dependent Operations such as foldLeft() Do Not Parallelize (prefer aggregate(); see the sketch below); Non-determinism and Side-Effect Issues at this Degree of Abstraction; JVM-Bound.
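To make the foldLeft() limitation concrete: foldLeft's signature forces left-to-right evaluation, while aggregate() also takes a combine operator so chunks can be folded in parallel and merged.

val xs = (1 to 1000000).par

// foldLeft runs sequentially even on a parallel collection:
val s1 = xs.foldLeft(0L)(_ + _)

// aggregate folds each chunk with seqop, then merges partial sums with combop:
val s2 = xs.aggregate(0L)(_ + _, _ + _)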
30. Akka Router Pool
• An Akka Router Pool maintains a pool of child actors to which it forwards messages.
• Configured with an appropriate dispatcher, mailbox, supervisor, and routing logic, an Akka Router Pool provides a highly parallel yet elastic construct for executing tasks.
31. Akka Router Pool
// Restart any failed worker; the pool size and routing logic come from
// configuration via FromConfig.
val routerSupervisionStrategy = OneForOneStrategy() {
  case _ => SupervisorStrategy.Restart
}

val routerPool = FromConfig.withSupervisorStrategy(routerSupervisionStrategy)

// DispatcherConfigPath, RouterName, and accessLayer are defined elsewhere.
val routerProps = routerPool.props(
  ExecutorWorkerActor.props(accessLayer).withDispatcher(DispatcherConfigPath)
)

context.actorOf(
  props = routerProps,
  name = RouterName
)
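FromConfig expects the pool's definition to live in configuration. A minimal sketch, assuming the router is created at /user/executor/router; the path, router type, and pool size are illustrative, not Maquette's actual settings:

import akka.actor.ActorSystem
import com.typesafe.config.ConfigFactory

val config = ConfigFactory.parseString(
  """
  akka.actor.deployment {
    /executor/router {
      router = round-robin-pool
      nr-of-instances = 8
    }
  }
  """).withFallback(ConfigFactory.load())

val system = ActorSystem("maquette", config)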
33. Akka Router Pool
• Disadvantages:
• Complex optimizations or implementation required.
• Actors with state potentially lead to issues regarding
mutability and lack of idempotence.
• Actors which require communication beyond parent-child
trees lead to potentially complex graphs.
34. Akka Streams
• “Akka Streams is an implementation of Reactive Streams, which is a standard for asynchronous stream processing with non-blocking backpressure.”

import akka.actor.ActorSystem
import akka.stream.ActorMaterializer
import akka.stream.scaladsl.{Sink, Source}

implicit val system = ActorSystem("reactive-tweets")
implicit val materializer = ActorMaterializer()

// tweets (a Source[Tweet, Unit]) and the akka hashtag value are assumed
// defined, as in the Akka Streams documentation example.
val authors: Source[Author, Unit] =
  tweets
    .filter(_.hashtags.contains(akka))
    .map(_.author)

authors.runWith(Sink.foreach(println))
35. Akka Streams
• Advantages: Backpressure and Failure as First-class Concepts, Concurrency Control, Simple Monadic Abstraction, Graph API, Bi-directional Channels.
• Disadvantages: Too New = Risk for Production.
• Current: JVM-Bound; Potentially: Distributed Streaming.
• Current: No Graph Optimization; Potentially: Macro-based Optimization.
36. Maquette Performance
• With 10 Cassandra nodes, 4 Maquette nodes, and HAProxy as a staging environment: ~40,000 requests per second at a 10 ms mean response time while evaluating 50 rules.
37. Tips
• Investigate Akka Streams for Akka HTTP.
• Investigate CPU usage and memory consumption with YourKit or VisualVM, plus Eclipse MAT.
• Utilize Kamon for real-time metrics to StatsD or a third-party service like Datadog.
• If implementing a DSL or a complex actor-based graph, remember to utilize ScalaTest and Akka TestKit properly.
• Utilize Gatling.io for load and scenario-based testing.
38. Tips
• We used Cassandra 2.1.6 as the main data store for Maquette and experienced many operational pains with it.
• Mastering Apache Cassandra (2nd Edition): http://www.amazon.com/Mastering-Apache-Cassandra-Second-Edition-ebook/dp/B00VAG2WZO
39. Tips
• Investigate the Play Framework with Akka Cluster to create a web application for operations:
• Commands to operate instances in the cluster.
• Commands to configure instances in real time.
• A GUI for data scientists and business analysts to easily define and configure rules.
40. Tips
• Utilize Kafka to publish audits, which can be used to monitor rules through a Logstash, Elasticsearch, and Kibana flow and archived in HDFS (see the sketch below).
• Consider Kafka to replay audits as requests, running the real-time engine offline to tune rules.
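A minimal sketch of the publishing side, using the plain Kafka Java producer; the topic name and JSON payload are hypothetical:

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

val props = new Properties()
props.put("bootstrap.servers", "localhost:9092")
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

val producer = new KafkaProducer[String, String](props)

// Key by request id so all audits for a request land in the same partition.
def publishAudit(requestId: String, auditJson: String): Unit =
  producer.send(new ProducerRecord("maquette-audits", requestId, auditJson))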
41. Resources
• The Reactive Manifesto: http://www.reactivemanifesto.org/
• Reactive Messaging Patterns with the Actor Model: http://www.amazon.ca/Reactive-Messaging-Patterns-Actor-Model/dp/0133846830
• Learning Concurrent Programming in Scala: http://www.amazon.com/Learning-Concurrent-Programming-Aleksandar-Prokopec/dp/1783281413
• Akka Concurrency: http://www.amazon.ca/Akka-Concurrency-Derek-Wyatt/dp/0981531660