WASP is a framework for developing big data pipelines that combine streaming analytics, multi-model storage, and machine learning models, all operating in real time.
TensorFlow & TensorFrames w/ Apache Spark, presented by Marco Saviano, discusses numerical computing with Apache Spark and Google TensorFlow. TensorFrames allows manipulating Spark DataFrames with TensorFlow programs and provides most operations in row-based and block-based versions: row-based operations process rows individually, while block-based operations process blocks of rows together for better efficiency. Reduction operations coalesce rows until a single row remains. Future work may improve communication between Spark and TensorFlow through direct memory copying and columnar storage formats.
Assessing Graph Solutions for Apache Spark – Databricks
Users have several options for running graph algorithms with Apache Spark. To support a graph data architecture on top of its linearly oriented DataFrames, the Spark platform offers GraphFrames. However, because GraphFrames are immutable and not a native graph, they may not offer the features or performance that certain use cases need. Another option is to connect Spark to a real-time, scalable, distributed native graph database such as TigerGraph.
In this session, we compare three options — GraphX, Cypher for Apache Spark, and TigerGraph — for different types of workload requirements and data sizes, to help users select the right solution for their needs. We also look at the data transfer and loading time for TigerGraph.
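As a hedged illustration of the GraphFrames option mentioned above (my addition, not material from the session), the following PySpark sketch builds a tiny invented graph and runs PageRank on it; it assumes the graphframes Spark package is on the classpath.

```python
from pyspark.sql import SparkSession
from graphframes import GraphFrame  # requires the graphframes Spark package

spark = SparkSession.builder.appName("graphframes-sketch").getOrCreate()

# invented toy graph: vertices need an `id` column, edges need `src` and `dst`
vertices = spark.createDataFrame(
    [("a", "Alice"), ("b", "Bob"), ("c", "Carol")], ["id", "name"])
edges = spark.createDataFrame(
    [("a", "b", "follows"), ("b", "c", "follows"), ("c", "a", "follows")],
    ["src", "dst", "relationship"])

g = GraphFrame(vertices, edges)

# graph algorithms run as ordinary DataFrame jobs over the immutable graph
g.inDegrees.show()
ranks = g.pageRank(resetProbability=0.15, maxIter=10)
ranks.vertices.select("id", "pagerank").show()
```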
Spark Summit EU talk by Sameer Agarwal – Spark Summit
This document discusses Project Tungsten, which aims to substantially improve the memory and CPU efficiency of Spark. It describes how Spark has optimized IO but the CPU has become the bottleneck. Project Tungsten focuses on improving execution performance through techniques like explicit memory management, code generation, cache-aware algorithms, whole-stage code generation, and columnar in-memory data formats. It shows how these techniques provide significant performance improvements, such as 5-30x speedups on operators and 10-100x speedups on radix sort. Future work includes cost-based optimization and improving performance on many-core machines.
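As a small, hedged illustration of how whole-stage code generation surfaces in practice (my addition, not part of the talk), the PySpark snippet below builds a simple filter-and-aggregate job and prints its physical plan; operators fused by whole-stage codegen appear with a `*(n)` prefix in the output.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("wholestage-codegen-sketch").getOrCreate()

df = spark.range(0, 10_000_000)  # a single `id` column generated on the fly

agg = (df.filter(F.col("id") % 7 == 0)
         .groupBy((F.col("id") % 100).alias("bucket"))
         .count())

# operators fused by whole-stage code generation show up with a `*(n)` prefix
agg.explain()
```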
Deep-Dive into Deep Learning Pipelines with Sue Ann Hong and Tim Hunter – Databricks
Deep learning has shown tremendous successes, yet it often requires a lot of effort to leverage its power. Existing deep learning frameworks require writing a lot of code to run a model, let alone in a distributed manner. Deep Learning Pipelines is a Spark Package that makes practical deep learning simple, building on the Spark MLlib Pipelines API. By leveraging Spark, Deep Learning Pipelines scales out many compute-intensive deep learning tasks. In this talk we dive into:
– the various use cases of Deep Learning Pipelines, such as prediction at massive scale, transfer learning, and hyperparameter tuning, many of which can be done in just a few lines of code;
– how to work with complex data such as images in Spark and Deep Learning Pipelines;
– how to deploy deep learning models through familiar Spark APIs such as MLlib and Spark SQL to empower everyone from machine learning practitioners to business analysts.
Finally, we discuss integration with popular deep learning frameworks.
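A minimal transfer-learning sketch in the spirit of the Deep Learning Pipelines examples, assuming Spark 2.x with the sparkdl Spark Package installed; the image directory paths and labels are placeholders, not from the talk.

```python
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.image import ImageSchema        # Spark 2.3/2.4 image reader
from pyspark.sql.functions import lit
from sparkdl import DeepImageFeaturizer         # from the sparkdl Spark Package

# hypothetical image directories; a label is assigned per directory
tulips = ImageSchema.readImages("/data/flowers/tulips").withColumn("label", lit(1.0))
daisies = ImageSchema.readImages("/data/flowers/daisy").withColumn("label", lit(0.0))
train = tulips.unionAll(daisies)

# reuse a pre-trained InceptionV3 network as a fixed feature extractor,
# then fit a simple logistic regression on top (transfer learning)
featurizer = DeepImageFeaturizer(inputCol="image", outputCol="features",
                                 modelName="InceptionV3")
lr = LogisticRegression(maxIter=20, regParam=0.05, labelCol="label")
model = Pipeline(stages=[featurizer, lr]).fit(train)
```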
Apache Spark AI Use Case in Telco: Network Quality Analysis and Prediction wi... – Databricks
In this talk, we will present how we analyze, predict, and visualize network quality data as a Spark AI use case at a telecommunications company. SK Telecom is the largest wireless telecommunications provider in South Korea, with 300,000 cells and 27 million subscribers. These 300,000 cells generate data every 10 seconds, totaling 60 TB and 120 billion records per day.
To address earlier problems with Spark on HDFS, we developed a new data store for SparkSQL, built on Redis and RocksDB, that lets us distribute and store these data in real time and analyze them immediately. Not satisfied with analyzing network quality in real time, we also set out to predict network quality in the near future so that we can quickly detect and recover from network device failures, by designing a network-signal-pattern-aware DNN model and a new in-memory data pipeline from Spark to TensorFlow.
In addition, by integrating Apache Livy and MapboxGL with SparkSQL and our new store, we have built a geospatial visualization system that shows the current population and signal strength of 300,000 cells on a map in real time.
Foundations for Scaling ML in Apache Spark by Joseph Bradley at BigMine16 – BigMine
Apache Spark has become the most active open source Big Data project, and its Machine Learning library MLlib has seen rapid growth in usage. A critical aspect of MLlib and Spark is the ability to scale: the same code used on a laptop can scale to 100’s or 1000’s of machines. This talk will describe ongoing and future efforts to make MLlib even faster and more scalable by integrating with two key initiatives in Spark. The first is Catalyst, the query optimizer underlying DataFrames and Datasets. The second is Tungsten, the project for approaching bare-metal speeds in Spark via memory management, cache-awareness, and code generation. This talk will discuss the goals, the challenges, and the benefits for MLlib users and developers. More generally, we will reflect on the importance of integrating ML with the many other aspects of big data analysis.
About MLlib: MLlib is a general Machine Learning library providing many ML algorithms, feature transformers, and tools for model tuning and building workflows. The library benefits from integration with the rest of Apache Spark (SQL, streaming, Graph, core), which facilitates ETL, streaming, and deployment. It is used in both ad hoc analysis and production deployments throughout academia and industry.
Spark Summit EU 2015: Combining the Strengths of MLlib, scikit-learn, and R – Databricks
This talk discusses integrating common data science tools like Python pandas, scikit-learn, and R with MLlib, Spark’s distributed Machine Learning (ML) library. Integration is simple; migration to distributed ML can be done lazily; and scaling to big data can significantly improve accuracy. We demonstrate integration with a simple data science workflow. Data scientists often encounter scaling bottlenecks with single-machine ML tools. Yet the overhead in migrating to a distributed workflow can seem daunting. In this talk, we demonstrate such a migration, taking advantage of Spark and MLlib’s integration with common ML libraries. We begin with a small dataset which runs on a single machine. Increasing the size, we hit bottlenecks in various parts of the workflow: hyperparameter tuning, then ETL, and eventually the core learning algorithm. As we hit each bottleneck, we parallelize that part of the workflow using Spark and MLlib. As we increase the dataset and model size, we can see significant gains in accuracy. We end with results demonstrating the impressive scalability of MLlib algorithms. With accuracy comparable to traditional ML libraries, combined with state-of-the-art distributed scalability, MLlib is a valuable new tool for the modern data scientist.
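As a hedged sketch of the kind of migration the talk describes (the tiny dataset is invented and the parallelism parameter assumes Spark 2.3+), here the hyperparameter-tuning bottleneck is the part being parallelized: a single-machine grid search is replaced by MLlib's CrossValidator, which trains and evaluates parameter combinations across the cluster.

```python
import random
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

spark = SparkSession.builder.appName("tuning-sketch").getOrCreate()

# invented toy data standing in for a dataset migrated from pandas/scikit-learn
rows = [(random.random(), random.random(), float(i % 2)) for i in range(40)]
train = spark.createDataFrame(rows, ["x1", "x2", "label"])

assembler = VectorAssembler(inputCols=["x1", "x2"], outputCol="features")
lr = LogisticRegression(maxIter=20)
pipeline = Pipeline(stages=[assembler, lr])

grid = (ParamGridBuilder()
        .addGrid(lr.regParam, [0.01, 0.1])
        .addGrid(lr.elasticNetParam, [0.0, 0.5])
        .build())

# each parameter combination is fitted and evaluated on the cluster;
# `parallelism` (Spark 2.3+) controls how many models are trained at once
cv = CrossValidator(estimator=pipeline, estimatorParamMaps=grid,
                    evaluator=BinaryClassificationEvaluator(),
                    numFolds=3, parallelism=2)
model = cv.fit(train)
```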
Ehtsham Elahi, Senior Research Engineer, Personalization Science and Engineer... – MLconf
Spark and GraphX in the Netflix Recommender System: We at Netflix strive to deliver maximum enjoyment and entertainment to our millions of members across the world. We do so by having great content and by constantly innovating on our product. A key strategy to optimize both is to follow a data-driven method. Data allows us to find optimal approaches to applications such as content buying or our renowned personalization algorithms. But, in order to learn from this data, we need to be smart about the algorithms we use, how we apply them, and how we can scale them to our volume of data (over 50 million members and 5 billion hours streamed over three months). In this talk we describe how Spark and GraphX can be leveraged to address some of our scale challenges. In particular, we share insights and lessons learned on how to run large probabilistic clustering and graph diffusion algorithms on top of GraphX, making it possible to apply them at Netflix scale.
Spark Summit EU 2015: Reynold Xin Keynote – Databricks
This document summarizes Spark's development over the past 12 months and provides a look ahead. It discusses improvements to both the frontend, such as DataFrames and machine learning pipelines, and the backend through projects like Tungsten for performance optimizations. Going forward, it mentions new features like the Dataset API, streaming DataFrames, and potential hardware improvements from technologies like 3D XPoint memory. The overall goal is to provide a unified engine and APIs that can automatically optimize analytics workloads across languages and domains.
Apache Spark for Machine Learning with High Dimensional Labels: Spark Summit ... – Spark Summit
This talk will cover the tools we used, the hurdles we faced, and the workarounds we developed, with help from Databricks support, in our attempt to build a custom machine learning model and use it to predict TV ratings for different networks and demographics.
The Apache Spark machine learning and DataFrame APIs make it incredibly easy to produce a machine learning pipeline that solves an archetypal supervised learning problem. In our applications at Cadent, we face a challenge with high-dimensional labels and relatively low-dimensional features; at first pass such a problem is all but intractable, but thanks to a large number of historical records and the tools available in Apache Spark, we were able to construct a multi-stage model capable of forecasting with sufficient accuracy to drive the business application.
Over the course of our work we have come across many tools that made our lives easier, and others that forced workarounds. In this talk we will review our custom multi-stage methodology, review the challenges we faced, and walk through the key steps that made our project successful.
Stream, Stream, Stream: Different Streaming Methods with Apache Spark and Kafka – Databricks
At NMC (Nielsen Marketing Cloud) we provide our customers (marketers and publishers) real-time analytics tools to profile their target audiences. To achieve that, we need to ingest billions of events per day into our big data stores, and we need to do it in a scalable yet cost-efficient manner.
In this session, we will discuss how we continuously transform our data infrastructure to support these goals. Specifically, we will review how we went from CSV files and standalone Java applications all the way to multiple Kafka and Spark clusters, performing a mixture of streaming and batch ETLs, and supporting 10x data growth. We will share our experience as early adopters of Spark Streaming and Spark Structured Streaming, and how we overcame technical barriers (and there were plenty). We will present a rather unique solution of using Kafka to imitate streaming over our Data Lake, while significantly reducing our cloud services’ costs. Topics include the following (a minimal Kafka-to-Structured-Streaming sketch follows the list):
Kafka and Spark Streaming for stateless and stateful use-cases
Spark Structured Streaming as a possible alternative
Combining Spark Streaming with batch ETLs
“Streaming” over Data Lake using Kafka
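A minimal Structured Streaming sketch of Kafka ingestion along the lines discussed above; the broker address, topic name, and keying scheme are placeholders, and the spark-sql-kafka connector package is assumed to be available.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# assumes the spark-sql-kafka-0-10 connector package is on the classpath
spark = SparkSession.builder.appName("kafka-ingest-sketch").getOrCreate()

# hypothetical broker address and topic name
raw = (spark.readStream.format("kafka")
       .option("kafka.bootstrap.servers", "broker1:9092")
       .option("subscribe", "page_views")
       .option("startingOffsets", "latest")
       .load())

events = raw.select(F.col("key").cast("string").alias("key"),
                    F.col("value").cast("string").alias("value"),
                    "timestamp")

# a simple stateful aggregation: windowed event counts per key
counts = (events
          .withWatermark("timestamp", "10 minutes")
          .groupBy(F.window("timestamp", "5 minutes"), "key")
          .count())

query = (counts.writeStream
         .outputMode("update")
         .format("console")
         .start())
query.awaitTermination()
```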
Snorkel: Dark Data and Machine Learning with Christopher Ré – Jen Aman
Building applications that can read and analyze a wide variety of data may change the way we do science and make business decisions. However, building such applications is challenging: real world data is expressed in natural language, images, or other “dark” data formats which are fraught with imprecision and ambiguity and so are difficult for machines to understand. This talk will describe Snorkel, whose goal is to make routine Dark Data and other prediction tasks dramatically easier. At its core, Snorkel focuses on a key bottleneck in the development of machine learning systems: the lack of large training datasets. In Snorkel, a user implicitly creates large training sets by writing simple programs that label data, instead of performing manual feature engineering or tedious hand-labeling of individual data items. We’ll provide a set of tutorials that will allow folks to write Snorkel applications that use Spark.
Snorkel is open source on GitHub and available from snorkel.stanford.edu.
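The talk targets the original Stanford release; as a hedged, illustrative sketch of the same "programs that label data" idea, here is how labeling functions look in the modern snorkel PyPI package (toy texts and heuristics are invented; the package also provides appliers for running labeling functions over Spark data).

```python
import pandas as pd
from snorkel.labeling import labeling_function, PandasLFApplier
from snorkel.labeling.model import LabelModel

SPAM, HAM, ABSTAIN = 1, 0, -1

@labeling_function()
def lf_contains_link(x):
    # weak heuristic: messages with links tend to be spam
    return SPAM if "http" in x.text.lower() else ABSTAIN

@labeling_function()
def lf_very_short(x):
    # another weak heuristic: very short messages tend to be ham
    return HAM if len(x.text.split()) < 4 else ABSTAIN

df = pd.DataFrame({"text": [
    "check out http://deals.example now",
    "thanks, see you tomorrow",
    "win big at http://casino.example",
    "ok",
]})

# apply the labeling functions, then combine their noisy votes into
# probabilistic training labels with the label model
L_train = PandasLFApplier([lf_contains_link, lf_very_short]).apply(df)
label_model = LabelModel(cardinality=2, verbose=False)
label_model.fit(L_train, n_epochs=200)
probs = label_model.predict_proba(L_train)
```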
Spark Summit EU talk by Ram Sriharsha and Vlad Feinberg – Spark Summit
This document summarizes an online machine learning framework built on Structured Streaming that is being developed for Apache Spark. Some key points (a small streaming-aggregation sketch follows the list):
- It allows machine learning algorithms to be applied continuously to streaming data and update models incrementally in an online fashion.
- Models are updated every time interval (e.g. every second) based on new data within that interval. This provides an approximation of processing all data to date.
- It uses a stateful aggregation approach to allow models to be updated and merged across distributed partitions in a way that is deterministic but not necessarily commutative.
- APIs are provided for common online learning algorithms like online logistic regression and gradient descent to interface with streaming data sources and sinks.
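The framework itself is not part of open-source Spark, but the underlying idea of incrementally updating state on every trigger can be illustrated with a plain Structured Streaming aggregation; this hedged sketch maintains running per-key counts and means over the built-in rate source (the keying scheme is invented).

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("online-agg-sketch").getOrCreate()

# the built-in rate source emits (timestamp, value) rows; derive a fake key
events = (spark.readStream.format("rate")
          .option("rowsPerSecond", 100).load()
          .withColumn("key", F.col("value") % 10))

# per-key running count and mean, updated on every trigger interval
stats = events.groupBy("key").agg(F.count("*").alias("n"),
                                  F.avg("value").alias("mean"))

query = (stats.writeStream
         .outputMode("update")          # emit only keys whose state changed
         .format("console")
         .trigger(processingTime="1 second")
         .start())
query.awaitTermination()
```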
A Scaleable Implementation of Deep Learning on Spark – Alexander Ulanov – Spark Summit
This document summarizes research on implementing deep learning models using Spark. It describes:
1) Implementing a multilayer perceptron (MLP) model for digit recognition in Spark using batch processing and matrix optimizations to improve efficiency (see the sketch after this list).
2) Analyzing the tradeoffs of computation and communication in parallelizing the gradient calculation for batch training across multiple nodes to find the optimal number of workers.
3) Benchmark results showing Spark MLP achieves similar performance to Caffe on a single node and outperforms it by scaling nearly linearly when using multiple nodes.
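For readers who want to try the MLP route described in point 1, MLlib ships a MultilayerPerceptronClassifier whose blockSize parameter controls the batching/matrix optimization mentioned above; this is a toy PySpark sketch on an invented XOR-style dataset rather than the digit data used in the benchmarks.

```python
from pyspark.sql import SparkSession
from pyspark.ml.linalg import Vectors
from pyspark.ml.classification import MultilayerPerceptronClassifier

spark = SparkSession.builder.appName("mlp-sketch").getOrCreate()

# toy stand-in for digit data: (features, label) rows
data = spark.createDataFrame([
    (Vectors.dense([0.0, 0.0]), 0.0),
    (Vectors.dense([0.0, 1.0]), 1.0),
    (Vectors.dense([1.0, 0.0]), 1.0),
    (Vectors.dense([1.0, 1.0]), 0.0),
], ["features", "label"])

# layer sizes: 2 inputs -> 4 hidden units -> 2 output classes
# (for MNIST-style digits this would be something like [784, 128, 64, 10]);
# blockSize groups input rows into matrices for faster linear algebra
mlp = MultilayerPerceptronClassifier(layers=[2, 4, 2], blockSize=128,
                                     maxIter=200, seed=42)
model = mlp.fit(data)
model.transform(data).select("features", "prediction").show()
```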
Continuous Evaluation of Deployed Models in Production – Databricks
Many high-tech industries rely on machine-learning systems in production environments to automatically classify and respond to vast amounts of incoming data. Despite their critical roles, these systems are often not actively monitored. When a problem first arises, it may go unnoticed for some time. Once it is noticed, investigating its underlying cause is a time-consuming, manual process. Wouldn’t it be great if the model’s output were automatically monitored? If it could be visualized and sliced by different dimensions? If the system could automatically detect performance degradation and trigger alerts? In this presentation, we describe our experience building one such core machine-learning service: Model Evaluation.
Our service provides automated, continuous evaluation of the performance of a deployed model over commonly used metrics such as area under the curve (AUC) and root mean square error (RMSE). In addition, summary statistics about the model’s output and its distributions are also computed. The service also provides a dashboard to visualize the performance metrics, summary statistics, and distributions of a model over time, along with REST APIs to retrieve these metrics programmatically.
These metrics can be sliced by input features (e.g., geography, product type) to provide insights into model performance over different segments. The talk will describe the various components required to build such a service and the metrics of interest. Our system has a backend component built with Spark on Azure Databricks; the backend can scale to analyze TBs of data to generate model evaluation metrics.
We will talk about how we modified Spark MLlib to compute AUC sliced by different dimensions, along with other Spark optimizations to improve compute and performance. Our front-end and middle tier, built with Docker and Azure Web App, provide visuals and REST APIs to retrieve the above metrics. This talk will cover various aspects of building, deploying, and using the above system.
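The speakers' modified MLlib code is not reproduced here, but a naive version of "AUC sliced by a dimension" can be sketched with stock PySpark by evaluating each segment separately; the scored data and the geography column are invented, and this loop-per-segment approach is far less efficient than the single-pass variant the talk describes.

```python
from pyspark.sql import SparkSession
from pyspark.ml.evaluation import BinaryClassificationEvaluator

spark = SparkSession.builder.appName("sliced-auc-sketch").getOrCreate()

# hypothetical scored output: model score, true label, and a slicing dimension
scored = spark.createDataFrame([
    (0.9, 1.0, "US"), (0.2, 0.0, "US"), (0.7, 1.0, "US"), (0.4, 0.0, "US"),
    (0.8, 1.0, "EU"), (0.6, 0.0, "EU"), (0.3, 0.0, "EU"), (0.75, 1.0, "EU"),
], ["score", "label", "geo"])

evaluator = BinaryClassificationEvaluator(rawPredictionCol="score",
                                          labelCol="label",
                                          metricName="areaUnderROC")

# naive per-segment evaluation: filter once per distinct slice value
for (geo,) in scored.select("geo").distinct().collect():
    auc = evaluator.evaluate(scored.filter(scored.geo == geo))
    print(geo, auc)
```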
Deep Learning with DL4J on Apache Spark: Yeah it’s Cool, but are You Doing it... – Databricks
DeepLearning4J (DL4J) is a powerful Open Source distributed framework that brings Deep Learning to the JVM (it can serve as a DIY tool for Java, Scala, Clojure and Kotlin programmers). It can be used on distributed GPUs and CPUs. It is integrated with Hadoop and Apache Spark. ND4J is an Open Source, distributed, GPU-enabled library that brings the intuitive scientific computing tools of the Python community to the JVM. Training neural network models using DL4J, ND4J and Spark is a powerful combination, but it presents some unexpected issues that can compromise performance and nullify the benefits of well-written code and good model design. In this talk I will walk through some of those problems and present some best practices to prevent them, drawn from lessons learned when putting things into production.
Interactive Graph Analytics with Spark (Daniel Darabos, Lynx Analytics) – Spark Summit
This document summarizes Daniel Darabos' talk about the design and implementation of the LynxKite graph analytics application. The key ideas discussed are: (1) using column-based attributes to avoid processing unused data, (2) making joins fast through co-located loading of sorted RDDs, (3) not reading or computing all the data through techniques like prefix sampling, and (4) using binary search for lookups instead of filtering for small key sets. Examples are provided to illustrate how these techniques improve performance and user experience of interactive graph analytics on Spark.
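Idea (2) above, making joins fast by co-locating data, can be approximated in stock Spark by giving both RDDs the same partitioner before joining, which turns the join into a narrow, shuffle-free dependency; the toy key-value data in this sketch is invented.

```python
from pyspark import SparkContext

sc = SparkContext(appName="copartition-sketch")

num_parts = 8

# pre-partition both key-value RDDs with the same partitioner and cache them
left = (sc.parallelize([(i, "name_%d" % i) for i in range(1000)])
          .partitionBy(num_parts).cache())
right = (sc.parallelize([(i, i * i) for i in range(1000)])
           .partitionBy(num_parts).cache())

# because both sides share a partitioner, matching partitions are combined
# locally and the join needs no additional shuffle
joined = left.join(right)
print(joined.take(3))
```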
Lessons Learned while Implementing a Sparse Logistic Regression Algorithm in ... – Spark Summit
This talk tells the story of implementing and optimizing a sparse logistic regression algorithm in Spark. I would like to share the lessons I learned and the steps I had to take to improve the speed of execution and convergence of my initial naive implementation. The message isn’t to convince the audience that logistic regression is great and my implementation is awesome; rather, it gives details about how it works under the hood, and general tips for implementing an iterative parallel machine learning algorithm in Spark. The talk is structured as a sequence of “lessons learned” that are shown in the form of code examples building on the initial naive implementation. The performance impact of each “lesson” on execution time and speed of convergence is measured on benchmark datasets. You will see how to formulate logistic regression in a parallel setting, how to avoid data shuffles, when to use a custom partitioner, how to use the ‘aggregate’ and ‘treeAggregate’ functions, how momentum can accelerate the convergence of gradient descent, and much more. I will assume a basic understanding of machine learning and some prior knowledge of Spark. The code examples are written in Scala, and the code will be made available for each step in the walkthrough.
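The talk's examples are in Scala; as a rough PySpark sketch of the treeAggregate pattern it highlights, the loop below computes the logistic-regression gradient with partial aggregation arranged in a tree instead of sending every partition's result straight to the driver (toy dense data and an invented step size, not the speaker's code).

```python
import numpy as np
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lr-treeaggregate-sketch").getOrCreate()
sc = spark.sparkContext

# toy dataset of (label, feature-vector) pairs; a real job would use sparse vectors
data = sc.parallelize([
    (1.0, np.array([0.5, 1.2])), (0.0, np.array([1.5, -0.3])),
    (1.0, np.array([0.2, 0.9])), (0.0, np.array([1.1, -0.8])),
]).cache()
n = data.count()
w = np.zeros(2)

def gradient_ops(w):
    """Build seqOp/combOp closures over the current weight vector."""
    def seq_op(acc, point):
        label, x = point
        pred = 1.0 / (1.0 + np.exp(-float(x.dot(w))))
        return acc + (pred - label) * x      # add this point's gradient contribution
    def comb_op(a, b):
        return a + b
    return seq_op, comb_op

for _ in range(50):
    seq_op, comb_op = gradient_ops(w)
    # treeAggregate combines partial gradients in a tree of intermediate merges
    grad = data.treeAggregate(np.zeros(2), seq_op, comb_op, depth=2)
    w -= 1.0 * (grad / n)                    # plain gradient-descent step
```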
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ... – Jose Quesada (hiring)
The machine learning libraries in Apache Spark are an impressive piece of software engineering and are maturing rapidly. What advantages does Spark.ml offer over scikit-learn, and which would you use in production?
At Data Science Retreat we've taken a real-world dataset and worked through the stages of building a predictive model -- exploration, data cleaning, feature engineering, and model fitting -- in several different frameworks. We'll show what it's like to work with native Spark.ml, and compare it to scikit-learn along several dimensions: ease of use, productivity, feature set, and performance.
In some ways Spark.ml is still rather immature, but it also conveys new superpowers to those who know how to use it.
Practical Distributed Machine Learning Pipelines on Hadoop – DataWorks Summit
This document summarizes machine learning pipelines in Apache Spark using MLlib. It introduces Spark DataFrames for structured data manipulation and Apache Spark MLlib for building machine learning workflows. An example text classification pipeline is presented to demonstrate loading data, feature extraction, training a logistic regression model, and evaluating performance. Parameter tuning is discussed as an important part of the machine learning process.
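A minimal version of such a text-classification pipeline, along the lines of the standard MLlib example (the toy training sentences are invented): Tokenizer and HashingTF for feature extraction, then logistic regression, all chained in a Pipeline.

```python
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, HashingTF
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("text-pipeline-sketch").getOrCreate()

# invented toy training data: (text, label) rows
train = spark.createDataFrame([
    ("spark is fast and general", 1.0),
    ("I love going to the beach", 0.0),
    ("mllib makes machine learning easy", 1.0),
    ("the weather is nice today", 0.0),
], ["text", "label"])

tokenizer = Tokenizer(inputCol="text", outputCol="words")
hashing_tf = HashingTF(inputCol="words", outputCol="features", numFeatures=1000)
lr = LogisticRegression(maxIter=10, regParam=0.01)

# the Pipeline chains feature extraction and model training into one estimator
model = Pipeline(stages=[tokenizer, hashing_tf, lr]).fit(train)
model.transform(train).select("text", "prediction").show(truncate=False)
```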
Using SparkR to Scale Data Science Applications in Production. Lessons from t... – Spark Summit
R is a hugely popular platform for Data Scientists creating analytic models in many different domains. But when these applications move from the science lab to the production environment of large enterprises, a new set of challenges arises. Independently of R, Spark has been very successful as a powerful general-purpose computing platform. With the introduction of SparkR, an exciting new option for productionizing Data Science applications has become available. This talk will give insight into two real-life projects at major enterprises where Data Science applications in R have been migrated to SparkR.
• Dealing with platform challenges: R was not installed on the cluster. We show how to execute SparkR on a Yarn cluster with a dynamic deployment of R.
• Integrating Data Engineering and Data Science: we highlight the technical and cultural challenges that arise from closely integrating these two different areas.
• Separation of concerns: we describe how to disentangle ETL and data preparation from analytic computing and statistical methods.
• Scaling R with SparkR: we present what options SparkR offers to scale R applications and how we applied them to different areas such as time series forecasting and web analytics.
• Performance Improvements: we will show benchmarks for an R application that took over 20 hours on a single-server, single-threaded setup. With moderate effort we have been able to reduce that to 15 minutes with SparkR, and we will show how we plan to further reduce it to less than a minute in the future.
• Mixing SparkR, SparkSQL and MLlib: we show how we combined the three different libraries to maximize efficiency.
• Summary and Outlook: we describe what we have learnt so far, what the biggest gaps currently are and what challenges we expect to solve in the short- to mid-term.
The Parquet Format and Performance Optimization Opportunities – Databricks
The Parquet format is one of the most widely used columnar storage formats in the Spark ecosystem. Given that I/O is expensive and that the storage layer is the entry point for any query execution, understanding the intricacies of your storage format is important for optimizing your workloads.
As an introduction, we will provide context around the format, covering the basics of structured data formats and the underlying physical data storage model alternatives (row-wise, columnar and hybrid). Given this context, we will dive deeper into specifics of the Parquet format: representation on disk, physical data organization (row-groups, column-chunks and pages) and encoding schemes. Now equipped with sufficient background knowledge, we will discuss several performance optimization opportunities with respect to the format: dictionary encoding, page compression, predicate pushdown (min/max skipping), dictionary filtering and partitioning schemes. We will learn how to combat the evil that is ‘many small files’, and will discuss the open-source Delta Lake format in relation to this and Parquet in general.
This talk serves both as an approachable refresher on columnar storage as well as a guide on how to leverage the Parquet format for speeding up analytical workloads in Spark using tangible tips and tricks.
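A small PySpark sketch of two of the optimizations discussed, partition pruning and predicate pushdown; the synthetic dataset, output path, and filter values are invented, and the explain() output is where the PartitionFilters and PushedFilters entries can be checked.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("parquet-sketch").getOrCreate()

# synthetic events with a low-cardinality column suitable for partitioning
events = (spark.range(0, 1_000_000)
          .withColumn("country", F.when(F.col("id") % 3 == 0, "NL").otherwise("US"))
          .withColumn("amount", (F.rand() * 100).cast("double")))

# repartitioning by the partition column before the write keeps the number of
# small files down; partitionBy enables partition pruning at read time
(events.repartition("country")
       .write.mode("overwrite")
       .partitionBy("country")
       .parquet("/tmp/events_parquet"))

# the filter on `country` prunes whole partitions; the filter on `amount` is
# pushed down to the Parquet reader and can skip row groups via min/max stats
df = (spark.read.parquet("/tmp/events_parquet")
      .filter((F.col("country") == "NL") & (F.col("amount") > 90)))
df.explain()   # look for PartitionFilters and PushedFilters in the scan node
```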
Recent Developments In SparkR For Advanced Analytics – Databricks
Since its introduction in Spark 1.4, SparkR has received contributions from both the Spark community and the R community. In this talk, we will summarize recent community efforts on extending SparkR for scalable advanced analytics. We start with the computation of summary statistics on distributed datasets, including single-pass approximate algorithms. Then we demonstrate MLlib machine learning algorithms that have been ported to SparkR and compare them with existing solutions on R, e.g., generalized linear models, classification and clustering algorithms. We also show how to integrate existing R packages with SparkR to accelerate existing R workflows.
Build, Scale, and Deploy Deep Learning Pipelines with Ease – Databricks
Deep Learning has shown tremendous success, yet it often requires a lot of effort to leverage its power. Existing Deep Learning frameworks require writing a lot of code to work with a model, let alone in a distributed manner.
This webinar is the first of a series in which we survey the state of Deep Learning at scale and introduce Deep Learning Pipelines, a new open-source package for Apache Spark. This package simplifies Deep Learning in three major ways:
1. It has a simple API that integrates well with enterprise Machine Learning pipelines.
2. It automatically scales out common Deep Learning patterns, thanks to Spark.
3. It enables exposing Deep Learning models through the familiar Spark APIs, such as MLlib and Spark SQL.
In this webinar, we will look at a complex problem of image classification, using Deep Learning and Spark. Using Deep Learning Pipelines, we will show:
* how to build deep learning models in a few lines of code;
* how to scale common tasks like transfer learning and prediction; and
* how to publish models in Spark SQL.
Spark's Role in the Big Data Ecosystem (Spark Summit 2014) – Databricks
This document summarizes the growth and development of the Spark project. It notes that Spark has grown significantly over the past year in terms of contributors, companies involved, and lines of code. Spark is now one of the most active projects within the Apache Hadoop ecosystem. The document outlines major new additions to Spark including Spark SQL for structured data, MLlib for machine learning algorithms, and Java 8 APIs. It discusses the vision for Spark as a unified platform and standard library for big data applications.
Spark and the Future of Advanced Analytics by Thomas Dinsmore – Spark Summit
This document discusses the need for distributed platforms for machine learning and analytics. It argues that distributed systems are necessary because data sources and targets are distributed, data movement is expensive, and data and model requirements are growing. It presents Spark as currently the best option for a distributed framework, and notes that vendors are working to integrate their tools with Spark to enable distributed workflows. In summary, distributed machine learning is needed due to expanding data and computing demands, and Spark has emerged as the leading framework for distributed analytics and machine learning.
Unlocking Your Hadoop Data with Apache Spark and CDH5 – SAP Concur
The Spark/Mesos Seattle Meetup group shares the latest presentation from their recent meetup event, showcasing real-world implementations of working with Spark within the context of your Big Data infrastructure.
Sessions are demo-heavy and slide-light, focusing on getting your development environment up and running, including configuration issues, SparkSQL vs. Hive, and more.
To learn more about the Seattle meetup: http://www.meetup.com/Seattle-Spark-Meetup/members/21698691/
Geospatial Analytics at Scale with Deep Learning and Apache Spark – Databricks
Raela Wang presented on geospatial analytics at scale using deep learning and Apache Spark. The talk covered ingesting image data with Spark, performing object detection on images using transfer learning models, and analyzing the results geospatially using Magellan. A pipeline was demonstrated that reads images with Spark, extracts predictions from a pre-trained model using Deep Learning Pipelines, and visualizes the objects identified on a heatmap while integrating with Magellan for geospatial analytics. The talk highlighted how Spark can tie together various technologies like deep learning frameworks, geospatial tools, and visualization libraries to perform end-to-end geospatial analytics at scale.
First in Class: Optimizing the Data Lake for Tighter Integration – Inside Analysis
The Briefing Room with Dr. Robin Bloor and Teradata RainStor
Live Webcast October 13, 2015
Watch the archive: https://bloorgroup.webex.com/bloorgroup/lsr.php?RCID=012bb2c290097165911872b1f241531d
Hadoop data lakes are emerging as peers to corporate data warehouses. However, successful data management solutions require a fusion of all relevant data, new and old, which has proven challenging for many companies. With a data lake that’s been optimized for fast queries, solid governance and lifecycle management, users can take data management to a whole new level.
Register for this episode of The Briefing Room to learn from veteran Analyst Dr. Robin Bloor as he discusses the relevance of data lakes in today’s information landscape. He’ll be briefed by Mark Cusack of Teradata, who will explain how his company’s archiving solution has developed into a storage point for raw data. He’ll show how the proven compression, scalability and governance of Teradata RainStor combined with Hadoop can enable an optimized data lake that serves as both a reservoir for historical data and a “system of record” for the enterprise.
Visit InsideAnalysis.com for more information.
How the Development Bank of Singapore solves on-prem compute capacity challen... – Alluxio, Inc.
The Development Bank of Singapore (DBS) has evolved its data platforms over three generations to address big data challenges and the explosion of data. It now uses a hybrid cloud model with Alluxio to provide a unified namespace across on-prem and cloud storage for analytics workloads. Alluxio enables "zero-copy" cloud bursting by caching hot data and orchestrating analytics jobs between on-prem and cloud resources like AWS EMR and Google Dataproc. This provides dynamic scaling of compute capacity while retaining data locality. Alluxio also offers intelligent data tiering and policy-driven data migration to cloud storage over time for cost efficiency and management.
Agenda and objectives: 1) how to spot an Aspera opportunity; 2) an outline of the Aspera portfolio (a sales overview, not technical); 3) a look at the Aspera opportunity from SharePoint; 4) summary, Q&A and close, with interaction welcomed throughout.
Modernizing Global Shared Data Analytics Platform and our Alluxio Journey – Alluxio, Inc.
Data Orchestration Summit 2020 organized by Alluxio
https://www.alluxio.io/data-orchestration-summit-2020/
Modernizing Global Shared Data Analytics Platform and our Alluxio Journey
Sandipan Chakraborty, Director of Engineering (Rakuten)
About Alluxio: alluxio.io
Engage with the open source community on slack: alluxio.io/slack
Data Orchestration Platform for the Cloud – Alluxio, Inc.
This document discusses using a hybrid cloud approach with data orchestration to enable analytics workloads on data stored both on-premises and in the cloud. It outlines reasons for a hybrid approach including reducing time to production and leveraging cloud flexibility. It then describes alternatives like lift-and-shift or compute-driven approaches and their issues. Finally, it introduces a data orchestration platform that can cache and tier data intelligently while enabling analytics frameworks to access both on-premises and cloud-based data with low latency.
From limited Hadoop compute capacity to increased data scientist efficiency – Alluxio, Inc.
Alluxio Tech Talk
Oct 17, 2019
Speaker:
Alex Ma, Alluxio
Want to leverage your existing investments in Hadoop with your data on-premise and still benefit from the elasticity of the cloud?
Like other Hadoop users, you most likely experience very large and busy Hadoop clusters, particularly when it comes to compute capacity. Bursting HDFS data to the cloud can bring challenges – network latency impacts performance, copying data via DistCp means maintaining duplicate data, and you may have to make application changes to accommodate the use of S3.
“Zero-copy” hybrid bursting with Alluxio keeps your data on-prem and syncs data to compute in the cloud so you can expand compute capacity, particularly for ephemeral Spark jobs.
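A hedged sketch of what "zero-copy" bursting looks like from the Spark side: the job simply reads through an alluxio:// URI, and Alluxio serves cached blocks in the cloud while fetching cold data from the on-prem store mounted underneath. The master hostname, mount path, and column name are placeholders, and the Alluxio client jar is assumed to be on the Spark classpath so the scheme resolves.

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("alluxio-burst-sketch")
         .getOrCreate())

# hypothetical Alluxio master and mount path; reads go through Alluxio, which
# serves cached blocks locally in the cloud and pulls cold data from the
# on-prem HDFS/object store mounted under this path
df = spark.read.parquet("alluxio://alluxio-master:19998/mnt/onprem/events/")
df.groupBy("event_type").count().show()   # `event_type` is an invented column
```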
Cardinality provides innovative analytics using big data techniques. It analyzes data from multiple sources, including at the customer, service, and network-wide levels. Cardinality's analytics platform includes components for data collection, storage, analysis, and visualization. It utilizes open-source technologies like Hadoop, Spark, and open APIs to provide customizable and scalable solutions to customers.
Leapfrog into Serverless - a Deloitte-Amtrak Case Study | Serverless Confere... – Gary Arora
This talk was delivered at the Serverless Conference in New York City in 2017. Deloitte and Amtrak built a Serverless, Cloud-Native solution on AWS for a real-time operational data store and a near-real-time reporting data mart that modernized Amtrak's legacy systems and applications. With Serverless solutions, we are able to leapfrog over several rungs of computing evolution.
Gary Arora is a Cloud Solutions Architect at Deloitte Consulting, specializing on Azure & AWS.
Engineering Machine Learning Data Pipelines Series: Streaming New Data as It ... – Precisely
This document discusses engineering machine learning data pipelines and addresses five big challenges: 1) scattered and difficult to access data, 2) data cleansing at scale, 3) entity resolution, 4) tracking data lineage, and 5) ongoing real-time changed data capture and streaming. It presents DMX Change Data Capture as a solution to capture changes from various data sources and replicate them in real-time to targets like Kafka, HDFS, databases and data lakes to feed machine learning models. Case studies demonstrate how DMX-h has helped customers like a global hotel chain and insurance and healthcare companies build scalable data pipelines.
Billions of Messages in Real Time: Why Paypal & LinkedIn Trust an Engagement ... – Confluent
(Bruno Simic, Solutions Engineer, Couchbase)
Breakout during Confluent’s streaming event in Munich. This three-day hands-on course focused on how to build, manage, and monitor clusters using industry best-practices developed by the world’s foremost Apache Kafka™ experts. The sessions focused on how Kafka and the Confluent Platform work, how their main subsystems interact, and how to set up, manage, monitor, and tune your cluster.
Journey to the Data Lake: How Progressive Paved a Faster, Smoother Path to In... – DataWorks Summit
Progressive Insurance is well known for its innovative use of data to better serve its customers, and the important role that Hortonworks Data Platform has played in that transformation. However, as with most things worth doing, the path to the Data Lake was not without its challenges. In this session, I’ll share our top use cases for Hadoop – including telematics and display ads, how a skills shortage turned supporting these applications into a nightmare, and how – and why – we now use Syncsort DMX-h to accelerate enterprise adoption by making it quick and easy (or faster and easier) to populate the data lake – and keep it up to date – with data from across the enterprise. I’ll discuss the different approaches we tried, the benefits of using a tool vs. open source, and how we created our Hadoop Ingestor app using Syncsort DMX-h.
Ultralight data movement for IoT with SDC Edge. Guglielmo Iozzia - OptumData Driven Innovation
This document provides an overview and demonstration of Streamsets Data Collector (SDC) and SDC Edge for ingesting data from IoT devices and the edge. It discusses the challenges of ingesting data from distributed edge locations. It then describes the key features of SDC for designing flexible data flows with minimal coding. It also introduces SDC Edge, a lightweight agent for running SDC pipelines on edge devices. The presentation includes demonstrations of using SDC with Kafka and using SDC Edge to ingest and analyze data from Android devices and send it to Elasticsearch. It concludes with discussing additional topics and providing useful links.
Slides: Accelerating Queries on Cloud Data LakesDATAVERSITY
Using “zero-copy” hybrid bursting on remote data to solve data lake analytics capacity and performance problems.
Data scientists want answers on demand. But in today’s enterprise architectures, the reality is that most data remains on-prem, despite the promise of cloud-based analytics. Moving all that data to the cloud has typically not been possible for many reasons including cost, latency, and technical difficulty. So, what if there was a technology that would connect these on-prem environments to any major cloud platform, enabling high-powered computing without the need to move massive amounts of data?
Join us for this webinar where Alex Ma of Alluxio, an open-source data orchestration platform, will discuss how a data orchestration approach offers a solution for connecting traditional on-prem data centers and cloud data lakes with other clouds and data centers. With Alluxio’s “zero-copy” burst solution, companies can bridge remote data centers and data lakes with computing frameworks in other locations, enabling them to offload compute and leverage the flexibility, scalability, and power of the cloud for their remote data.
Enabling big data & AI workloads on the object store at DBS Alluxio, Inc.
DBS Bank is headquartered in Singapore and has evolved its data platforms over three generations from proprietary systems to a hybrid cloud-native platform using open source technologies. It is using Alluxio to unify access to data stored in its on-premises object store and HDFS as well as enable analytics workloads to burst into AWS. Alluxio provides data caching to speed up analytics jobs, intelligent data tiering for efficiency, and policy-driven data migration to the cloud over time. DBS is exploring further using cloud services for speech processing and moving more workloads to the cloud while keeping data on-premises for compliance.
This document discusses strategies for modernizing data centers through increased abstraction from hardware infrastructure. It advocates for a multi-year strategic planning approach to balance both incremental and transformational changes. Key elements of data center modernization discussed include adopting software-defined, programmable infrastructure through converged solutions, automation, and virtualization. Planning considerations cover people, processes, and technologies to support a transition to software-defined, utility-like operations over time.
Accelerate Analytics and ML in the Hybrid Cloud EraAlluxio, Inc.
Alluxio Webinar
September 22, 2020
For more Alluxio events: https://www.alluxio.io/events/
Speakers:
Alex Ma, Alluxio
Peter Behrakis, Alluxio
Many companies we talk to have on-premises data lakes and use the cloud(s) to burst compute, and many are now establishing new object data lakes as well. As a result, analytics frameworks such as Hive, Spark, and Presto, as well as machine learning workloads, experience sluggish response times when data and compute live in multiple locations. We also know there is an immense and growing data management burden to support these workflows.
In this talk, we will walk through what Alluxio’s Data Orchestration for the hybrid cloud era is and how it solves the performance and data management challenges we see.
In this tech talk, we'll go over:
- What is Alluxio Data Orchestration?
- How does it work?
- Alluxio customer results
Agile Lab is an Italian company that specializes in leveraging innovative technologies like machine learning, big data, and artificial intelligence to satisfy customers' objectives. They have over 50 specialists with deep experience in production environments. The company believes in investing in its team through conferences, R&D projects, and welfare benefits. They also release open source frameworks on GitHub and share knowledge through meetups in Milan and Turin.
Bringing Deep Learning into production Paolo Platter
- The document discusses deep learning frameworks and how to choose one for a given environment. It summarizes the strengths, weaknesses, opportunities and threats of popular frameworks like TensorFlow, Theano, Torch, Caffe, DeepLearning4J and H2O.
- It recommends H2O as a good choice for enterprise environments due to its ease of use, scalability on big data, integration with Spark, Java/Scala support and commercial support. DeepLearning4J is also recommended for more advanced deep neural networks and multi-dimensional arrays.
- The document proposes using Spark as a middleware to leverage multiple frameworks and avoid vendor lock-in, and describes Agile Lab's recommended stack for enterprises, which combines H2O with Spark.
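A minimal sketch of the "Spark as middleware" idea, assuming nothing about Agile Lab's actual stack: the trait and adapter names below are illustrative, and the framework-specific bodies are deliberately left unimplemented.

import org.apache.spark.sql.DataFrame

// A thin abstraction so pipeline code never depends on a specific DL framework.
trait DeepLearningBackend {
  def fit(train: DataFrame): Unit
  def score(input: DataFrame): DataFrame
}

// Hypothetical adapters; each would wrap the framework's own Spark integration.
class H2OBackend extends DeepLearningBackend {
  def fit(train: DataFrame): Unit = ???          // e.g. delegate to Sparkling Water
  def score(input: DataFrame): DataFrame = ???
}

class Dl4jBackend extends DeepLearningBackend {
  def fit(train: DataFrame): Unit = ???          // e.g. delegate to DL4J on Spark
  def score(input: DataFrame): DataFrame = ???
}

object DlMiddleware {
  // Spark prepares the data once; the backend can be swapped without touching the pipeline.
  def runPipeline(backend: DeepLearningBackend, train: DataFrame, batch: DataFrame): DataFrame = {
    backend.fit(train)
    backend.score(batch)
  }
}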
This document discusses reactive and parallel programming using Akka. It introduces Akka, an open-source toolkit for building distributed, concurrent applications on the JVM. Akka supports actors, futures, software transactional memory, and message passing to handle concurrency. Akka cluster allows building fault-tolerant distributed systems. Examples demonstrate using Akka for microservices, streaming engines, authentication platforms, and IoT applications. The document concludes with an example of how Agile Lab uses Akka in their applications.
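For readers new to Akka, here is a minimal classic-actor sketch of the message-passing model described above; the message types and the fraud-check rule are invented for illustration.

import akka.actor.{Actor, ActorSystem, Props}

// Messages are immutable case classes: no shared mutable state between actors.
case class Transaction(id: String, amount: Double)
case object ShutDown

class FraudCheckActor extends Actor {
  def receive: Receive = {
    case Transaction(id, amount) if amount > 10000 =>
      println(s"Transaction $id flagged for review (amount $amount)")
    case Transaction(id, _) =>
      println(s"Transaction $id looks fine")
    case ShutDown =>
      context.system.terminate()
  }
}

object AkkaExample extends App {
  val system = ActorSystem("demo")
  val checker = system.actorOf(Props[FraudCheckActor], "fraud-check")

  checker ! Transaction("tx-1", 250.0)
  checker ! Transaction("tx-2", 50000.0)
  checker ! ShutDown
}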
Massive Streaming Analytics with Spark StreamingPaolo Platter
Streaming data allows for faster reactions and more profits by enabling real-time processing compared to typical batch analysis workflows. Prediction is more valuable than rule-based categorization and clustering because it allows for pattern discovery, adaptation to changing data, and "smart thinking". Spark Streaming is presented as a tool for building streaming architectures with ingestion, processing, and serving layers. A demo of card transaction analysis with real-time classification, fraud detection, and alert generation is shown.
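A tiny Spark Streaming sketch in the spirit of that demo, assuming a socket source on localhost:9999 and a comma-separated cardId,amount,country record format (both invented); a real deployment would read from Kafka and apply a trained model instead of the naive rule shown here.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object CardStreamDemo {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("card-stream-demo").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(5))

    // Each line is a hypothetical "cardId,amount,country" record arriving in real time.
    val lines = ssc.socketTextStream("localhost", 9999)

    val transactions = lines.map(_.split(",")).map {
      case Array(cardId, amount, country) => (cardId, amount.toDouble, country)
    }

    // Naive rule-based classification standing in for a real fraud model.
    val alerts = transactions.filter { case (_, amount, country) =>
      amount > 5000.0 || country == "XX"
    }

    alerts.print() // in the demo this step would feed the serving/alerting layer

    ssc.start()
    ssc.awaitTermination()
  }
}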
This document summarizes key features of the Scala programming language. Some key points include:
- Scala runs on the Java Virtual Machine (JVM) and allows for type inference, immutable values, functional programming patterns like pattern matching, and object-oriented features like traits for inheritance.
- Scala favors immutable values over mutable variables for scalability. Features like actors allow for concurrency without shared mutable state.
- Scala code can be compiled to JavaScript using Scala.js, allowing full-stack development in a single language.
- Traits provide a powerful way to achieve multiple inheritance by mixing traits into classes, including at instantiation time.
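A short, self-contained Scala example of the features listed above (immutable values, pattern matching, and trait mix-ins); the types and names are invented for illustration.

// Immutable values, pattern matching, and trait mix-ins in a few lines.
trait Greeter {
  def greet(name: String): String = s"Hello, $name"
}

sealed trait Shape
case class Circle(radius: Double) extends Shape
case class Rectangle(w: Double, h: Double) extends Shape

object ScalaTour extends Greeter {
  def area(s: Shape): Double = s match {   // pattern matching with exhaustiveness checking
    case Circle(r)       => math.Pi * r * r
    case Rectangle(w, h) => w * h
  }

  def main(args: Array[String]): Unit = {
    val shapes = List(Circle(1.0), Rectangle(2.0, 3.0))  // val: immutable by default
    shapes.map(area).foreach(a => println(f"area = $a%.2f"))
    println(greet("Scala"))                               // behaviour mixed in via a trait
  }
}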
2. What we do
«I’m not a speed reader, I’m a speed understander»
Quote: Isaac Asimov
SCALABLE TECHNOLOGIES
MACHINE LEARNING FOR BIG DATA
3. Technologies
• In-memory and data exploration
• NoSQL
• Data ingestion and highly scalable applications
• Hadoop frameworks and storage formats
• Deployment and resource management
• Cloud deployment
• Change Data Capture
• BI tools development and data visualization
4. About me
• Paolo Platter
• CTO @ Agile Lab
• Data & Tech addicted
• Certified Cassandra Architect
• Certified Spark Developer
5. Big Data is Hard
• Distributed computing is complex
• Technology Landscape is huge
• Technologies are not general purpose
• Without experience you fail
• Learning curve is flat
6. Big Data Streaming is harder
• Long running processes
• 24/7 requirements
• Anti-patterns are super common
• Software upgrade is hard
• Data delivery semantics are common pitfalls
7. WASP Platform
WASP is a pre-integrated solution: a big data infrastructure that lets you focus on business logic instead of Hadoop component integration and other side developments.
9. Out of the box features
• Telemetry
• Log collection
• Model Server
• Push Notification
10. Clustering at any level
• Infrastructure Orchestration
• Application Orchestration
11. Kappa + Lambda
Architecture diagram: data sources (logs, RDBMS, sensors, queues, HTTP, TCP, MQTT) feed an ingestion & decoupling layer; event/stream processing covers streaming data transformation, batch computing, log processing, and a model server; results land in stores (raw, index, key-value) and are served to data consumers (BI, file export, mobile, web apps, dashboards, alerting).
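A minimal Structured Streaming sketch of this flow, assuming a Kafka topic named sensor-events and HDFS paths that are purely illustrative; it is not WASP code, just the ingestion → processing → result-store shape in plain Spark.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object KappaStyleFlow {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("kappa-style-flow").getOrCreate()

    // Ingestion & decoupling: every source lands on Kafka first.
    val raw = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker1:9092")
      .option("subscribe", "sensor-events")              // hypothetical topic
      .load()
      .selectExpr("CAST(value AS STRING) AS payload", "timestamp")

    // Stream processing: a simple transformation standing in for enrichment / ML.
    val enriched = raw.withColumn("ingestedAt", current_timestamp())

    // Result store #1: raw zone on HDFS.
    enriched.writeStream
      .format("parquet")
      .option("path", "hdfs:///datalake/raw/sensor-events")
      .option("checkpointLocation", "hdfs:///checkpoints/raw-sensor-events")
      .start()

    // Result store #2: a serving store (console used here as a stand-in).
    enriched.writeStream
      .format("console")
      .option("checkpointLocation", "/tmp/checkpoints/console-sensor-events")
      .start()

    spark.streams.awaitAnyTermination()
  }
}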
13. Pipegraphs
• A pipegraph is an abstraction layer composed of a collection of Spark and Akka transformers.
• Data routing and storage can be changed without coding.
• Strategy classes are pluggable via class loading to customize transformer behaviour (see the sketch below).
• Anti-patterns are avoided by design: each transformer has one single output.
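A hedged sketch of what "pluggable strategies via class loading" can look like; the Strategy trait and loader below are illustrative stand-ins, not WASP's actual API, and the column name is invented.

import org.apache.spark.sql.DataFrame

// Illustrative contract for a pluggable transformation step: one DataFrame in, one out,
// enforcing the "single output per transformer" rule by construction.
trait Strategy extends Serializable {
  def transform(input: DataFrame): DataFrame
}

// A concrete strategy shipped as a separate jar and referenced only by name in configuration.
class UppercaseNames extends Strategy {
  import org.apache.spark.sql.functions.upper
  def transform(input: DataFrame): DataFrame =
    input.withColumn("name", upper(input("name")))
}

object StrategyLoader {
  // "Pluggable with class loading": instantiate the strategy from its fully qualified name.
  def load(className: String): Strategy =
    Class.forName(className).getDeclaredConstructor().newInstance().asInstanceOf[Strategy]
}

// Usage (e.g. driven by pipegraph configuration rather than code):
//   val strategy = StrategyLoader.load("UppercaseNames")
//   val out      = strategy.transform(inputDf)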
15. Pipegraphs
Example pipegraph (diagram): Topic 1 → enrichment with static data → Topic 2, save to HDFS, save to Elasticsearch. Without coding!
16. Pipegraphs
Example pipegraph (diagram): Topic 1 → enrichment with static data → Topic 2; Topic 3 → alert generation → save to Cassandra; plus save to HDFS and save to Elasticsearch. Here we need a strategy (a sketch follows below).
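An illustrative strategy for the alert-generation step, reusing the stand-in Strategy trait from the previous sketch (repeated here so the snippet is self-contained); the column names and threshold are hypothetical.

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._

// Stand-in trait repeated from the previous sketch; not WASP's real API.
trait Strategy extends Serializable { def transform(input: DataFrame): DataFrame }

// Hypothetical alert rule: flag overheating readings and tag them for the Cassandra sink.
class AlertGenerationStrategy extends Strategy {
  def transform(input: DataFrame): DataFrame =
    input
      .filter(col("temperature") > lit(90.0))
      .withColumn("alertType", lit("OVERHEATING"))
      .withColumn("raisedAt", current_timestamp())
}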
17. WASP Producers
Diagram: producers feed the WASP core (WASP master).
• Producers – enterprise service integration, lightweight integration, local connection
• REST proxy – decoupled ingestion, HTTP proxy, powered by Confluent
• Edge computing for IoT and SoT – intelligent streaming, edge computing, site-to-site communication, data in motion with QoS
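For the decoupled-ingestion path, a producer ultimately just publishes events to Kafka. The sketch below uses the plain Kafka producer API rather than WASP's producer abstraction; the broker address, topic name, and payload are hypothetical.

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

object SimpleProducer {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put("bootstrap.servers", "broker1:9092")
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

    val producer = new KafkaProducer[String, String](props)
    try {
      // Publish one JSON event onto the ingestion topic that decouples sources from processing.
      val event = """{"deviceId":"sensor-42","temperature":21.5}"""
      producer.send(new ProducerRecord[String, String]("wasp.ingestion", "sensor-42", event))
      producer.flush()
    } finally {
      producer.close()
    }
  }
}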
18. WASP IoT & SoT
Diagram: the WASP core runs in the enterprise tier, a WASP fog controller in the platform tier (plant), and gateways in the edge tier (industrial sites, physical stores). Highlights: intelligence at the edge, lightweight clustered acquisition processes, devices in mobility, an on-device bidirectional protocol, and site-to-site links between tiers.
19. WASP Fog computing
Diagram: the WASP core and WASP master sit in the enterprise tier; fog layers with a fog controller span the platform tier (plant) and the edge tier, with sensor layers underneath, all connected through site-to-site links.
• Controls a hierarchy of fog and sensor layers
• Remote and centralized monitoring of the whole topology
• Leverages transactional delivery semantics and a reliable site-to-site communication protocol
• Data processing can be clustered to increase throughput
• Provides separation between the control plane and the data plane
• Just one port opened on your firewalls
• The fog controller communicates only with the local peer
20. Control Plane & Data Plane
Diagram: data processors in the enterprise, platform (plant), and edge tiers form the data plane; control layers and the fog controller form the control plane; tiers communicate over site-to-site links.
Control plane:
• Always active
• Starts / stops data processors
• Collects metrics and status from the data plane
• Provides inputs to the data plane, such as ML models or customized business logic
• Bidirectional communication
Data plane:
• Active or inactive
• Processes data in just one direction
• Can load an ML model or custom business logic
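A minimal Akka sketch of this separation, assuming nothing about WASP's real implementation: a controller actor on the control plane starts and stops a data-plane processor and collects its metrics, while data records flow in one direction only.

import akka.actor.{Actor, ActorRef, ActorSystem, Props}

// Messages exchanged between the planes.
case object Start
case object Stop
case class Metrics(processed: Long)

// Data plane: processes records in one direction and reports status upward.
class DataProcessor extends Actor {
  private var running = false
  private var processed = 0L

  def receive: Receive = {
    case Start => running = true
    case Stop  => running = false
    case record: String if running =>
      processed += 1
      context.parent ! Metrics(processed)   // status flows back to the control plane
  }
}

// Control plane: always active, starts/stops processors and collects their metrics.
class FogController extends Actor {
  private val processor: ActorRef = context.actorOf(Props[DataProcessor], "processor")

  def receive: Receive = {
    case Start          => processor ! Start
    case Stop           => processor ! Stop
    case Metrics(count) => println(s"data plane has processed $count records")
    case record: String => processor ! record   // pass data records through for the demo
  }
}

object ControlPlaneDemo extends App {
  val system = ActorSystem("wasp-fog-demo")
  val controller = system.actorOf(Props[FogController], "fog-controller")

  controller ! Start
  Seq("r1", "r2", "r3").foreach(controller ! _)
  controller ! Stop
  system.terminate()
}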