The document discusses designing scalable platforms for artificial intelligence (AI) and machine learning (ML). It outlines several challenges in developing AI applications, including technical debt, unpredictability, and data and compute needs that differ from traditional software. It then reviews existing commercial AI platforms and common components of AI platforms, including data access, ML workflows, computing infrastructure, model management, and APIs. The rest of the document focuses on eBay's Krylov project as an example AI platform, outlining its architecture, the challenges of deploying platforms at scale, and the skill sets needed on the platform team.
Revolutionary container-based hybrid cloud solution for an ML platform
Ness' data science platform, NextGenML, puts the entire machine learning process (modelling, execution, and deployment) in the hands of data science teams.
The whole paradigm is built around collaboration on AI/ML, implemented with full respect for best practices and a commitment to innovation.
Kubernetes (on-prem) + Docker, Azure Kubernetes Service (AKS), Nexus, Azure Container Registry (ACR), GlusterFS
Workflow
Argo -> Kubeflow
DevOps
Helm, ksonnet, Kustomize, Azure DevOps
Code Management & CI/CD
Git, TeamCity, SonarQube, Jenkins
Security
MS Active Directory, Azure VPN, Dex (K8s) integrated with GitLab
Machine Learning
TensorFlow (model training, TensorBoard, serving), Keras, Seldon
Storage (Azure)
Storage Gen1 & Gen2, Data Lake, File Storage
ETL (Azure)
Databricks, Spark on K8s, Data Factory (ADF), HDInsight (Kafka and Spark), Service Bus (ASB)
Lambda functions & VMs, Cache for Redis
Monitoring and Logging
Grafana, Prometheus, Graylog
Operationalizing AI at scale using MADlib Flow - Greenplum Summit 2019 (VMware Tanzu)
This document discusses operationalizing machine learning models at scale using MADlib Flow. It introduces MADlib Flow, which allows deploying models trained in PostgreSQL or Greenplum to Docker, Pivotal Cloud Foundry, or Kubernetes. Common challenges with operationalizing models are outlined. MADlib Flow addresses these challenges by providing an easy way to deploy models with high scalability, low latency predictions, and end-to-end SQL workflows. A demo of using MADlib Flow to deploy a fraud detection model trained in Greenplum and score transactions in real time is presented.
Accelerating Deep Learning Training with BigDL and Drizzle on Apache Spark wi... (Databricks)
The BigDL framework scales deep learning for large data sets using Apache Spark. However, there is significant scheduling overhead from Spark when running BigDL at large scale. In this talk we propose a new parameter-manager implementation that, along with coarse-grained scheduling, can provide significant speedups for deep learning models such as Inception and VGG. Aggregation functions like reduce or treeReduce that are used for parameter aggregation in Apache Spark (and the original MapReduce) are slow, as the centralized scheduling and driver network bandwidth become a bottleneck, especially in large clusters.
To reduce the overhead of parameter aggregation and allow for near-linear scaling, we introduce a new AllReduce operation, a part of the parameter manager in BigDL which is built directly on top of the BlockManager in Apache Spark. AllReduce in BigDL uses a peer-to-peer mechanism to synchronize and aggregate parameters. During parameter synchronization and aggregation, all nodes in the cluster play the same role and driver’s overhead is eliminated thus enabling near-linear scaling. To address the scheduling overhead we use Drizzle, a recently proposed scheduling framework for Apache Spark. Currently, Spark uses a BSP computation model, and notifies the scheduler at the end of each task. Invoking the scheduler at the end of each task adds overheads and results in decreased throughput and increased latency.
Drizzle introduces group scheduling, where multiple iterations (or a group) of iterations are scheduled at once. This helps decouple the granularity of task execution from scheduling and amortizes the costs of task serialization and launch. Finally we will present results from using the new AllReduce operation and Drizzle on a number of common deep learning models including VGG and Inception. Our benchmarks run on Amazon EC2 and Google DataProc will show the speedups and scalability of our implementation.
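The peer-to-peer aggregation idea above can be illustrated with a minimal single-process sketch: a chunked reduce-scatter followed by an all-gather, so no driver ever holds the full aggregation load. The function name and data layout here are illustrative, not BigDL's actual API.

```python
# Minimal sketch of a peer-to-peer AllReduce over parameter vectors.
# Each worker "owns" one chunk: it sums that chunk across all peers
# (reduce-scatter), then every worker gathers the reduced chunks
# (all-gather). No single driver aggregates everything, which is the
# property that removes the driver bottleneck described in the talk.

def all_reduce(worker_params):
    """worker_params: list of equal-length parameter lists, one per worker."""
    n = len(worker_params)
    size = len(worker_params[0])
    chunk = (size + n - 1) // n  # ceil division for chunk length

    # Reduce-scatter: worker i sums chunk i across all workers.
    reduced = []
    for i in range(n):
        lo, hi = i * chunk, min((i + 1) * chunk, size)
        reduced.append([sum(w[j] for w in worker_params) for j in range(lo, hi)])

    # All-gather: concatenate the reduced chunks; every worker gets a copy.
    total = [x for part in reduced for x in part]
    return [list(total) for _ in range(n)]
```

With two workers holding `[1, 2, 3, 4]` and `[10, 20, 30, 40]`, each worker ends up with the full sum `[11, 22, 33, 44]`, with the summation work split evenly between them.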
This document discusses MLOps and Kubeflow. It begins with an introduction to the speaker and defines MLOps as addressing the challenges of independently autoscaling machine learning pipeline stages, choosing different tools for each stage, and seamlessly deploying models across environments. It then introduces Kubeflow as an open source project that uses Kubernetes to minimize MLOps efforts by enabling composability, scalability, and portability of machine learning workloads. The document outlines key MLOps capabilities in Kubeflow like Jupyter notebooks, hyperparameter tuning with Katib, and model serving with KFServing and Seldon Core. It describes the typical machine learning process and how Kubeflow supports experimental and production phases.
Cloud Computing Was Built for Web Developers—What Does v2 Look Like for Deep... (Databricks)
What we call the public cloud was developed primarily to manage and deploy web servers. The target audience for these products is Dev Ops. While this is a massive and exciting market, the world of Data Science and Deep Learning is very different — and possibly even bigger. Unfortunately, the tools available today are not designed for this new audience and the cloud needs to evolve. This talk would cover what the next 10 years of cloud computing will look like.
Machine Learning Interpretability - Mateusz Dymczyk - H2O AI World London 2018 (Sri Ambati)
This talk was recorded in London on Oct 30, 2018 and can be viewed here: https://youtu.be/p4iAnxwC_Eg
The good news is building fair, accountable, and transparent machine learning systems is possible. The bad news is it’s harder than many blogs and software package docs would have you believe. The truth is nearly all interpretable machine learning techniques generate approximate explanations, that the fields of eXplainable AI (XAI) and Fairness, Accountability, and Transparency in Machine Learning (FAT/ML) are very new, and that few best practices have been widely agreed upon. This combination can lead to some ugly outcomes!
This talk aims to make your interpretable machine learning project a success by describing fundamental technical challenges you will face in building an interpretable machine learning system, defining the real-world value proposition of approximate explanations for exact models, and then outlining viable techniques for debugging, explaining, and testing machine learning models.
Mateusz is a software developer who loves all things distributed and machine learning, and hates buzzwords. His favourite hobby is data juggling.
He obtained his M.Sc. in Computer Science from AGH UST in Krakow, Poland, during which he did an exchange at L’ECE Paris in France and worked on distributed flight booking systems. After graduation he moved to Tokyo to work as a researcher at Fujitsu Laboratories on machine learning and NLP projects, where he is still currently based.
TensorFlow 16: Building a Data Science Platform (Seldon)
1. The document discusses building a data science platform on DC/OS to operationalize machine learning models. It outlines challenges at each stage of the ML pipeline and how DC/OS addresses them with distributed computing capabilities and services for data storage, processing, model training and deployment.
2. Key stages covered include data preparation, distributed training using frameworks like TensorFlow, model management with storage of trained models, and low-latency model serving for production with TensorFlow Serving.
3. DC/OS provides a full-stack platform to operationalize ML at scale through distributed computing resources, container orchestration, and integration of open source data and ML services.
Scaling and Unifying SciKit Learn and Apache Spark Pipelines (Databricks)
Pipelines have become ubiquitous, as the need for stringing multiple functions to compose applications has gained adoption and popularity. Common pipeline abstractions such as “fit” and “transform” are even shared across divergent platforms such as Python Scikit-Learn and Apache Spark.
Scaling pipelines at the level of simple functions is desirable for many AI applications, however is not directly supported by Ray’s parallelism primitives. In this talk, Raghu will describe a pipeline abstraction that takes advantage of Ray’s compute model to efficiently scale arbitrarily complex pipeline workflows. He will demonstrate how this abstraction cleanly unifies pipeline workflows across multiple platforms such as Scikit-Learn and Spark, and achieves nearly optimal scale-out parallelism on pipelined computations.
Attendees will learn how pipelined workflows can be mapped to Ray’s compute model and how they can both unify and accelerate their pipelines with Ray.
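The shared "fit"/"transform" abstraction the talk builds on can be sketched in a few lines. The stage and pipeline classes below are illustrative stand-ins, not the Ray-based abstraction from the talk; they only show the contract that Scikit-Learn and Spark ML pipelines have in common.

```python
# Sketch of the common "fit"/"transform" pipeline contract shared by
# Scikit-Learn and Spark ML. Each stage is fitted on the output of the
# previous stage, then transforms the data for the next one.

class Scale:
    """Stateful stage: learns the max during fit, rescales in transform."""
    def fit(self, data):
        self.max = max(data)
        return self
    def transform(self, data):
        return [x / self.max for x in data]

class Shift:
    """Stateless stage: fit is a no-op, transform adds a fixed offset."""
    def __init__(self, offset):
        self.offset = offset
    def fit(self, data):
        return self
    def transform(self, data):
        return [x + self.offset for x in data]

class Pipeline:
    def __init__(self, stages):
        self.stages = stages
    def fit_transform(self, data):
        # Chain the stages: fit each one on the running output.
        for stage in self.stages:
            data = stage.fit(data).transform(data)
        return data

pipe = Pipeline([Scale(), Shift(1.0)])
```

Because every stage exposes the same two methods, a scheduler (Ray in the talk's case) can treat stages as opaque tasks and parallelize them without caring which framework each one came from.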
Get Behind the Wheel with H2O Driverless AI Hands-On Training (Sri Ambati)
Driverless AI is an automated machine learning platform created by H2O.ai that can complete an entire machine learning workflow from data to deployment in as little as 2 hours. It uses techniques developed by H2O Grandmasters such as automated feature engineering, model tuning, and ensemble building to generate high performing models with little to no input from users. Driverless AI supports both structured and unstructured data types including text/NLP and time series data and generates documentation of all modeling steps.
Machine Learning at Scale with MLflow and Apache Spark (Databricks)
This document summarizes the challenges faced by SocGen, a large French bank, in implementing machine learning at scale using Spark and MLflow. Some key challenges included: 1) Keeping data and models local for regulatory reasons while performing training and prediction, 2) Ensuring reliability when moving models between prototyping and production phases, 3) Managing different Python package dependencies, 4) Tracking and managing many models, and 5) Ensuring high availability of the tracking server. The presentation provided a concrete example of using Spark, MLflow, and Kafka to periodically retrain a model for scoring news articles and handling user feedback in a scalable and reliable way.
CI/CD for Machine Learning with Daniel Kobran (Databricks)
The Lyft data platform: Now and in the future (markgrover)
- Lyft has grown significantly in recent years, providing over 1 billion rides to 30.7 million riders through 1.9 million drivers in 2018 across North America.
- Data is core to Lyft's business decisions, from pricing and driver matching to analyzing performance and informing investments.
- Lyft's data platform supports data scientists, analysts, engineers and others through tools like Apache Superset, change data capture from operational stores, and streaming frameworks.
- Key focuses for the platform include business metric observability, streaming applications, and machine learning while addressing challenges of reliability, integration and scale.
Bighead: Airbnb’s End-to-End Machine Learning Platform with Krishna Puttaswa... (Databricks)
Bighead is Airbnb's machine learning infrastructure that was created to:
- Standardize and simplify the ML development workflow;
- Reduce the time and effort to build ML models from weeks/months to days/weeks; and
- Enable more teams at Airbnb to utilize ML.
It provides shared services and tools for data management, model training/inference, and model management to make the ML process more efficient and production-ready. This includes services like Zipline for feature storage, Redspot for notebook environments, Deep Thought for online inference, and the Bighead UI for model monitoring.
Scaling out Driverless AI with IBM Spectrum Conductor - Kevin Doyle - H2O AI ... (Sri Ambati)
IBM Spectrum Conductor can manage H2O Driverless AI instances at scale across multiple nodes in an enterprise data center. Key benefits include the ability to run multiple Driverless AI instances on the same host using GPUs, failover capabilities if an instance fails, and role-based access control for users. The integration improves productivity by providing a shared file system, workload management, and allowing easy start/stop of Driverless AI instances.
A Modern Interface for Data Science on Postgres/Greenplum - Greenplum Summit ... (VMware Tanzu)
This document discusses providing a modern interface for data science on Postgres and Greenplum databases. It introduces Ibis, a Python library that provides a DataFrame abstraction for SQL systems. Ibis allows defining complex data pipelines and transformations using deferred expressions, providing type checking before execution. The document argues that Ibis could be enhanced to support user-defined functions, saving results to tables, and data science modeling abstractions to provide a full-featured interface for data scientists on SQL databases.
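The deferred-expression idea behind Ibis can be shown with a toy two-class sketch: expressions build a tree and are type-checked before any data is touched, then executed later. The class names below are illustrative, not Ibis's actual API.

```python
# Sketch of deferred expressions: building an expression performs type
# checking immediately, but no data is read until execute() is called.

class Column:
    def __init__(self, name, dtype):
        self.name, self.dtype = name, dtype
    def __add__(self, other):
        if self.dtype != other.dtype:
            # Caught before any query ever runs against the database.
            raise TypeError("mismatched column types")
        return Expr("+", self, other, self.dtype)

class Expr:
    def __init__(self, op, left, right, dtype):
        self.op, self.left, self.right, self.dtype = op, left, right, dtype
    def execute(self, rows):
        # Only here does any data get touched.
        return [row[self.left.name] + row[self.right.name] for row in rows]

price = Column("price", "float")
tax = Column("tax", "float")
total = price + tax  # builds an expression tree; runs nothing yet
```

In Ibis the `execute` step compiles the tree to SQL and pushes it down to Postgres or Greenplum; the point of the sketch is only the separation between expression construction (with type checking) and execution.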
When Apache Spark Meets TiDB with Xiaoyu Ma (Databricks)
During the past 10 years, big-data storage layers have mainly focused on analytical use cases. In such cases, users usually offload data onto a Hadoop cluster and run queries over HDFS files. People struggle with modifications on append-only storage and with maintaining fragile ETL pipelines.
On the other hand, although Spark SQL has proven to be an effective parallel query processing engine, some tricks common in traditional databases are not available due to the characteristics of the underlying storage. TiSpark sits directly on top of a distributed database (TiDB)’s storage engine, expands Spark SQL’s planning with its own extensions, and utilizes unique features of the database storage engine to achieve functions not possible for Spark SQL on HDFS. With TiSpark, users are able to run queries directly on changing/fresh data in real time.
The takeaways from this talk are twofold:
— How to integrate Spark SQL with a distributed database engine and the benefit of it
— How to leverage Spark SQL’s experimental methods to extend its capacity.
Using BigDL on Apache Spark to Improve the MLS Real Estate Search Experience ... (Databricks)
BigDL-enabled deep learning analysis of photos attached to property listings in a Multiple Listing Service database allowed us to extract image features and identify similar-looking properties. We leveraged this information in a real-time property search application to improve the relevancy of user search results. Imagine identifying a property listing photo you like and having the system suggest other listings you should also review. Traditional real-estate MLS (multiple listing service) search methods rely on SQL-type queries to search and serve real-estate listing results.
However, using BigDL in conjunction with the MLSListings standard APIs allows users to include photos as search parameters in real time, based both on image similarity and on semantic feature search. The information extracted from a listing’s images is used to improve the relevancy of the search results. To enable this use case, we implemented several CNNs using the BigDL framework on Microsoft Azure-hosted Apache Spark: – Image feature extraction and tagging, which extracts features from real-estate images and classifies them according to Real Estate Standards Organization rules, such as overall house style and interior and exterior attributes. – An image similarity network, which compares images belonging to different properties based on their extracted features and creates a similarity score to be used in search results.
We’ll discuss the above networks in detail and run a live demo of real-estate search results. Key takeaways: a) Why invest in Spark BigDL from the start. b) Why choose a cloud-based solution from the start. c) The choice of Scala vs. Python.
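The similarity-scoring step described above reduces, at its core, to comparing CNN feature vectors. A minimal sketch, with toy vectors standing in for the real CNN features and illustrative function names:

```python
# Sketch of listing-to-listing similarity: given feature vectors
# extracted by a CNN, rank other listings by cosine similarity to a
# query listing's photo features.

import math

def cosine(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def most_similar(query, listings):
    """listings: dict mapping listing id -> feature vector."""
    return max(listings, key=lambda lid: cosine(query, listings[lid]))
```

In the talk's setting the vectors come from the BigDL similarity network and the ranking feeds the search results; the sketch shows only the scoring arithmetic.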
ML at the Edge: Building Your Production Pipeline with Apache Spark and Tens... (Databricks)
The explosion of data volume in the years to come challenges the idea of a centralized cloud infrastructure that handles all business needs. Edge computing comes to the rescue by pushing computation and data analysis to the edge of the network, thus avoiding data exchange where that makes sense. One area where data exchange can impose a big overhead is scoring ML models, especially where the data to score are files such as images, e.g. in a computer vision application.
Another concern in some applications is keeping data as private as possible, and this is where keeping things local makes sense. In this talk we will discuss current needs and recent advances in model serving, like newly introduced formats for pushing models to edge nodes, e.g. mobile phones, and how a unified model-serving architecture could cover current and future needs for both data scientists and data engineers. This architecture is based, among other things, on training models in a distributed fashion with TensorFlow and leveraging Spark for cleaning data before training (e.g. using the TensorFlow connector).
Finally, we will describe a microservice-based approach for scoring models back on the cloud infrastructure side (where bandwidth can be high), e.g. using TensorFlow Serving, and for updating models remotely with a pull-model approach for edge devices. We will also talk about implementing the proposed architecture and how that might look on a modern deployment environment, e.g. Kubernetes.
In the last several months, MLflow has introduced significant platform enhancements that simplify machine learning lifecycle management. Expanded autologging capabilities, including a new integration with scikit-learn, have streamlined the instrumentation and experimentation process in MLflow Tracking. Additionally, schema management functionality has been incorporated into MLflow Models, enabling users to seamlessly inspect and control model inference APIs for batch and real-time scoring. In this session, we will explore these new features. We will share MLflow’s development roadmap, providing an overview of near-term advancements in the platform.
KFServing, Model Monitoring with Apache Spark and a Feature Store (Databricks)
In recent years, MLOps has emerged to bring DevOps processes to the machine learning (ML) development process, aiming at more automation in the execution of repetitive tasks and at smoother interoperability between tools. Among the different stages in the ML lifecycle, model monitoring involves the supervision of model performance over time, involving the combination of techniques in four categories: outlier detection, data drift detection, explainability and adversarial attacks. Most existing model monitoring tools follow a scheduled batch processing approach or analyse model performance using isolated subsets of the inference data. However, for the continuous monitoring of models, stream processing platforms show several advantages, including support for continuous data analytics, scalable processing of large amounts of data and first-class support for window-based aggregations useful for concept drift detection.
In this talk, we present an open-source platform for serving and monitoring models at scale based on Kubeflow’s model serving framework, KFServing, the Hopsworks Online Feature Store for enriching feature vectors with a transformer in KFServing, and Spark and Spark Streaming as general-purpose frameworks for monitoring models in production.
We also show how Spark Streaming can use the Hopsworks Feature Store to implement continuous data drift detection, where the Feature Store provides statistics on the distribution of feature values in training, and Spark Streaming computes the statistics on live traffic to the model, alerting if the live traffic differs significantly from the training data. We will include a live demonstration of the platform in action.
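The drift check described above can be reduced to a small sketch: the feature store supplies training-time statistics, and a streaming job compares statistics of a live window against them. The threshold and function names are illustrative, not the Hopsworks API.

```python
# Minimal sketch of continuous data drift detection: compare the mean of
# a live window of feature values against training-time statistics, and
# alert when the live mean drifts too many training standard deviations
# away from the training mean.

import math

def window_stats(values):
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, math.sqrt(var)

def drifted(train_mean, train_std, live_window, threshold=3.0):
    """Alert when the live window mean is > threshold training stds away."""
    live_mean, _ = window_stats(live_window)
    return abs(live_mean - train_mean) > threshold * train_std
```

In the platform from the talk, `train_mean` and `train_std` would come from the Feature Store's training statistics, and `live_window` from a window-based aggregation in Spark Streaming over live model traffic.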
Saving Energy in Homes with a Unified Approach to Data and AI (Databricks)
Energy wastage by residential buildings is a significant contributor to total worldwide energy consumption. Quby, an Amsterdam-based technology company, offers solutions that empower homeowners to stay in control of their electricity, gas, and water usage.
Plume - A Code Property Graph Extraction and Analysis Library (TigerGraph)
See all on-demand Graph + AI Sessions: https://www.tigergraph.com/graph-ai-world-sessions/
Get TigerGraph: https://www.tigergraph.com/get-tigergraph/
Multi-runtime serving pipelines for machine learning (Stepan Pushkarev)
The talk I gave at Scale By The Bay.
Deploying, serving, and monitoring machine learning models built with different ML frameworks in production. An Envoy-proxy-powered serving mesh. TensorFlow, Spark ML, Scikit-learn, and custom functions on CPU and GPU.
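The core of a multi-runtime serving mesh is a router that dispatches each request to whichever runtime hosts the requested model. A minimal sketch, with an in-process registry standing in for the Envoy-based routing described in the talk:

```python
# Sketch of multi-runtime model routing: each model is served by some
# runtime (TensorFlow, Spark ML, scikit-learn, a custom function), and
# the router dispatches scoring requests by model name. The registry is
# an illustrative in-process stand-in for a real service mesh.

class ModelRouter:
    def __init__(self):
        self.registry = {}  # model name -> callable scoring function

    def register(self, name, fn):
        self.registry[name] = fn

    def score(self, name, payload):
        if name not in self.registry:
            raise KeyError(f"no runtime serves model {name!r}")
        return self.registry[name](payload)

router = ModelRouter()
router.register("doubler", lambda x: 2 * x)  # stand-in for a real model
```

In a real mesh the registered entries would be network endpoints rather than local callables, but the dispatch logic (name-based lookup, fail fast on unknown models) is the same.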
A Microservices Framework for Real-Time Model Scoring Using Structured Stream... (Databricks)
Open-source technologies allow developers to build a microservices framework for myriad real-time applications. One such application is real-time model scoring. In this session, we will showcase how to architect a microservices framework, and in particular how to use it to build a low-latency, real-time model scoring system. At the core of the architecture lies Apache Spark’s Structured Streaming capability to deliver low-latency predictions, coupled with Docker and Flask as additional open-source tools for model serving. In this session, you will walk away with:
* Knowledge of enterprise-grade model as a service
* Streaming architecture design principles enabling real-time machine learning
* Key concepts and building blocks for real-time model scoring
* Real-time and production use cases across industries, such as IIOT, predictive maintenance, fraud detection, sepsis etc.
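The request-handling core of such a model-as-a-service endpoint can be sketched in a few lines: parse a JSON request, apply the model, and return a JSON prediction. The stand-in model and field names below are illustrative; in the talk this logic sits behind Flask, with predictions computed via Structured Streaming.

```python
# Sketch of the scoring endpoint's core: JSON in, prediction out.
# The "model" here is a toy fraud rule standing in for a real one.

import json

def predict(features):
    # Stand-in model: flag a transaction when the amount is unusually high.
    return {"fraud": features["amount"] > 1000.0}

def handle_request(body: bytes) -> bytes:
    """Parse a JSON request body, score it, and serialize the response."""
    features = json.loads(body)
    return json.dumps(predict(features)).encode()
```

Wrapping `handle_request` in a Flask route (or any HTTP server) and swapping `predict` for a real model gives the low-latency scoring service the session describes.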
In this session you will learn how H&M has created a reference architecture for deploying their machine learning models on Azure utilizing Databricks, following DevOps principles. The architecture is currently used in production and has been iterated over multiple times to solve some of the discovered pain points. The presenting team is currently responsible for ensuring that best practices are implemented on all H&M use cases, covering hundreds of models across the entire H&M group. This architecture not only lets data scientists use notebooks for exploration and modeling, but also gives engineers a way to build robust, production-grade code for deployment. The session will in addition cover topics like lifecycle management, traceability, automation, scalability, and version control.
Building Intelligent Applications, Experimental ML with Uber’s Data Science W... (Databricks)
In this talk, we will explore how Uber enables rapid experimentation with machine learning models and optimization algorithms through Uber’s Data Science Workbench (DSW). DSW covers a series of stages in data scientists’ workflow, including data exploration, feature engineering, machine learning model training, testing, and production deployment. DSW provides interactive notebooks for multiple languages with on-demand resource allocation and lets users share their work through community features.
It also has support for notebooks and intelligent applications backed by Spark job servers. Deep learning applications based on TensorFlow and Torch can be brought into DSW smoothly, with resource management taken care of by the system. The environment in DSW is customizable, and users can bring their own libraries and frameworks. Moreover, DSW provides support for Shiny and Python dashboards as well as many other in-house visualization and mapping tools.
In the second part of this talk, we will explore use cases where custom machine learning models developed in DSW are productionized within the platform. Uber applies machine learning extensively to solve some hard problems. Use cases include calculating the right prices for rides in over 600 cities and applying NLP technologies to customer feedback to offer safe rides and reduce support costs. We will look at the various options evaluated for productionizing custom models (server-based and serverless). We will also look at how DSW integrates into the larger Uber ML ecosystem, e.g. model/feature stores and other ML tools, to realize the vision of a complete ML platform for Uber.
H2O World - Survey of Available Machine Learning Frameworks - Brendan Herger (Sri Ambati)
H2O World 2015 - Brendan Herger of Capital One
- Powered by the open source machine learning software H2O.ai. Contributors welcome at: https://github.com/h2oai
- To view videos on H2O open source machine learning software, go to: https://www.youtube.com/user/0xdata
The power of RapidMiner, showing the direct marketing demo (Wessel Luijben)
Direct marketing allows businesses to target specific customer demographics like age, gender, and marital status. It has expanded beyond mail to include digital channels like text, email, and online ads. RapidMiner's direct marketing wizard helps businesses invest in the highest converting marketing actions and reduce costs through improved targeting. It provides a table of top customers to target, shows which customer properties most impact response rates, and evaluates the predictive model to determine if more customer data is needed.
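The targeting idea behind the demo can be sketched without RapidMiner: estimate response rates per customer segment from past campaigns, then keep only the segments worth mailing. The data shape, function names, and threshold below are illustrative.

```python
# Sketch of response-rate-based targeting: compute the historical
# response rate per customer segment, then select segments whose rate
# clears a minimum worth-mailing threshold.

def response_rates(history):
    """history: list of (segment, responded) pairs from past campaigns."""
    counts = {}
    for segment, responded in history:
        sent, hits = counts.get(segment, (0, 0))
        counts[segment] = (sent + 1, hits + int(responded))
    return {seg: hits / sent for seg, (sent, hits) in counts.items()}

def segments_to_target(history, min_rate=0.2):
    """Return the segments with response rate >= min_rate, sorted."""
    rates = response_rates(history)
    return sorted(seg for seg, r in rates.items() if r >= min_rate)
```

A real campaign would segment on the properties the summary mentions (age, gender, marital status) and use a predictive model rather than raw historical rates, but the cost-reduction logic is the same: mail only where the expected conversion justifies it.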
CASE 02: SAIGON COOPMART
Logistics and supply chain play an important role, indeed a critical one, in the success of Saigon Coopmart. Most supermarkets around the world follow an identical model in which a warehouse is placed next to the supermarket for stock storage, and the warehouse is roughly the same size as the supermarket. However, due to harsh competition and weak finances, Saigon Coopmart decided to follow a different model with a very small warehouse. This allows Saigon Coopmart to place more supermarkets; in exchange, each stocks only enough for a day, or at most two, compared to the ordinary model in which a warehouse can store enough stock for a week or more. As a consequence, Saigon Coopmart has to ship to its supermarkets much more frequently than competitors such as Big C.
Scaling and Unifying SciKit Learn and Apache Spark PipelinesDatabricks
Pipelines have become ubiquitous, as the need for stringing multiple functions to compose applications has gained adoption and popularity. Common pipeline abstractions such as “fit” and “transform” are even shared across divergent platforms such as Python Scikit-Learn and Apache Spark.
Scaling pipelines at the level of simple functions is desirable for many AI applications, however is not directly supported by Ray’s parallelism primitives. In this talk, Raghu will describe a pipeline abstraction that takes advantage of Ray’s compute model to efficiently scale arbitrarily complex pipeline workflows. He will demonstrate how this abstraction cleanly unifies pipeline workflows across multiple platforms such as Scikit-Learn and Spark, and achieves nearly optimal scale-out parallelism on pipelined computations.
Attendees will learn how pipelined workflows can be mapped to Ray’s compute model and how they can both unify and accelerate their pipelines with Ray.
Get Behind the Wheel with H2O Driverless AI Hands-On Training Sri Ambati
Driverless AI is an automated machine learning platform created by H2O.ai that can complete an entire machine learning workflow from data to deployment in as little as 2 hours. It uses techniques developed by H2O Grandmasters such as automated feature engineering, model tuning, and ensemble building to generate high performing models with little to no input from users. Driverless AI supports both structured and unstructured data types including text/NLP and time series data and generates documentation of all modeling steps.
Machine Learning at Scale with MLflow and Apache SparkDatabricks
This document summarizes the challenges faced by SocGen, a large French bank, in implementing machine learning at scale using Spark and MLflow. Some key challenges included: 1) Keeping data and models local for regulatory reasons while performing training and prediction, 2) Ensuring reliability when moving models between prototyping and production phases, 3) Managing different Python package dependencies, 4) Tracking and managing many models, and 5) Ensuring high availability of the tracking server. The presentation provided a concrete example of using Spark, MLflow, and Kafka to periodically retrain a model for scoring news articles and handling user feedback in a scalable and reliable way.
CI/CD for Machine Learning with Daniel KobranDatabricks
What we call the public cloud was developed primarily to manage and deploy web servers. The target audience for these products is Dev Ops. While this is a massive and exciting market, the world of Data Science and Deep Learning is very different — and possibly even bigger. Unfortunately, the tools available today are not designed for this new audience and the cloud needs to evolve. This talk would cover what the next 10 years of cloud computing will look like.
The Lyft data platform: Now and in the futuremarkgrover
- Lyft has grown significantly in recent years, providing over 1 billion rides to 30.7 million riders through 1.9 million drivers in 2018 across North America.
- Data is core to Lyft's business decisions, from pricing and driver matching to analyzing performance and informing investments.
- Lyft's data platform supports data scientists, analysts, engineers and others through tools like Apache Superset, change data capture from operational stores, and streaming frameworks.
- Key focuses for the platform include business metric observability, streaming applications, and machine learning while addressing challenges of reliability, integration and scale.
Bighead: Airbnb’s End-to-End Machine Learning Platform with Krishna Puttaswa...Databricks
Bighead is Airbnb's machine learning infrastructure that was created to:
- Standardize and simplify the ML development workflow;
- Reduce the time and effort to build ML models from weeks/months to days/weeks; and
- Enable more teams at Airbnb to utilize ML.
It provides shared services and tools for data management, model training/inference, and model management to make the ML process more efficient and production-ready. This includes services like Zipline for feature storage, Redspot for notebook environments, Deep Thought for online inference, and the Bighead UI for model monitoring.
Scaling out Driverless AI with IBM Spectrum Conductor - Kevin Doyle - H2O AI ...Sri Ambati
IBM Spectrum Conductor can manage H2O Driverless AI instances at scale across multiple nodes in an enterprise data center. Key benefits include the ability to run multiple Driverless AI instances on the same host using GPUs, failover capabilities if an instance fails, and role-based access control for users. The integration improves productivity by providing a shared file system, workload management, and allowing easy start/stop of Driverless AI instances.
A Modern Interface for Data Science on Postgres/Greenplum - Greenplum Summit ...VMware Tanzu
This document discusses providing a modern interface for data science on Postgres and Greenplum databases. It introduces Ibis, a Python library that provides a DataFrame abstraction for SQL systems. Ibis allows defining complex data pipelines and transformations using deferred expressions, providing type checking before execution. The document argues that Ibis could be enhanced to support user-defined functions, saving results to tables, and data science modeling abstractions to provide a full-featured interface for data scientists on SQL databases.
When Apache Spark Meets TiDB with Xiaoyu MaDatabricks
Over the past 10 years, big-data storage layers have focused mainly on analytical use cases, where users typically offload data onto a Hadoop cluster and run queries against HDFS files. People struggle with modifications on append-only storage and with maintaining fragile ETL pipelines.
On the other hand, although Spark SQL has proven to be an effective parallel query processing engine, some tricks common in traditional databases are unavailable because of the characteristics of the underlying storage. TiSpark sits directly on top of a distributed database's (TiDB's) storage engine, extends Spark SQL's planning with its own extensions, and exploits unique features of the database storage engine to achieve functionality not possible for Spark SQL on HDFS. With TiSpark, users can run queries directly on changing, fresh data in real time.
The takeaways from this talk are twofold:
— How to integrate Spark SQL with a distributed database engine, and the benefits of doing so
— How to leverage Spark SQL’s experimental methods to extend its capabilities.
Using BigDL on Apache Spark to Improve the MLS Real Estate Search Experience ...Databricks
BigDL-enabled deep learning analysis of photos attached to property listings in a Multiple Listing Service (MLS) database allowed us to extract image features and identify similar-looking properties. We leveraged this information in a real-time property search application to improve the relevancy of user search results. Imagine identifying a property listing photo you like and having the system suggest other listings you should also review. Traditional real-estate MLS search methods rely on SQL-type queries to search and serve listing results.
However, using BigDL in conjunction with MLSListings standard APIs allows users to include photos as search parameters in real time, based on both image similarity and semantic feature search. The information extracted from listing images is used to improve the relevancy of the search results. To enable this use case, we implemented several CNNs using the BigDL framework on Microsoft Azure-hosted Apache Spark: an image feature extraction and tagging network, which extracts features from real-estate images and classifies them according to Real Estate Standards Organization rules (overall house style, interior and exterior attributes, etc.), and an image similarity network, which compares images belonging to different properties based on their extracted features and produces a similarity score used in search results.
We’ll discuss the above networks in detail and run a live demo of real-estate search results. Key takeaways: a) why invest in Spark BigDL from the start; b) why choose a cloud-based solution from the start; c) the choice of Scala vs. Python.
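A similarity score of the kind described above is commonly computed as the cosine similarity between extracted feature vectors; here is a minimal NumPy sketch, with made-up vectors standing in for real CNN embeddings.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Similarity in [-1, 1] between two image feature vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical feature vectors for three listings' photos.
listing_a = np.array([0.9, 0.1, 0.4])
listing_b = np.array([0.8, 0.2, 0.5])   # a similar-looking property
listing_c = np.array([0.0, 1.0, 0.0])   # a very different property

print(cosine_similarity(listing_a, listing_b))  # close to 1
print(cosine_similarity(listing_a, listing_c))  # much lower
```

Ranking candidate listings by this score against the query photo's features is enough to surface "properties that look like this one" in search results.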
ML at the Edge: Building Your Production Pipeline with Apache Spark and Tens...Databricks
The explosion of data volume in the years to come challenges the idea of a centralized cloud infrastructure that handles all business needs. Edge computing comes to the rescue by pushing computation and data analysis to the edge of the network, avoiding data exchange where that makes sense. One area where data exchange can impose a big overhead is scoring ML models, especially when the data to score are files such as images, e.g. in a computer vision application.
Another concern in some applications is keeping data as private as possible, which is another reason to keep things local. In this talk we will discuss current needs and recent advances in model serving, such as newly introduced formats for pushing models to edge nodes (e.g. mobile phones), and how a unified model-serving architecture could cover current and future needs for both data scientists and data engineers. This architecture is based, among other things, on training models in a distributed fashion with TensorFlow and leveraging Spark to clean data before training (e.g. using the TensorFlow connector).
Finally, we will describe a microservice-based approach for scoring models back on the cloud infrastructure side (where bandwidth can be high), e.g. using TensorFlow Serving, and for updating models remotely with a pull-based approach for edge devices. We will also discuss implementing the proposed architecture and how it might look in a modern deployment environment such as Kubernetes.
In the last several months, MLflow has introduced significant platform enhancements that simplify machine learning lifecycle management. Expanded autologging capabilities, including a new integration with scikit-learn, have streamlined the instrumentation and experimentation process in MLflow Tracking. Additionally, schema management functionality has been incorporated into MLflow Models, enabling users to seamlessly inspect and control model inference APIs for batch and real-time scoring. In this session, we will explore these new features. We will share MLflow’s development roadmap, providing an overview of near-term advancements in the platform.
KFServing, Model Monitoring with Apache Spark and a Feature StoreDatabricks
In recent years, MLOps has emerged to bring DevOps processes to the machine learning (ML) development process, aiming at more automation in the execution of repetitive tasks and at smoother interoperability between tools. Among the different stages in the ML lifecycle, model monitoring involves the supervision of model performance over time, involving the combination of techniques in four categories: outlier detection, data drift detection, explainability and adversarial attacks. Most existing model monitoring tools follow a scheduled batch processing approach or analyse model performance using isolated subsets of the inference data. However, for the continuous monitoring of models, stream processing platforms show several advantages, including support for continuous data analytics, scalable processing of large amounts of data and first-class support for window-based aggregations useful for concept drift detection.
In this talk, we present an open-source platform for serving and monitoring models at scale based on Kubeflow’s model serving framework, KFServing, the Hopsworks Online Feature Store for enriching feature vectors with transformer in KFServing, and Spark and Spark Streaming as general purpose frameworks for monitoring models in production.
We also show how Spark Streaming can use the Hopsworks Feature Store to implement continuous data drift detection, where the Feature Store provides statistics on the distribution of feature values in training, and Spark Streaming computes the statistics on live traffic to the model, alerting if the live traffic differs significantly from the training data. We will include a live demonstration of the platform in action.
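The drift check described above boils down to comparing training-time feature statistics (from the feature store) against statistics computed over a window of live traffic. A simple z-score on the feature mean can stand in for the real implementation; the statistics, window values, and alert threshold below are all invented for illustration.

```python
import math

def mean_drift_zscore(train_mean, train_std, live_values):
    """Z-score of the live window's mean against the training distribution."""
    live_mean = sum(live_values) / len(live_values)
    stderr = train_std / math.sqrt(len(live_values))
    return (live_mean - train_mean) / stderr

# Training statistics would come from the feature store.
train_mean, train_std = 10.0, 2.0

# Two hypothetical windows of live feature values.
drifted = [15.9, 16.2, 15.7, 16.1, 15.8, 16.0, 16.3, 15.9]
stable = [10.1, 9.8, 10.2, 9.9, 10.0, 10.1, 9.7, 10.2]

ALERT_Z = 3.0  # arbitrary alerting threshold
print(abs(mean_drift_zscore(train_mean, train_std, drifted)) > ALERT_Z)  # True
print(abs(mean_drift_zscore(train_mean, train_std, stable)) > ALERT_Z)   # False
```

In the streaming setting, the same computation runs per window via Spark Streaming's window aggregations, with an alert emitted whenever the live statistics differ significantly from the training data.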
Saving Energy in Homes with a Unified Approach to Data and AIDatabricks
Energy wastage by residential buildings is a significant contributor to total worldwide energy consumption. Quby, an Amsterdam based technology company, offers solutions to empower homeowners to stay in control of their electricity, gas and water usage.
Plume - A Code Property Graph Extraction and Analysis LibraryTigerGraph
See all on-demand Graph + AI Sessions: https://www.tigergraph.com/graph-ai-world-sessions/
Get TigerGraph: https://www.tigergraph.com/get-tigergraph/
Multi runtime serving pipelines for machine learningStepan Pushkarev
The talk I gave at Scale By The Bay.
Deploying, serving, and monitoring machine learning models built with different ML frameworks in production. An Envoy-proxy-powered serving mesh. TensorFlow, Spark ML, scikit-learn, and custom functions on CPU and GPU.
A Microservices Framework for Real-Time Model Scoring Using Structured Stream...Databricks
Open-source technologies let developers build microservices frameworks for a myriad of real-time applications, one of which is real-time model scoring. In this session, we will show how to architect such a microservice framework and, in particular, how to use it to build a low-latency, real-time model scoring system. At the core of the architecture lies Apache Spark’s Structured Streaming capability, which delivers low-latency predictions, coupled with Docker and Flask as additional open-source tools for model serving. In this session, you will walk away with:
* Knowledge of enterprise-grade model as a service
* Streaming architecture design principles enabling real-time machine learning
* Key concepts and building blocks for real-time model scoring
* Real-time and production use cases across industries, such as IIoT, predictive maintenance, fraud detection, sepsis prediction, etc.
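The "model as a service" pattern above can be sketched with only the Python standard library in place of Flask; the model weights and feature names below are invented for illustration, and a real deployment would load an actual trained model.

```python
import json
import math
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical trained model: a logistic scorer over two invented features.
WEIGHTS = {"amount": 0.8, "velocity": 1.5}
BIAS = -2.0

def score(features: dict) -> float:
    z = BIAS + sum(WEIGHTS.get(k, 0.0) * v for k, v in features.items())
    return 1.0 / (1.0 + math.exp(-z))

class ScoringHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read a JSON feature vector, respond with a JSON score.
        length = int(self.headers.get("Content-Length", 0))
        features = json.loads(self.rfile.read(length))
        body = json.dumps({"score": score(features)}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo quiet

def serve(port=0):
    """Create the scoring server; port=0 lets the OS pick a free port."""
    return HTTPServer(("127.0.0.1", port), ScoringHandler)
```

Calling `serve(8080).serve_forever()` exposes the model as a POST endpoint; a production system would add batching, health checks, and package the service in a Docker container, with Structured Streaming feeding it features.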
In this session you will learn how H&M created a reference architecture for deploying their machine learning models on Azure using Databricks, following DevOps principles. The architecture is currently used in production and has been iterated on multiple times to address some of the discovered pain points. The presenting team is responsible for ensuring that best practices are implemented across all H&M use cases, covering hundreds of models across the entire H&M group. This architecture not only lets data scientists use notebooks for exploration and modeling but also gives engineers a way to build robust, production-grade code for deployment. The session will also cover topics such as lifecycle management, traceability, automation, scalability, and version control.
Building Intelligent Applications, Experimental ML with Uber’s Data Science W...Databricks
In this talk, we will explore how Uber enables rapid experimentation of machine learning models and optimization algorithms through Uber’s Data Science Workbench (DSW). DSW covers a series of stages in the data scientist’s workflow, including data exploration, feature engineering, machine learning model training, testing, and production deployment. DSW provides interactive notebooks for multiple languages with on-demand resource allocation, and lets users share their work through community features.
It also supports notebooks and intelligent applications backed by Spark job servers. Deep learning applications based on TensorFlow and Torch can be brought into DSW smoothly, with resource management taken care of by the system. The environment in DSW is customizable: users can bring their own libraries and frameworks. Moreover, DSW supports Shiny and Python dashboards as well as many other in-house visualization and mapping tools.
In the second part of this talk, we will explore use cases where custom machine learning models developed in DSW are productionized within the platform. Uber applies machine learning extensively to solve some hard problems; use cases include calculating the right prices for rides in over 600 cities and applying NLP to customer feedback to offer safe rides and reduce support costs. We will look at the various options evaluated for productionizing custom models (server-based and serverless). We will also look at how DSW integrates into Uber’s larger ML ecosystem, e.g. model/feature stores and other ML tools, to realize the vision of a complete ML platform for Uber.
H2O World - Survey of Available Machine Learning Frameworks - Brendan HergerSri Ambati
H2O World 2015 - Brendan Herger of Capital One
- Powered by the open source machine learning software H2O.ai. Contributors welcome at: https://github.com/h2oai
- To view videos on H2O open source machine learning software, go to: https://www.youtube.com/user/0xdata
The power of RapidMiner, showing the direct marketing demoWessel Luijben
Direct marketing allows businesses to target specific customer demographics like age, gender, and marital status. It has expanded beyond mail to include digital channels like text, email, and online ads. RapidMiner's direct marketing wizard helps businesses invest in the highest converting marketing actions and reduce costs through improved targeting. It provides a table of top customers to target, shows which customer properties most impact response rates, and evaluates the predictive model to determine if more customer data is needed.
CASE 02: SAIGON COOPMART
Logistics and supply chain play an important, even critical, role in the success of Saigon Coopmart. Most supermarkets around the world follow an identical model in which a warehouse is placed next to the supermarket for stock storage, and the warehouse is roughly the same size as the supermarket. However, facing harsh competition and weak finances, Saigon Coopmart decided to follow a different model with a very small warehouse. This allows Saigon Coopmart to open more supermarkets; in exchange, each warehouse holds stock for only a day, or at most two, compared with the ordinary model in which a warehouse can store enough for a week or more. As a consequence, Saigon Coopmart has to ship to its supermarkets much more frequently than competitors such as Big C.
1) The team will predict fall undergraduate enrollment at the University of New Mexico using data from 1981 to 2009.
2) They will use simple linear regression to build models predicting enrollment based on January unemployment rates, June high school graduation rates, and monthly per capita income in Albuquerque.
3) The best predictor of enrollment is per capita income, which has an R-squared of 88.86% and standard error of 1366, indicating that as per capita income increases, enrollment also increases.
The document discusses the general linear model (GLM) as an extension of simple and multiple linear regression models. It describes how the GLM allows for modeling more complex relationships between variables by including transformed, squared, and interaction terms. Specifically, it explains how curvilinear relationships can be modeled by adding squared terms and how interaction effects between two variables can be modeled by including an interaction term. The document also discusses transforming the dependent variable to correct for non-constant variance.
The student conducted an independent linear regression analysis to model the relationship between the closing prices of the S&P 500 ETF (SPY) and McDonald's (MCD) stock. A linear regression model was fitted with MCD closing price as the response variable and SPY closing price as the predictor variable. The model found a statistically significant linear relationship between the two variables, with SPY price explaining about 75% of the variation in MCD price. When using the model to predict MCD's closing price based on SPY's actual later closing price, the model prediction was within 0.4% of the actual MCD price.
This document outlines quantitative methods taught in a course on Quantitative Applications in Management. It discusses topics like arithmetic mean, median, mode, standard deviation, correlation, regression, time series analysis, and index numbers. Calculation methods are provided for individual series, discrete series and continuous series. Common statistical measures and their applications in business and management are covered.
This document presents information about regression analysis. It defines regression as the dependence of one variable on another and lists the objectives as defining regression, describing its types (simple, multiple, linear), assumptions, models (deterministic, probabilistic), and the method of least squares. Examples are provided to illustrate simple regression of computer speed on processor speed. Formulas are given to calculate the regression coefficients and lines for predicting y from x and x from y.
This lesson begins by explaining the linear regression method’s characteristics and uses. Linear regression attempts to fit the best line through the data. Using an example and the forecasting process, we apply the linear regression method to create a model and forecast based upon it.
A quick introduction to linear and logistic regression using Python. Part of the Data Science Bootcamp held in Amman by the Jordan Open Source Association Dec/Jan 2015. Reference code can be found on Github https://github.com/jordanopensource/data-science-bootcamp/tree/master/MachineLearning/Session1
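As a flavor of what such a session covers, here is a tiny logistic regression trained by gradient descent in plain NumPy; the synthetic data and hyperparameters are invented for illustration and are not the bootcamp's reference code.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic 1-D data: class 1 tends to have larger x than class 0.
x0 = rng.normal(-1.0, 1.0, 50)
x1 = rng.normal(+1.0, 1.0, 50)
X = np.concatenate([x0, x1])
y = np.concatenate([np.zeros(50), np.ones(50)])

w, b = 0.0, 0.0
lr = 0.1
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(w * X + b)))   # predicted probabilities
    w -= lr * np.mean((p - y) * X)           # gradient of log loss w.r.t. w
    b -= lr * np.mean(p - y)                 # gradient of log loss w.r.t. b

accuracy = np.mean(((1.0 / (1.0 + np.exp(-(w * X + b)))) > 0.5) == y)
print(w, b, accuracy)
```

Because class 1 sits to the right of class 0, the learned weight comes out positive and the classifier separates most of the training points.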
This document provides an overview and summary of linear regression analysis theory and computing. It discusses linear regression models and the goals of regression analysis. It also introduces some key topics that will be covered in the book, including simple and multiple linear regression, model diagnosis, generalized linear models, Bayesian linear regression, and computational methods like least squares estimation. The book aims to serve as a one-semester textbook on fundamental regression analysis concepts for graduate students.
The document is a presentation on machine learning and simple linear regression. It introduces the concepts of a regression model, fitting a linear regression line to data by minimizing the residual sum of squares, and using the fitted line to make predictions. It discusses representing the linear regression model as an equation relating the output variable (y) to the input or feature (x), with parameters (w0, w1) estimated from training data. The parameters can be estimated by taking the gradient of the residual sum of squares and setting it equal to zero to find the optimal values for w0 and w1 that best fit the data.
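Setting the gradient of the residual sum of squares to zero yields closed-form estimates for w0 and w1, which can be checked numerically; the five data points below are invented for illustration.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])   # roughly y = 2x

# Closed-form least squares from d(RSS)/dw1 = 0 and d(RSS)/dw0 = 0:
w1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
w0 = y.mean() - w1 * x.mean()

print(w0, w1)   # w0 ≈ 0.14, w1 ≈ 1.96
```

The same estimates fall out of `np.polyfit(x, y, 1)`, which solves the identical least-squares problem.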
Chapt 11 & 12 linear & multiple regression minitabBoyu Deng
The document discusses linear regression and correlation. It defines linear regression as finding the line of best fit that minimizes the sum of the squared residuals. The regression coefficients (slope and intercept) that achieve this are calculated using sums of squares and cross-products. Hypothesis tests are used to determine if the regression coefficients are statistically significant. Confidence and prediction intervals are also discussed to quantify the uncertainty in the regression line and predicted values.
This document describes a simple linear regression analysis to model the relationship between the number of followers on Twitter (response variable) and years since joining Twitter, number of tweets, photos/videos posted, and people followed (predictor variables) for the top 40 most followed Twitter accounts. The analysis found that years since joining had the strongest linear relationship with followers. The regression equation estimated followers would increase by 12.52 million for each additional year on Twitter. Residual analyses found the model fit the data well although the residuals were not normally distributed.
This document provides information about regression analysis and linear regression. It defines regression analysis as using relationships between quantitative variables to predict a dependent variable from independent variables. Linear regression finds the best fitting straight line relationship between variables. The simple linear regression equation is given as Y = a + bX, where a and b are estimated parameters calculated from sample data. An example is worked through, showing how to calculate the regression equation from data, graph the relationship, and use the equation to estimate values.
This document provides an overview of simple and multiple linear regression analysis. It discusses key concepts such as:
- Dependent and independent variables in bivariate linear regression
- Using scatter plots to explore relationships
- Estimating regression coefficients and equations for simple and multiple regression models
- Using regression models to predict outcomes based on independent variable values
- Conducting statistical tests on overall regression models and individual coefficients
This document summarizes key concepts in simple linear regression:
- Simple linear regression models the relationship between a dependent variable y and independent variable x as y = β0 + β1x + ε, seeking to estimate β0 and β1 using the least squares method.
- The least squares method minimizes the sum of squared residuals to calculate the slope b1 and y-intercept b0 of the estimated regression equation ŷ = b0 + b1x.
- The coefficient of determination R2 indicates how well the regression line represents the data, calculated as the proportion of total variation explained by the regression.
- A worked example illustrates finding the estimated regression equation, R2, and correlation
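A worked example of that kind can be run directly: fit b0 and b1 by least squares, then compute R² as the proportion of total variation explained (the data points below are invented).

```python
import numpy as np

x = np.array([3.0, 5.0, 2.0, 8.0, 6.0])
y = np.array([9.0, 14.0, 7.0, 22.0, 17.0])

# Least-squares slope and intercept for ŷ = b0 + b1·x.
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
y_hat = b0 + b1 * x

ss_total = np.sum((y - y.mean()) ** 2)   # total variation
ss_resid = np.sum((y - y_hat) ** 2)      # unexplained variation
r2 = 1.0 - ss_resid / ss_total           # proportion explained

print(b0, b1, r2)
```

For simple linear regression, R² equals the squared correlation between x and y, which makes a handy consistency check.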
Simple Regression presentation is a
partial fulfillment to the requirement in PA 297 Research for Public Administrators, presented by Atty. Gayam , Dr. Cabling and Mr. Cagampang
Logistic regression for ordered dependant variable with more than 2 levelsArup Guha
This document discusses multinomial logistic regression models. Multinomial logistic regression can handle dependent variables with more than two categories that may be ordinal (ordered categories) or nominal (unordered categories). The document focuses on proportional odds cumulative logit models, which model ordinal dependent variables by considering the natural ordering of categories. It provides an example of using SAS code to fit a proportional odds model to model the impact of radiation exposure on human health.
The document discusses simple linear regression. It defines key terms like regression equation, regression line, slope, intercept, residuals, and residual plot. It provides examples of using sample data to generate a regression equation and evaluating that regression model. Specifically, it shows generating a regression equation from bivariate data, checking assumptions visually through scatter plots and residual plots, and interpreting the slope as the marginal change in the response variable from a one unit change in the explanatory variable.
The document discusses simple linear regression and correlation methods. It defines deterministic and probabilistic models for describing the relationship between two variables. A simple linear regression model assumes a population regression line with intercept a and slope b, where observations may deviate from the line by some random error e. Key assumptions of the model are that e has a normal distribution with mean 0 and constant variance across values of x, and errors are independent. The slope b estimates the average change in y per unit change in x.
Parallel Linear Regression in Interative Reduce and YARNDataWorks Summit
Online learning techniques, such as Stochastic Gradient Descent (SGD), are powerful when applied to risk minimization and convex games on large problems. However, their sequential design prevents them from taking advantage of newer distributed frameworks such as Hadoop/MapReduce. In this session, we will take a look at how we parallelized linear regression parameter optimization on the next-gen YARN framework Iterative Reduce.
In this session, we will introduce “Knitting Boar”, an open-source Java library for performing distributed online learning on a Hadoop cluster under YARN. We will give an overview of how Woven Wabbit works and examine the lessons learned from YARN application construction.
This document discusses machine learning and the Knitting Boar parallel machine learning library. It provides an introduction to machine learning concepts like classification, recommendation, and clustering. It also introduces Mahout for machine learning on Hadoop. The document describes the Knitting Boar library, which uses YARN to parallelize Mahout's stochastic gradient descent algorithm. It shows how Knitting Boar allows machine learning models to train faster by distributing work across multiple nodes.
KnittingBoar Toronto Hadoop User Group Nov 27 2012Adam Muise
This document discusses machine learning and parallel iterative algorithms. It provides an introduction to machine learning and Mahout. It then describes Knitting Boar, a system for parallelizing stochastic gradient descent on Hadoop YARN. Knitting Boar partitions data among workers that perform online logistic regression in batches. The workers send gradient updates to a master node, which averages the updates to produce a new global model. Experimental results show Knitting Boar achieves roughly linear speedup. The document concludes by discussing developing YARN applications and the Knitting Boar codebase.
Knitting boar - Toronto and Boston HUGs - Nov 2012Josh Patterson
1) The document discusses machine learning and parallel iterative algorithms like stochastic gradient descent. It introduces the Mahout machine learning library and describes an implementation of parallel SGD called Knitting Boar that runs on YARN.
2) Knitting Boar parallelizes Mahout's SGD algorithm by having worker nodes process partitions of the training data in parallel while a master node merges their results.
3) The author argues that approaches like Knitting Boar and IterativeReduce provide better ways to implement machine learning algorithms for big data compared to traditional MapReduce.
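The master-averaging step described in 2) can be sketched as each worker running SGD on its own data partition and the master averaging the resulting parameter vectors; this is a toy linear model for illustration, not Knitting Boar's actual code.

```python
import numpy as np

def sgd_on_partition(X, y, w, lr=0.01, epochs=5):
    """Plain SGD for least squares on one worker's data partition."""
    w = w.copy()
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            grad = 2.0 * (xi @ w - yi) * xi
            w -= lr * grad
    return w

rng = np.random.default_rng(1)
true_w = np.array([2.0, -1.0])
X = rng.normal(size=(400, 2))
y = X @ true_w + rng.normal(scale=0.1, size=400)

# Split the data across 4 workers; each trains locally, the master averages.
global_w = np.zeros(2)
for _ in range(3):   # a few master/worker rounds
    worker_ws = [sgd_on_partition(Xp, yp, global_w)
                 for Xp, yp in zip(np.array_split(X, 4), np.array_split(y, 4))]
    global_w = np.mean(worker_ws, axis=0)   # master merges by averaging

print(global_w)   # approaches [2, -1]
```

In the real system the four workers would run on separate YARN containers and ship their parameter vectors over the network; only the merge-by-averaging logic is the same here.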
Apache Hadoop India Summit 2011 Keynote talk "Programming Abstractions for Sm...Yahoo Developer Network
This document discusses programming abstractions for smart applications on clouds. It proposes a new programming model called Deformable Mesh Abstraction (DMA) that addresses limitations in existing models like MapReduce. DMA allows tasks to recursively spawn new tasks at runtime, supports efficient communication through a shared structure, and can operate on changing datasets. The document describes how DMA can model heuristic problem solving and presents case studies applying DMA to AI planners. It also discusses how DMA could be extended to support file systems and integrated with Hadoop.
Introduction to map reduce s. jency jayastina II MSC COMPUTER SCIENCE BON SEC...jencyjayastina
The document discusses MapReduce, a programming model for processing large datasets in parallel across a distributed cluster. It describes how MapReduce works by specifying computation in terms of mapping and reducing functions; the underlying runtime system automatically parallelizes the computation and handles failures and communication. MapReduce is the processing engine of Apache Hadoop and was derived from Google's MapReduce. The mapping step converts data into key-value pairs, while the reducing step combines the mappers' output into smaller sets of tuples. MapReduce is mainly used for parallel processing of large datasets stored in Hadoop clusters.
The document discusses MapReduce and the Hadoop framework. It provides an overview of how MapReduce works, examples of problems it can solve, and how Hadoop implements MapReduce at scale across large clusters in a fault-tolerant manner using the HDFS distributed file system and YARN resource management.
Characterization of hadoop jobs using unsupervised learningJoão Gabriel Lima
This document summarizes research characterizing Hadoop jobs using unsupervised learning techniques. The researchers clustered over 11,000 Hadoop jobs from Yahoo production clusters into 8 groups based on job metrics. The centroids of each cluster represent characteristic jobs and show differences in map/reduce tasks and data processed. Identifying common job profiles can help benchmark and optimize Hadoop performance.
Abstract: With the development of information technology, the scale of data is increasing quickly, and this massive data poses a great challenge for data processing and classification. Several algorithms have been proposed to cluster such data efficiently; one of them is the random forest algorithm, which is used here for feature subset selection. Feature selection involves identifying a subset of the most useful features that produces results compatible with the original full set of features, and it is achieved by classifying the given data. Efficiency is measured by the time required to find a subset of features, while effectiveness relates to the quality of that subset. The existing system uses a fast clustering-based feature selection algorithm, which has proven powerful, but as dataset sizes grow rapidly it becomes less efficient because clustering the datasets takes considerably more time. Hence, this project proposes a new implementation that clusters the data efficiently and persists it in the back-end database to reduce that time, achieved with a scalable random forest algorithm. The scalable random forest is implemented using MapReduce programming (a big-data implementation) to cluster the data efficiently. It works in two phases: the first gathers the datasets and persists them in the datastore, and the second clusters and classifies the data. The process is implemented on Google App Engine's Hadoop platform, a widely used open-source implementation of Google's distributed file system, using the MapReduce framework for scalable distributed and cloud computing. The MapReduce programming model provides an efficient framework for processing large datasets in a highly parallel fashion.
It has become the most popular parallel model for data processing on cloud computing platforms; designing traditional machine learning algorithms within the MapReduce programming framework is therefore essential when dealing with massive datasets. Keywords: data mining, Hadoop, MapReduce, clustering tree.
Title: Big Data on Implementation of Many to Many Clustering
Author: Ravi. R, Michael. G
ISSN 2350-1022
International Journal of Recent Research in Mathematics Computer Science and Information Technology
Paper Publications
The International Journal of Computational Engineering Research (IJCER) is an international online monthly journal published in English. The journal publishes original research that contributes significantly to furthering scientific knowledge in engineering and technology.
The document describes MapReduce, a programming model and implementation for processing large datasets across clusters of computers. The model uses map and reduce functions to parallelize computations. Map processes key-value pairs to generate intermediate pairs, and reduce merges values with the same intermediate key. The implementation handles parallelization, distribution, and fault tolerance transparently. Hundreds of programs have been implemented using MapReduce at Google, processing terabytes of data on thousands of machines daily.
Hadoop is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Hadoop features include a distributed file system called HDFS that stores data on commodity machines, providing fault tolerance. It also provides a programming model called MapReduce that allows users to write applications as a set of map and reduce functions that can automatically parallelize across a distributed system.
Mapreduce - Simplified Data Processing on Large ClustersAbhishek Singh
The document describes MapReduce, a programming model and implementation for processing large datasets across clusters of computers. It allows users to write map and reduce functions to parallelize tasks. The MapReduce library automatically parallelizes jobs, distributes data and tasks, handles failures and coordinates communication between machines. It is scalable, processing terabytes of data on thousands of machines, and easy for programmers without parallel experience to use.
This document provides an introduction to the MapReduce programming model. It describes how MapReduce, inspired by Lisp's map and reduce functions, works by dividing tasks into mapping and reducing parts that are distributed and processed in parallel. It then gives examples of using MapReduce for word counting and for calculating total sales. It also covers the MapReduce daemons in Hadoop and includes demo code for summing array elements in Java and for word counting on a text file using the Hadoop framework in Python.
The document provides an overview of MapReduce, including:
1) MapReduce is a programming model and implementation that allows for large-scale data processing across clusters of computers. It handles parallelization, distribution, and reliability.
2) The programming model involves mapping input data to intermediate key-value pairs and then reducing by key to output results.
3) Example uses of MapReduce include word counting and distributed searching of text.
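The map/reduce model these summaries describe can be sketched in-process. Below is a minimal, Hadoop-free word-count illustration of the model: map emits (word, 1) pairs, a shuffle step groups them by key, and reduce sums each key's values. The class and method names are mine, not from any of the documents above.

```java
import java.util.*;

// In-process sketch of the MapReduce model: map emits intermediate
// key-value pairs, the "shuffle" groups them by key, reduce merges values.
public class WordCountSketch {
    // Map: one input line -> list of (word, 1) pairs
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.toLowerCase().split("\\s+"))
            if (!word.isEmpty()) pairs.add(new AbstractMap.SimpleEntry<>(word, 1));
        return pairs;
    }

    // Reduce: merge all values seen for one intermediate key
    static int reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int v : values) sum += v;
        return sum;
    }

    static Map<String, Integer> run(List<String> lines) {
        // Shuffle phase: group intermediate pairs by key
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String line : lines)
            for (Map.Entry<String, Integer> p : map(line))
                grouped.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
        Map<String, Integer> out = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet())
            out.put(e.getKey(), reduce(e.getKey(), e.getValue()));
        return out;
    }

    public static void main(String[] args) {
        System.out.println(run(Arrays.asList("to be or not to be")));
        // {be=2, not=1, or=1, to=2}
    }
}
```

In real MapReduce the grouped keys are partitioned across many machines and the framework handles distribution and fault tolerance; only the two user-supplied functions stay this simple.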
This document provides an overview of next generation analytics with YARN, Spark and GraphLab. It discusses how YARN addressed limitations of Hadoop 1.0 like scalability, locality awareness and shared cluster utilization. It also describes the Berkeley Data Analytics Stack (BDAS) which includes Spark, and how companies like Ooyala and Conviva use it for tasks like iterative machine learning. GraphLab is presented as ideal for processing natural graphs and the PowerGraph framework partitions such graphs for better parallelism. PMML is introduced as a standard for defining predictive models, and how a Naive Bayes model can be defined and scored using PMML with Spark and Storm.
This deck was presented at the Spark meetup at Bangalore. The key idea behind the presentation was to focus on limitations of Hadoop MapReduce and introduce both Hadoop YARN and Spark in this context. An overview of the other aspects of the Berkeley Data Analytics Stack was also provided.
The document proposes a system called Twiche that uses caching to improve the efficiency of incremental MapReduce jobs. Twiche indexes cached items from the map phase by their original input and applied operations. This allows it to identify duplicate computations and avoid reprocessing the same data. The experimental results show that Twiche can eliminate all duplicate tasks in incremental MapReduce jobs, reducing execution time and CPU utilization compared to traditional MapReduce.
This document discusses using DL4J and DataVec to build production-ready deep learning workflows for time series and text data. It provides an example of modeling sensor data with recurrent neural networks (RNNs) and character-level text generation with LSTMs. Key points include:
- DL4J is a deep learning framework for Java that runs on Spark and supports CPU/GPU. DataVec is a tool for data preprocessing.
- The document demonstrates loading and transforming sensor time series data with DataVec and training an RNN on the data with DL4J.
- It also shows vectorizing character-level text data from beer reviews with DataVec and using an LSTM in DL4J to generate new text.
This document discusses using DL4J and DataVec to build deep learning workflows for modeling time series sensor data with recurrent neural networks. It provides an example of loading and transforming sensor data with DataVec, configuring an RNN with DL4J, and training the model both locally and distributed on Spark. The overall workflow involves extracting, transforming, and loading data with DataVec, vectorizing it, modeling with DL4J, evaluating performance, and deploying trained models for execution on Spark/Hadoop platforms.
Deep Learning and Recurrent Neural Networks in the Enterprise (Josh Patterson)
This document discusses deep learning and recurrent neural networks. It provides an overview of deep learning, including definitions, automated feature learning, and popular deep learning architectures. It also describes DL4J, a tool for building deep learning models in Java and Scala, and discusses applications of recurrent neural networks for tasks like anomaly detection using time series data and audio processing.
Modeling Electronic Health Records with Recurrent Neural Networks (Josh Patterson)
Time series data is increasingly ubiquitous. This trend is especially obvious in health and wellness, with both the adoption of electronic health record (EHR) systems in hospitals and clinics and the proliferation of wearable sensors. In 2009, intensive care units in the United States treated nearly 55,000 patients per day, generating digital-health databases containing millions of individual measurements, most of those forming time series. In the first quarter of 2015 alone, over 11 million health-related wearables were shipped by vendors. Recording hundreds of measurements per day per user, these devices are fueling a health time series data explosion. As a result, we will need ever more sophisticated tools to unlock the true value of this data to improve the lives of patients worldwide.
Deep learning, specifically with recurrent neural networks (RNNs), has emerged as a central tool in a variety of complex temporal-modeling problems, such as speech recognition. However, RNNs are also among the most challenging models to work with, particularly outside the domains where they are widely applied. Josh Patterson, David Kale, and Zachary Lipton bring the open source deep learning library DL4J to bear on the challenge of analyzing clinical time series using RNNs. DL4J provides a reliable, efficient implementation of many deep learning models embedded within an enterprise-ready open source data ecosystem (e.g., Hadoop and Spark), making it well suited to complex clinical data. Josh, David, and Zachary offer an overview of deep learning and RNNs and explain how they are implemented in DL4J. They then demonstrate a workflow example that uses a pipeline based on DL4J and Canova to prepare publicly available clinical data from PhysioNet and apply the DL4J RNN.
Building Deep Learning Workflows with DL4J (Josh Patterson)
In this session we will take a look at a practical review of what is deep learning and introduce DL4J. We’ll look at how it supports deep learning in the enterprise on the JVM. We’ll discuss the architecture of DL4J’s scale-out parallelization on Hadoop and Spark in support of modern machine learning workflows. We’ll conclude with a workflow example from the command line interface that shows the vectorization pipeline in Canova producing vectors for DL4J’s command line interface to build deep learning models easily.
Deep learning with DL4J - Hadoop Summit 2015 (Josh Patterson)
This document discusses deep learning and DL4J. It begins with an overview of deep learning, describing it as automated feature engineering through chained techniques like restricted Boltzmann machines. It then introduces DL4J, describing it as an enterprise-grade Java implementation of deep learning that supports parallelization on Hadoop, Spark, and GPUs. The rest of the document discusses building deep learning workflows with DL4J and related tools like Canova and Arbiter, providing an example of vectorizing and modeling iris data from a CSV file on the command line.
Josh Patterson presented on deep learning and DL4J. He began with an overview of deep learning, explaining it as automated feature engineering where machines learn representations of the world. He then discussed DL4J, describing it as the "Hadoop of deep learning" - an open source deep learning library with Java, Scala, and Python APIs that supports parallelization on Hadoop, Spark, and GPUs. He demonstrated building deep learning workflows with DL4J and Canova, using the Iris dataset as an example to show how data can be vectorized with Canova and then a model trained on it using DL4J from the command line. He concluded by describing Skymind as a distribution of DL4J with enterprise support.
Deep Learning Intro - Georgia Tech - CSE6242 - March 2015 (Josh Patterson)
This document provides an overview of deep learning, including:
- Deep learning involves using neural networks with multiple hidden layers, like deep belief networks and convolutional neural networks, to learn complex features from data.
- Deep belief networks use stacked restricted Boltzmann machines to learn progressively more complex features, which are then used to initialize and train a neural network.
- Convolutional neural networks use layers of convolutions to learn higher-order features from images and are well-suited for tasks like image recognition.
- Recurrent and recursive neural networks can model temporal and hierarchical relationships in data like text or images.
- Frameworks like DL4J provide tools for implementing and training deep learning models on the JVM.
Vectorization - Georgia Tech - CSE6242 - March 2015 (Josh Patterson)
This document discusses vectorization, which is the process of converting raw data like text into numerical feature vectors that can be fed into machine learning algorithms. It covers the vector space model for text vectorization where each unique word is mapped to an index in a vector and the value is the word count. Common text vectorization strategies like bag-of-words, TF-IDF, and kernel hashing are explained. General vectorization techniques for different attribute types like nominal, ordinal, interval and ratio are also overviewed along with feature engineering methods and the Canova tool.
Chattanooga Hadoop Meetup - Hadoop 101 - November 2014 (Josh Patterson)
Josh Patterson is a principal solution architect who has worked with Hadoop at Cloudera and Tennessee Valley Authority. Hadoop is an open-source software framework for distributed storage and processing of large datasets across clusters of commodity servers. It allows for consolidating mixed data types at low cost while keeping raw data always available. Hadoop uses commodity hardware and scales to petabytes without changes. Its distributed file system provides fault tolerance and replication while its processing engine handles all data types and scales processing.
Georgia Tech cse6242 - Intro to Deep Learning and DL4J (Josh Patterson)
Introduction to deep learning and DL4J - http://deeplearning4j.org/ - a guest lecture by Josh Patterson at Georgia Tech for the cse6242 graduate class.
Intro to Vectorization Concepts - GaTech cse6242 (Josh Patterson)
Vectorization is the process of converting text into numeric vectors that can be used by machine learning algorithms. There are several common techniques for vectorization, including the bag-of-words model, TF-IDF, and n-grams. The bag-of-words model represents documents as vectors counting the number of times each word appears. TF-IDF improves on this by weighting words based on their frequency in documents and inverse frequency in the corpus. N-grams consider sequences of words, such as bigrams like "Coca Cola", as single units. Kernel hashing allows vectorization in a single pass by mapping words to a fixed-sized vector using a hash function.
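Kernel hashing, as described above, can be sketched in a few lines: each word is hashed straight to an index in a fixed-size vector, so vectorization takes a single pass and needs no vocabulary dictionary. This is a generic illustration of the technique, not Canova's implementation; the vector size and the use of `String.hashCode` are my choices.

```java
// Minimal sketch of hashed bag-of-words ("kernel hashing") vectorization.
// Collisions are possible by design; a larger numFeatures makes them rarer.
public class HashingVectorizer {
    static double[] vectorize(String text, int numFeatures) {
        double[] vec = new double[numFeatures];
        for (String word : text.toLowerCase().split("\\s+")) {
            if (word.isEmpty()) continue;
            // Hash the word directly to a slot index (floorMod keeps it non-negative)
            int idx = Math.floorMod(word.hashCode(), numFeatures);
            vec[idx] += 1.0;   // bag-of-words count at the hashed index
        }
        return vec;
    }

    public static void main(String[] args) {
        double[] v = vectorize("Coca Cola Coca", 16);
        double sum = 0;
        for (double x : v) sum += x;
        System.out.println("total counts: " + sum);  // 3.0
    }
}
```

Note that, unlike the dictionary-based bag-of-words model, two different words can collide into the same slot; TF-IDF weighting can still be layered on top by scaling the counts.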
Hadoop Summit 2014 - San Jose - Introduction to Deep Learning on Hadoop (Josh Patterson)
As the data world undergoes its cambrian explosion phase our data tools need to become more advanced to keep pace. Deep Learning has emerged as a key tool in the non-linear arms race of machine learning. In this session we will take a look at how we parallelize Deep Belief Networks in Deep Learning on Hadoop’s next generation YARN framework with Iterative Reduce. We’ll also look at some real world examples of processing data with Deep Learning such as image classification and natural language processing.
MLConf 2013: Metronome and Parallel Iterative Algorithms on YARN (Josh Patterson)
This document summarizes Josh Patterson's work on parallel machine learning algorithms. It discusses his past publications and work on routing algorithms and metaheuristics. It then outlines his work developing parallel versions of algorithms like linear regression, logistic regression, and neural networks using Hadoop and YARN. It presents performance results showing these parallel algorithms can achieve close to linear speedup. It also discusses techniques used like vector caching and unit testing frameworks. Finally, it discusses future work on algorithms like Adagrad and parallel quasi-Newton methods.
Have you ever been recommended a friend on Facebook? Or an item you might be interested in on Amazon? If so then you’ve benefitted from the value of recommendation systems. Recommendation systems apply knowledge discovery techniques to the problem of making recommendations that are personalized for each user. Recommendation systems are one way we can use algorithms to help us sort through the masses of information to find the “good stuff” in a very personalized way.
Josh Patterson gave a presentation on Hadoop and how it has been used. He discussed his background working on Hadoop projects including for the Tennessee Valley Authority. He outlined what Hadoop is, how it works, and examples of use cases. This includes how Hadoop was used to store and analyze large amounts of smart grid sensor data for the openPDC project. He discussed integrating Hadoop with existing enterprise systems and tools for working with Hadoop like Pig and Hive.
Hadoop Summit EU 2013: Parallel Linear Regression, IterativeReduce, and YARN
2. Josh Patterson
Email: [email protected]
Twitter: @jpatanooga
Github: https://github.com/jpatanooga
Past:
Published in IAAI-09: “TinyTermite: A Secure Routing Algorithm”
Grad work in Meta-heuristics, Ant-algorithms
Tennessee Valley Authority (TVA): Hadoop and the Smartgrid
Cloudera: Principal Solution Architect
Today: Independent Consultant
5. The World as Optimization
Data tells us about our model/engine/product
We take this data and evolve our product towards a state of minimal market error
WSJ Special Section, Monday March 11, 2013:
Zynga changing games based off player behavior
UPS cut fuel consumption by 8.4MM gallons
Ford used sentiment analysis to look at how new car features would be received
6. The Modern Data Landscape
Apps are coming but they need
Platforms
Components
Workflows
Lots of investment in Hadoop in this space
Lots of ETL pipelines
Lots of descriptive Statistics
Growing interest in Machine Learning
7. Hadoop as The Linux of Data
Hadoop has won the Cycle
Gartner: Hadoop will be in 2/3s of advanced analytics products by 2015 [1]
“Hadoop is the kernel of a distributed operating system, and all the other components around the kernel are now arriving on this stage” ---Doug Cutting
8. Today’s Hadoop ML Pipeline
Data cleansing / ETL performed with Hive or Pig
Data In Place Processed
Mahout
R
Custom MapReduce Algorithm
Or Externally Processed
SAS
SPSS
KXEN
Weka
9. As Focus Shifts to Applications
Data rates have been climbing fast
Speed at Scale becomes the new Killer App
Companies will want to leverage the Big Data infrastructure they’ve already been working with
Hadoop
HDFS as main storage system
A drive to validate big data investments with results
Emergence of applications which create “data products”
10. Patterson’s Law
“As the percent of your total data held in a storage system approaches 100%, the amount of in-system processing and analytics also approaches 100%”
11. Tools Will Move onto Hadoop
Already seeing this with Vendors
Who hasn’t announced a SQL engine on Hadoop lately?
Trend will continue with machine learning tools
Mahout was the beginning
More are following
But what about parallel iterative algorithms?
12. Distributed Systems Are Hard
Lots of moving parts
Especially as these applications become more complicated
Machine learning can be a non-trivial operation
We need great building blocks that work well together
I agree with Jimmy Lin [3]: “keep it simple”
“make sure costs don’t outweigh benefits”
Minimize “Yet Another Tool To Learn” (YATTL) as much as we can!
13. To Summarize
Data moving into Hadoop everywhere
Patterson’s Law
Focus on Hadoop, build around the next-gen “Linux of data”
Need simple components to build next-gen data apps
They should work cleanly with the cluster the Fortune 500 has: Hadoop
Also should be easy to integrate into Hadoop and the Hadoop-tool ecosystem
Minimize YATTL
15. Linear Regression
In linear regression, data is modeled using linear predictor functions, and the unknown model parameters are estimated from the data.
We use optimization techniques like Stochastic Gradient Descent to find the coefficients in the model:
Y = (1*x0) + (c1*x1) + … + (cN*xN)
17. Stochastic Gradient Descent
Hypothesis about data
Cost function
Update function
Andrew Ng’s Tutorial: https://class.coursera.org/ml/lecture/preview_view/11
18. Stochastic Gradient Descent
Simple gradient descent procedure
Loss function needs to be convex (with exceptions)
Linear Regression SGD:
Loss function: squared error of prediction
Prediction: linear combination of coefficients and input variables
(Diagram: Training Data → Training → Model)
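The SGD recipe on this slide — linear prediction, squared-error loss, per-example updates — fits in a few lines of Java. This is a generic single-machine sketch, not Metronome's or Mahout's code; the class and method names, learning rate, and toy data are mine.

```java
// Minimal single-machine SGD for linear regression:
// squared-error loss, per-example parameter updates.
public class LinearRegressionSGD {
    // Prediction: linear combination of coefficients and input variables
    static double predict(double[] w, double[] x) {
        double y = 0.0;
        for (int i = 0; i < w.length; i++) y += w[i] * x[i];
        return y;
    }

    // One SGD pass over the data, updating w in place
    static void sgdPass(double[] w, double[][] xs, double[] ys, double lr) {
        for (int n = 0; n < xs.length; n++) {
            double err = predict(w, xs[n]) - ys[n];   // gradient of squared error
            for (int i = 0; i < w.length; i++) {
                w[i] -= lr * err * xs[n][i];          // update function
            }
        }
    }

    public static void main(String[] args) {
        // Toy data from y = 3 + 2x, with x0 = 1 as the intercept input
        double[][] xs = {{1, 1}, {1, 2}, {1, 3}, {1, 4}};
        double[] ys = {5, 7, 9, 11};
        double[] w = new double[2];
        for (int epoch = 0; epoch < 2000; epoch++) sgdPass(w, xs, ys, 0.05);
        System.out.printf("intercept=%.3f slope=%.3f%n", w[0], w[1]);
    }
}
```

The loss here is convex, so repeated passes drive the coefficients to the least-squares fit; the learning rate is the knob the "Lessons Learned" slide later warns about.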
19. Mahout’s SGD
Currently Single Process
Multi-threaded parallel, but not cluster parallel
Runs locally, not deployed to the cluster
Tied to logistic regression implementation
20. Current Limitations
Sequential algorithms on a single node only go so far
The “Data Deluge” presents algorithmic challenges when combined with large data sets
We need to design algorithms that are able to perform in a distributed fashion
MapReduce only fits certain types of algorithms
21. Distributed Learning Strategies
McDonald, 2010: “Distributed Training Strategies for the Structured Perceptron”
Langford, 2007: Vowpal Wabbit
Jeff Dean’s work on parallel SGD: DownPour SGD, Sandblaster
23. YARN
Yet Another Resource Negotiator
Framework for scheduling distributed applications
Allows for any type of parallel application to run natively on Hadoop
MRv2 is now a distributed application
(Diagram: Clients submit jobs to the Resource Manager; Node Managers host App Masters and Containers; arrows show Job Submission, Node Status, Resource Request, and MapReduce Status)
24. IterativeReduce
Designed specifically for parallel iterative algorithms on Hadoop
Implemented directly on top of YARN
Intrinsic parallelism
Easier to focus on the problem, not on the distributed application part
26. SGD Master
Collects all parameter vectors at each pass / superstep
Produces new global parameter vector by averaging workers’ vectors
Sends update to all workers
Workers replace local parameter vector with new global parameter vector
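The master's job at each superstep is just element-wise averaging of the workers' parameter vectors. A sketch of that step, with illustrative names (this is not Metronome's actual master class):

```java
// Sketch of the master's superstep: average the workers' parameter
// vectors into the new global parameter vector.
public class SgdMaster {
    static double[] average(double[][] workerVectors) {
        double[] global = new double[workerVectors[0].length];
        for (double[] w : workerVectors)                          // sum all worker vectors
            for (int i = 0; i < global.length; i++) global[i] += w[i];
        for (int i = 0; i < global.length; i++)
            global[i] /= workerVectors.length;                    // divide by worker count
        return global;
    }

    public static void main(String[] args) {
        double[][] partials = {{1.0, 4.0}, {3.0, 2.0}};
        double[] global = average(partials);
        System.out.println(global[0] + " " + global[1]);  // 2.0 3.0
    }
}
```

The averaged vector is then broadcast back, and each worker replaces its local vector with it before the next pass.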
27. SGD Worker
Each given a split of the total dataset (similar to a map task)
Performs local SGD pass
Local parameter vector sent to master at superstep
Stays active/resident between iterations
28. SGD: Serial vs Parallel
(Diagram: Training Data is partitioned into Split 1 … Split N; Worker 1 … Worker N each train a Partial Model on their split; the Master combines the partial models into the Global Model)
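The whole master/worker scheme can be simulated in a single process: split the data, run each "worker's" local SGD pass from the current global vector, and let the "master" average the partial models each superstep. This mimics the parameter-averaging idea of the slides; it is my simulation, not Metronome's implementation, and all names and toy data are illustrative.

```java
import java.util.Arrays;

// Single-process simulation of parameter-averaged parallel SGD:
// workers do local passes over their splits, the master averages.
public class ParallelSgdSimulation {
    // One worker's local SGD pass, starting from the shared global vector
    static double[] localSgdPass(double[] start, double[][] xs, double[] ys, double lr) {
        double[] w = Arrays.copyOf(start, start.length);
        for (int n = 0; n < xs.length; n++) {
            double err = -ys[n];
            for (int i = 0; i < w.length; i++) err += w[i] * xs[n][i];  // prediction - target
            for (int i = 0; i < w.length; i++) w[i] -= lr * err * xs[n][i];
        }
        return w;
    }

    // One superstep: every worker trains a partial model, the master averages them
    static double[] superstep(double[] global, double[][][] splitsX, double[][] splitsY, double lr) {
        double[] next = new double[global.length];
        for (int s = 0; s < splitsX.length; s++) {
            double[] partial = localSgdPass(global, splitsX[s], splitsY[s], lr);
            for (int i = 0; i < next.length; i++) next[i] += partial[i];
        }
        for (int i = 0; i < next.length; i++) next[i] /= splitsX.length;
        return next;
    }

    public static void main(String[] args) {
        // Two splits of data drawn from y = 2x (x0 = 1 intercept input, true intercept 0)
        double[][][] splitsX = {{{1, 1}, {1, 2}}, {{1, 3}, {1, 4}}};
        double[][] splitsY = {{2, 4}, {6, 8}};
        double[] global = new double[2];
        for (int pass = 0; pass < 3000; pass++)
            global = superstep(global, splitsX, splitsY, 0.02);
        System.out.printf("intercept=%.3f slope=%.3f%n", global[0], global[1]);
    }
}
```

In the real system each split is a map-task-sized partition on HDFS, the workers stay resident between supersteps, and only the parameter vectors move over the network.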
29. Parallel Linear Regression with IterativeReduce
Based directly on work we did with Knitting Boar (parallel logistic regression)
Scales linearly with input size
Can produce a linear regression model off large amounts of data
Packaged in a new suite of parallel iterative algorithms called Metronome
100% Java, ASF 2.0 licensed, on GitHub
30. Unit Testing and IRUnit
Simulates the IterativeReduce parallel framework
Uses the same app.properties file that YARN applications do
Examples:
https://github.com/jpatanooga/Metronome/blob/master/src/test/java/tv/floe/metronome/linearregression/iterativereduce/TestSimulateLinearRegressionIterativeReduce.java
https://github.com/jpatanooga/KnittingBoar/blob/master/src/test/java/com/cloudera/knittingboar/sgd/iterativereduce/TestKnittingBoar_IRUnitSim.java
32. Running the Job via YARN
Build with Maven
Copy jar to a host with cluster access
Copy dataset to HDFS
Run job:
yarn jar iterativereduce-0.1-SNAPSHOT.jar app.properties
33. Results
(Chart: Linear Regression - Parallel vs Serial; y-axis: Total Processing Time, 0–200; x-axis: Megabytes Processed Total, 64 to 320; series: Parallel Runs, Serial Runs)
34. Lessons Learned
Linear scale continues to be achieved with parameter averaging variations
Tuning is critical: need to be good at selecting a learning rate
YARN still experimental, has caveats; container allocation is still slow
Metronome continues to be experimental
35. Special Thanks
Michael Katzenellenbollen
Dr. James Scott, University of Texas at Austin
Dr. Jason Baldridge, University of Texas at Austin
36. Future Directions
More testing, stability
Cache vectors in memory for speed
Metronome:
Take on properties of LibLinear
Pluggable optimization, general linear models
YARN-centric, first-class Hadoop citizen
Focus on being a complement to Mahout
K-means, PageRank implementations
38. References
1. http://www.infoworld.com/d/business-intelligence/gartner-hadoop-will-be-in-two-thirds-of-advanced-analytics-products-2015-211475
2. https://cwiki.apache.org/MAHOUT/logistic-regression.html
3. MapReduce is Good Enough? If All You Have is a Hammer, Throw Away Everything That’s Not a Nail! http://arxiv.org/pdf/1209.2191.pdf
Editor's Notes
#9: Reference some thoughts on attribution pipelines
#16: Talk about how you normally would use the Normal equation, notes from Andrew Ng
#18: “Unlikely optimization algorithms such as stochastic gradient descent show amazing performance for large-scale problems.” (Bottou, 2010) SGD has been around for decades, yet recently Langford, Bottou, and others have shown impressive speed increases. SGD has been shown to train multiple orders of magnitude faster than batch-style learners with no loss in model accuracy.
#19: Same note as #18.
#20: The most important additions in Mahout’s SGD are: confidence-weighted learning rates per term; evolutionary tuning of hyper-parameters; mixed ranking and regression; grouped AUC. The implication of it being local is that you are limited to the compute capacity of the local machine, as opposed to even a single machine on the cluster.
#21: At current disk bandwidth and capacity (2TB at 100MB/s throughput), it takes about 6 hours to read the contents of a single hard drive.
#23: Benefits of data flow: the runtime can decide where to run tasks and can automatically recover from failures. Acyclic data flow is a powerful abstraction, but is not efficient for applications that repeatedly reuse a working set of data, such as iterative algorithms (many in machine learning). No single programming model or framework can excel at every problem; there are always tradeoffs between simplicity, expressivity, fault tolerance, performance, etc.
#25: Performance still largely dependent on implementation of algo
#29: POLR: Parallel Online Logistic Regression. Talking points: we wanted to start with a tool known to the Hadoop community, with expected characteristics; Mahout’s SGD is well known, and so we used that as a base point.