This presentation describes the layers and open source components that can be used to design and implement a lambda architecture supporting batch processing for model training and streaming for prediction.
2. Overview
“… and the wise man said, thou shalt embrace open source.”
21st century proverb
Patrick R. Nicolas - Open Source 𝜆 -Architecture for Deep Learning
4. Overview
The world of data scientists accustomed to Python scientific libraries has been shaken up by the emergence of ‘big data’ frameworks such as Apache Hadoop, Spark and Kafka.
This presentation introduces a variant of the 𝜆 architecture and describes the seamless integration of various open source components to train, validate and test deep learning models.
5. Disclaimer
The concept and architecture are versatile enough to accommodate a variety of open source and commercial solutions and services besides the frameworks prescribed in this presentation.
For instance, deep learning frameworks such as Keras or TensorFlow are excellent alternatives to PyTorch.
6. Requirements
A ‘big data’ framework should be able to:
• Process batch and stream data, concurrently
• Enforce data immutability
• Recover gracefully from human errors
• Handle hardware failures
• Minimize latency for real-time requests
• Scale for very large data sets
• Optimize the full lifecycle of data sets
• Guarantee quality and integrity of data
7. Optimizing data life cycle
The need for optimizing the data life cycle: 79% of data scientists’ time is spent collecting and organizing data.
Source: Quora
8. Data quality
Guaranteeing data quality and integrity:
Accuracy: Correct models and representative data.
Completeness: No missing data
Consistency: Applies to both semantics and format
Timeliness: Up-to-date data and notification
Accessibility: Ease of use and high availability
Validity: Complies with constraints, rules and regulations
9. Solution …
The 𝜆-architecture is a large-scale data processing architecture that balances batch and real-time streamed data.
It is a one-stop shop for various data sources, balancing latency, redundancy, ease of access and throughput.
It breaks down into 3 layers:
• Speed (streaming, real-time, …)
• Batch (training, analysis, …)
• Serving (query, visualization, …)
10. … using open source
𝜆 architecture using open source components?
The task consists of reviewing and evaluating the trove of available open source libraries to build a robust architecture that supports the rigor of training and tuning deep learning models.
The libraries are woven together through a set of language-agnostic REST APIs to form a coherent pipeline.
11. … for deep learning
𝜆 architecture for deep learning?
• Python scientific libraries have been the go-to tools for data scientists to analyze data and build models.
• The PyTorch framework builds on these libraries to support the design and execution of deep learning models.
• Apache Spark and Kafka complement these frameworks for very large data sets and real-time processing.
12. Bird’s-eye view
Example open source 𝜆 architecture [diagram]
Feel overwhelmed? … Let’s break it down
14. Batch layer
Batch layer objective: load batches of data to be distributed and preprocessed in order to train deep learning models.
15. Batch layer
Typical use case (sketched in code below):
1. Apache Spark loads the training set from Amazon S3
2. The Spark master partitions the training data
3. Spark workers preprocess the data and notify completion through a Kafka event queue
4. PyTorch updates model parameters from the preprocessed training data
5. PyTorch broadcasts model parameters and quality metrics through Kafka
6. Apache Hive, powered by Spark, stores model-related data and metrics
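A minimal, hypothetical sketch of steps 1–3 and the Kafka completion event, written in PySpark with the kafka-python client. The S3 paths, topic name and column names are illustrative assumptions, not part of the original architecture.

```python
import json
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from kafka import KafkaProducer  # kafka-python client

spark = SparkSession.builder.appName("batch-preprocessing").getOrCreate()

# Steps 1-2: load the raw training set from S3 and partition it across workers
raw = spark.read.parquet("s3a://my-bucket/training/raw/").repartition(64)

# Step 3: preprocess (drop incomplete rows, normalize a numeric feature) and persist
stats = raw.agg(F.mean("feature"), F.stddev("feature")).first()
prepared = (raw.dropna()
               .withColumn("feature", (F.col("feature") - stats[0]) / stats[1]))
prepared.write.mode("overwrite").parquet("s3a://my-bucket/training/prepared/")

# Notify the PyTorch training stage through a Kafka event
producer = KafkaProducer(bootstrap_servers="kafka:9092",
                         value_serializer=lambda v: json.dumps(v).encode("utf-8"))
producer.send("batch-preprocessing-done",
              {"path": "s3a://my-bucket/training/prepared/", "rows": prepared.count()})
producer.flush()
```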
16. Speed layer
Speed layer objective: process queries to predictive models with very low latency.
17. Speed layer
Use case (sketched in code below):
1. Kafka routes data streams to the Spark master
2. Spark pre-processes requests and forwards them to the deep-model micro-service
3. Flask converts each request into a prediction query against the PyTorch model
4. The PyTorch model generates a prediction
5. Run-time metrics are broadcast through Kafka
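A minimal, hypothetical sketch of steps 1–2: Spark Structured Streaming consumes prediction requests from Kafka and forwards each micro-batch to the Flask model service. The topic name, service URL and payload schema are assumptions.

```python
import json
import requests
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("speed-layer").getOrCreate()

# Step 1: Kafka routes the request stream to Spark
request_stream = (spark.readStream.format("kafka")
                  .option("kafka.bootstrap.servers", "kafka:9092")
                  .option("subscribe", "prediction-requests")
                  .load()
                  .selectExpr("CAST(value AS STRING) AS payload"))

# Step 2: pre-process each micro-batch and forward it to the model micro-service
def forward(batch_df, batch_id):
    for row in batch_df.collect():  # small micro-batches assumed
        features = json.loads(row.payload)
        requests.post("http://model-service:5000/predict", json=features, timeout=1.0)

query = request_stream.writeStream.foreachBatch(forward).start()
query.awaitTermination()
```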
18. Serving layer
Serving layer objective: process queries to analyze data and model performance, and to execute statistical inference.
19. Serving layer
Use case (sketched in code below):
1. An analyst queries the relational database (MySQL) for the most recent data and statistics using the FineReport UI (low latency)
2. An analyst asynchronously queries the Hive data warehouse for archived data and statistics (high latency)
3. Hive processes queries through Spark datasets
4. Spark regularly updates the MySQL short-term data
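A minimal, hypothetical sketch of steps 2–4: Spark queries the Hive warehouse for archived model metrics and refreshes a short-term MySQL table over JDBC. Database, table and credential values are illustrative assumptions.

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder.appName("serving-layer")
         .enableHiveSupport()
         .getOrCreate())

# Steps 2-3: Hive processes the analyst's query through Spark datasets
recent_metrics = spark.sql("""
    SELECT model_id, DATE(created_at) AS day, AVG(accuracy) AS avg_accuracy
    FROM model_metrics
    WHERE created_at >= DATE_SUB(CURRENT_DATE, 30)
    GROUP BY model_id, DATE(created_at)
""")

# Step 4: Spark periodically refreshes the MySQL table backing the dashboard
(recent_metrics.write.format("jdbc")
    .option("url", "jdbc:mysql://mysql:3306/analytics")
    .option("dbtable", "recent_model_metrics")
    .option("user", "report_user")
    .option("password", "********")
    .mode("overwrite")
    .save())
```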
21. PyTorch
PyTorch is an optimized tensor library for deep learning using GPUs and CPUs.
It extends the functionality of NumPy and scikit-learn to support the training, evaluation and commercialization of complex machine learning models.
https://pytorch.org/tutorials/
Alternatives:
TensorFlow: https://www.tensorflow.org/
Keras: https://keras.io
MXNet: https://mxnet.apache.org
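A minimal sketch of a PyTorch training loop such as the one driven by the batch layer; the model dimensions and the random stand-in data are assumptions.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 1))
loss_fn = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# Stand-in for features/labels read from the prepared training data
features = torch.randn(1024, 16)
labels = torch.randn(1024, 1)

for epoch in range(10):
    optimizer.zero_grad()
    loss = loss_fn(model(features), labels)
    loss.backward()
    optimizer.step()
    print(f"epoch {epoch}: loss={loss.item():.4f}")
```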
22. Apache Spark
Apache Spark is an open source cluster computing framework for fast, real-time processing.
It supports the Scala, Java, Python and R programming languages and includes streaming, graph and machine learning libraries.
https://www.scala-lang.org
https://spark.apache.org
Alternative:
PySpark: https://databricks.com/glossary/pyspark
23. Streaming
Apache Kafka is an open-source distributed event streaming framework for large-scale, real-time data processing and analytics.
It captures data from various sources in real time as a continuous flow and routes it to the appropriate processor.
https://kafka.apache.org
Alternatives:
Amazon SQS: https://aws.amazon.com/sqs/
RabbitMQ: https://www.rabbitmq.com
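A minimal sketch of publishing and consuming model-quality metrics with the kafka-python client, as the batch and speed layers do in the earlier use cases; the topic name and metric fields are assumptions.

```python
import json
from kafka import KafkaProducer, KafkaConsumer

# Publish a quality metric event (e.g. from the training or prediction stage)
producer = KafkaProducer(bootstrap_servers="kafka:9092",
                         value_serializer=lambda v: json.dumps(v).encode("utf-8"))
producer.send("model-metrics", {"model_id": "demo", "accuracy": 0.93, "latency_ms": 12})
producer.flush()

# A monitoring service can subscribe to the same topic
consumer = KafkaConsumer("model-metrics",
                         bootstrap_servers="kafka:9092",
                         auto_offset_reset="earliest",
                         consumer_timeout_ms=5000,
                         value_deserializer=lambda v: json.loads(v.decode("utf-8")))
for message in consumer:
    print(message.value)
```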
24. Model tuning
Ray Tune is a distributed hyper-parameter tuning framework particularly suitable for deep learning models.
It significantly reduces the cost of optimizing the configuration of a model. It is a wrapper around other open source libraries.
https://docs.ray.io/en/master/tune/index.html
Alternatives:
Amazon SageMaker: https://aws.amazon.com/sagemaker/
HyperOpt: https://github.com/hyperopt/hyperopt
Optuna: https://optuna.readthedocs.io
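A minimal sketch of hyper-parameter tuning with Ray Tune's classic tune.run/tune.report API; the search space and the toy objective are assumptions, not a prescribed configuration.

```python
import torch
import torch.nn as nn
from ray import tune

def train_model(config):
    # Small PyTorch model whose learning rate and hidden size are tuned
    model = nn.Sequential(nn.Linear(16, config["hidden"]), nn.ReLU(),
                          nn.Linear(config["hidden"], 1))
    optimizer = torch.optim.Adam(model.parameters(), lr=config["lr"])
    features, labels = torch.randn(256, 16), torch.randn(256, 1)
    for _ in range(20):
        optimizer.zero_grad()
        loss = nn.functional.mse_loss(model(features), labels)
        loss.backward()
        optimizer.step()
    tune.report(loss=loss.item())  # metric used to rank trials

analysis = tune.run(
    train_model,
    config={"lr": tune.loguniform(1e-4, 1e-1), "hidden": tune.choice([16, 32, 64])},
    num_samples=8)
print(analysis.get_best_config(metric="loss", mode="min"))
```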
25. Python REST service
Flask is an easy-to-use implementation of the RESTful interface for Python applications.
It supports most web and deployment standards such as Docker, React.js, Angular, HTML5 and WSGI containers.
https://palletsprojects.com/p/flask/
Alternatives:
Falcon: https://falcon.readthedocs.io
FastAPI: https://fastapi.tiangolo.com
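A minimal, hypothetical sketch of the speed layer's Flask micro-service: a /predict endpoint that converts a JSON request into a PyTorch prediction. The route, feature count and in-memory stand-in model are assumptions.

```python
import torch
import torch.nn as nn
from flask import Flask, jsonify, request

app = Flask(__name__)

# Stand-in for the trained model exported by the batch layer
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 1))
model.eval()

@app.route("/predict", methods=["POST"])
def predict():
    features = torch.tensor(request.json["features"], dtype=torch.float32)
    with torch.no_grad():
        prediction = model(features.unsqueeze(0)).item()
    return jsonify({"prediction": prediction})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)
```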
26. RDBMS
MySQL is an open source relational database supporting partitioning, sharding and replication.
It can be extended with real-time analytics (HeatWave) and enterprise clustering (CGE).
https://www.mysql.com
Alternatives:
PostgreSQL: https://www.postgresql.org
HyperSQL: http://www.hsqldb.org
Amazon RDS: https://aws.amazon.com/rds
27. Data warehouse
Apache Hive is a data warehouse framework that leverages Spark to execute highly distributed SQL queries.
It optimizes SQL queries through lazy evaluation of an acyclic execution graph. It is integrated with Spark datasets and HDFS.
https://hive.apache.org
Alternatives:
Vertica: http://www.vertica.com
Amazon Redshift: https://aws.amazon.com/redshift/
28. Dashboard
FineReport is a business intelligence and dashboard tool that supports real-time analytics, reporting and visualization.
It accommodates the needs of business managers and data scientists.
https://www.finereport.com
Alternatives:
Sisense: https://www.sisense.com
Tableau: https://www.tableau.com
29. Final disclaimer
This presentation is not an endorsement of the various tools, libraries or frameworks described or suggested in these slides.
Although the tools listed in the slides are known to work in the context of the architecture, there are excellent alternative libraries that may better meet your specific needs.
30. Thank you!
Q&A