Spark is a fast and general engine for large-scale data processing. It provides an interface called resilient distributed datasets (RDDs) that allow data to be distributed in memory across clusters and manipulated using parallel operations. Shark is a system built on Spark that allows running SQL queries over large datasets using Spark's speed and generality. The document discusses Spark and Shark's performance advantages over Hadoop for iterative and interactive applications.
Spark & Spark Streaming Internals - Nov 15 (1) by Akhil Das
This document summarizes Spark and Spark Streaming internals. It discusses the Resilient Distributed Dataset (RDD) model in Spark, which allows for fault tolerance through lineage-based recomputation. It provides an example of log mining using RDD transformations and actions. It then discusses Spark Streaming, which provides a simple API for stream processing by treating streams as series of small batch jobs on RDDs. Key concepts discussed include Discretized Stream (DStream), transformations, and output operations. An example Twitter hashtag extraction job is outlined.
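To make the DStream concepts above concrete, here is a minimal Scala sketch of a hashtag-count job; it assumes the spark-streaming-twitter external package and pre-configured Twitter OAuth credentials, and the object name and batch interval are placeholders rather than details from the slides:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.twitter.TwitterUtils

object HashTagSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("HashTagSketch")
    val ssc  = new StreamingContext(conf, Seconds(10))        // each 10s batch becomes one small RDD job

    val tweets   = TwitterUtils.createStream(ssc, None)       // DStream of tweet statuses (credentials assumed configured)
    val hashTags = tweets.flatMap(status => status.getText.split(" "))
                         .filter(_.startsWith("#"))           // transformations applied to every micro-batch

    hashTags.countByValue().print()                           // output operation: triggers execution per batch

    ssc.start()
    ssc.awaitTermination()
  }
}

Each batch interval runs as a small Spark job over an RDD, which is how Spark Streaming reuses the same lineage-based fault tolerance as batch Spark.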
In this talk, we present two emerging, popular open source projects: Spark and Shark. Spark is an open source cluster computing system that aims to make data analytics fast — both fast to run and fast to write. It outperforms Hadoop by up to 100x in many real-world applications. Spark programs are often much shorter than their MapReduce counterparts thanks to its high-level APIs and language integration in Java, Scala, and Python. Shark is an analytic query engine built on top of Spark that is compatible with Hive. It can run Hive queries much faster over existing Hive warehouses without modification.
These systems have been adopted by many organizations large and small (e.g. Yahoo, Intel, Adobe, Alibaba, Tencent) to implement data intensive applications such as ETL, interactive SQL, and machine learning.
This document introduces Spark SQL and the Catalyst query optimizer. It explains how Spark SQL executes SQL on Spark, builds SchemaRDDs, and optimizes query execution plans. It then provides details on how Catalyst works, including its use of logical expressions, operators, and rules to transform query trees and optimize queries. Finally, it outlines some interesting open issues and how to contribute to Spark SQL's development.
This deep dive attempts to "de-mystify" Spark by touching on some of the main design philosophies and diving into some of the more advanced features that make it such a flexible and powerful cluster computing framework. It will touch on some common pitfalls and attempt to build some best practices for building, configuring, and deploying Spark applications.
Resilient Distributed Datasets - Apache Spark by Taposh Roy
RDDs (Resilient Distributed Datasets) provide a fault-tolerant abstraction for data reuse across jobs in distributed applications. They allow data to be persisted in memory and manipulated using transformations like map and filter. This enables efficient processing of iterative algorithms. RDDs achieve fault tolerance by logging the transformations used to build a dataset rather than the actual data, enabling recovery of lost partitions through recomputation.
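As an illustration of that model, here is a minimal spark-shell style sketch; the HDFS path, the ERROR/timeout strings, and the assumption that `sc` already exists are placeholders rather than details from the paper:

val lines  = sc.textFile("hdfs:///logs/app.log")              // base RDD, read lazily from storage
val errors = lines.filter(_.contains("ERROR"))                // transformation: recorded as lineage, not yet executed

errors.persist()                                              // request in-memory persistence for reuse
val total    = errors.count()                                 // action: materializes and caches the partitions
val timeouts = errors.filter(_.contains("timeout")).count()   // reuses the cache; a lost partition is rebuilt from lineage

Only the lineage (textFile -> filter) is logged, so recovering a failed partition means re-running those transformations on the corresponding input split rather than restoring replicated data.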
This presentation shows the main Spark characteristics, like RDDs, transformations, and actions.
I used this presentation for many Spark intro workshops of the Cluj-Napoca Big Data community: http://www.meetup.com/Big-Data-Data-Science-Meetup-Cluj-Napoca/
Spark DataFrames provide a more optimized way to work with structured data compared to RDDs. DataFrames allow skipping unnecessary data partitions when querying, such as only reading data partitions that match certain criteria like date ranges. DataFrames also integrate better with storage formats like Parquet, which stores data in a columnar format and allows skipping unrelated columns during queries to improve performance. The code examples demonstrate loading a CSV file into a DataFrame, finding and removing duplicate records, and counting duplicate records by key to identify potential duplicates.
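A hedged Scala sketch of that kind of workflow is shown below; the file path, the header option, and the "customer_id" key column are assumptions, not details taken from the document:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("DedupSketch").getOrCreate()

val df = spark.read.option("header", "true").csv("/data/customers.csv")    // load CSV into a DataFrame

// Count records per key to identify potential duplicates.
val dupCounts = df.groupBy("customer_id").count().filter("count > 1")
dupCounts.show()

// Drop duplicate records keyed on the id column.
val deduped = df.dropDuplicates(Seq("customer_id"))
println(s"rows before: ${df.count()}, after: ${deduped.count()}")

// Writing out as Parquet gives later queries columnar storage and column pruning.
deduped.write.mode("overwrite").parquet("/data/customers_parquet")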
Talk given at ClojureD conference, Berlin
Apache Spark is an engine for efficiently processing large amounts of data. We show how to apply the elegance of Clojure to Spark - fully exploiting the REPL and dynamic typing. There will be live coding using our gorillalabs/sparkling API.
In the presentation, we will of course introduce the core concepts of Spark, like resilient distributed datasets (RDDs). You will also learn how Spark's concepts resemble those well known from Clojure, like persistent data structures and functional programming.
Finally, we will provide some Do’s and Don’ts for you to kick off your Spark program based upon our experience.
About Paulus Esterhazy and Christian Betz
Being a LISP hacker for several years, and a Java-guy for some more, Chris turned to Clojure for production code in 2011. He’s been Project Lead, Software Architect, and VP Tech in the meantime, interested in AI and data-visualization.
Now, working on the heart of data driven marketing for Performance Media in Hamburg, he turned to Apache Spark for some Big Data jobs. Chris released the API-wrapper ‘chrisbetz/sparkling’ to fully exploit the power of his compute cluster.
Paulus Esterhazy
Paulus is a philosophy PhD turned software engineer with an interest in functional programming and a penchant for hammock-driven development.
He currently works as Senior Web Developer at Red Pineapple Media in Berlin.
Apache Drill (http://incubator.apache.org/drill/) is a distributed system for interactive analysis of large-scale datasets, inspired by Google’s Dremel technology. It is designed to scale to thousands of servers and to process petabytes of data in seconds. Since its inception in mid-2012, Apache Drill has gained widespread interest in the community, attracting hundreds of interested individuals and companies. In the talk we discuss how Apache Drill enables ad-hoc interactive queries at scale, walking through typical use cases and delving into Drill's architecture, data flow, query languages, and supported data sources.
Spark is a general engine for large-scale data processing. It introduces Resilient Distributed Datasets (RDDs) which allow in-memory caching for fault tolerance and act like familiar Scala collections for distributed computation across clusters. RDDs provide a programming model with transformations like map and reduce and actions to compute results. Spark also supports streaming, SQL, machine learning, and graph processing workloads.
Data Discovery on Hadoop @ Yahoo! - Hadoop Summit 2014 by thiruvel
This document discusses data discovery on Hadoop using Apache HCatalog. It describes how HCatalog provides a common interface for data access across Hadoop tools like Hive, Pig, and MapReduce. HCatalog allows users to register metadata for tables and partitions stored on Hadoop, enabling data discovery and access without needing to know the physical storage details. The document outlines how HCatalog is used at Yahoo to provide interoperability, notifications, and integration with data management platforms.
Hadoop Summit San Jose 2014: Data Discovery on Hadoop by Sumeet Singh
In the last eight years, the Hadoop grid infrastructure has allowed us to move towards a unified source of truth for all data at Yahoo that now accounts for over 450 petabytes of raw HDFS storage and 1.1 billion data files. Managing data location, schema knowledge and evolution, fine-grained business-rule-based access control, and audit and compliance needs have become critical with the increasing scale of operations.
In this talk, we will share our approach in tackling the above challenges with Apache HCatalog, a table and storage management layer for Hadoop. We will explain how to register existing HDFS files into HCatalog, provide broader but controlled access to data through a data discovery tool, and leverage existing Hadoop ecosystem components like Pig, Hive, HBase and Oozie to seamlessly share data across applications. Integration with data movement tools automates the availability of new data into HCatalog. In addition, the approach allows ever-improving Hive performance to open up easy ad hoc access to analyze and visualize data through SQL on Hadoop and popular BI tools.
As we discuss our approach, we will also highlight how it minimizes data duplication, eliminates wasteful data retention, and solves for data provenance, lineage and integrity.
Hive is a data warehouse system built on top of Hadoop that allows users to query large datasets using SQL. It is used at Facebook to manage over 15TB of new data added daily across a 300+ node Hadoop cluster. Key features include using SQL for queries, extensibility through custom functions and file formats, and optimizations for performance like predicate pushdown and partition pruning.
Apache Spark - Dataframes & Spark SQL - Part 1 | Big Data Hadoop Spark Tutori... by CloudxLab
Big Data with Hadoop & Spark Training: http://bit.ly/2sf2z6i
This CloudxLab Introduction to Spark SQL & DataFrames tutorial helps you to understand Spark SQL & DataFrames in detail. Below are the topics covered in this slide:
1) Introduction to DataFrames
2) Creating DataFrames from JSON
3) DataFrame Operations
4) Running SQL Queries Programmatically
5) Datasets
6) Inferring the Schema Using Reflection
7) Programmatically Specifying the Schema
Shark is a new data analysis system that marries SQL queries with complex analytics like machine learning on large clusters. It uses Spark as an execution engine and provides in-memory columnar storage with extensions like partial DAG execution and co-partitioning tables to optimize query performance. Shark also supports expressing machine learning algorithms in SQL to avoid moving data out of the database. It aims to efficiently support both SQL and complex analytics while retaining fault tolerance and letting users choose to load frequently used data into memory for fast queries.
This document provides an overview of streaming architectures and compares different streaming engines. It describes a common streaming architecture that ingests data from sources like logs and sockets into Kafka for storage and organization. It then discusses several streaming engines like Apache Beam, Apache Flink, and Akka Streams that can process the data from Kafka. Beam provides the most sophisticated streaming semantics to handle scenarios like late data arrival, while Flink offers low latency processing at large scales and can run Beam data flows. Akka Streams is best suited for complex event processing on individual events.
Presentation slides for the paper on Resilient Distributed Datasets, written by Matei Zaharia et al. at the University of California, Berkeley.
The paper is not my work.
These slides were made for the course on Advanced Distributed Systems held by Prof. Bratsberg at NTNU (Norwegian University of Science and Technology, Trondheim, Norway).
This document provides an overview of Spark SQL and its architecture. Spark SQL allows users to run SQL queries over SchemaRDDs, which are RDDs with a schema and column names. It introduces a SQL-like query abstraction over RDDs and allows querying data in a declarative manner. The Spark SQL component consists of Catalyst, a logical query optimizer, and execution engines for different data sources. It can integrate with data sources like Parquet, JSON, and Cassandra.
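For reference, here is a minimal Spark 1.x-era sketch of that SchemaRDD abstraction; the JSON path and the name/age fields are assumptions, and later Spark releases renamed SchemaRDD to DataFrame:

import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)

val people = sqlContext.jsonFile("/data/people.json")         // SchemaRDD: an RDD plus schema and column names
people.registerTempTable("people")                            // make it visible to SQL

val adults = sqlContext.sql("SELECT name, age FROM people WHERE age >= 18")
adults.collect().foreach(println)                             // Catalyst optimizes the logical plan before execution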
OCF.tw's talk about "Introduction to spark" by Giivee The
A talk about Spark, shared at the invitation of OCF and OSSF.
If you have any interest in the Open Culture Foundation (OCF) or the Open Source Software Foundry (OSSF),
please check http://ocf.tw/ or http://www.openfoundry.org/
Thanks also to CLBC for providing the venue.
If you would like to work in a great working environment,
feel free to contact CLBC: http://clbc.tw/
This document provides an overview of Apache Spark, including how it compares to Hadoop, the Spark ecosystem, Resilient Distributed Datasets (RDDs), transformations and actions on RDDs, the directed acyclic graph (DAG) scheduler, Spark Streaming, and the DataFrames API. Key points covered include Spark's faster performance versus Hadoop through its use of memory instead of disk, the RDD abstraction for distributed collections, common RDD operations, and Spark's capabilities for real-time streaming data processing and SQL queries on structured data.
This document provides a summary of existing big data tools. It outlines the layered architecture of these tools, including layers for resource management, file systems, data processing frameworks, machine learning libraries, NoSQL databases and more. It also describes several common data processing models (e.g. MapReduce, DAG, graph processing) and specific tools that use each model (e.g. Hadoop for MapReduce, Spark for DAG). Examples of code for PageRank and broadcasting data in the Harp framework are also provided.
Report on Aadhaar analysis using big data, Hadoop and Hive by siddharthboora
This document describes using Hadoop and Hive to analyze an Aadhaar dataset. The key steps taken were (a hedged Spark-based sketch of these steps follows the list):
1. Transferring the CSV file from the local system to HDFS using Hadoop.
2. Creating a database and table in Hive to store the data.
3. Loading the data from HDFS into the Hive table.
4. Performing analyses on the data in Hive such as finding the number of Aadhaars generated by state, gender, and district.
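A hedged sketch of the same steps, expressed through Spark's HiveContext so it stays in the Scala used elsewhere in this document; the database, table, column names, and HDFS path are assumptions rather than values from the report:

import org.apache.spark.sql.hive.HiveContext

val hive = new HiveContext(sc)

// Step 2: create a database and table in Hive for the data.
hive.sql("CREATE DATABASE IF NOT EXISTS aadhaar")
hive.sql("""CREATE TABLE IF NOT EXISTS aadhaar.enrolments (
              state STRING, district STRING, gender STRING, generated INT)
            ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE""")

// Step 3: load the CSV that was previously copied to HDFS (step 1).
hive.sql("LOAD DATA INPATH '/user/hadoop/aadhaar.csv' INTO TABLE aadhaar.enrolments")

// Step 4: example analysis, Aadhaars generated per state.
hive.sql("SELECT state, SUM(generated) AS total FROM aadhaar.enrolments GROUP BY state").show()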
This document discusses Spark, an open-source cluster computing framework. It provides a brief history of Spark, describing how it generalized MapReduce to support more types of applications. Spark allows for batch, interactive, and real-time processing within a single framework using Resilient Distributed Datasets (RDDs) and a logical plan represented as a directed acyclic graph (DAG). The document also discusses how Spark can be used for applications like machine learning via MLlib, graph processing with GraphX, and streaming data with Spark Streaming.
How Apache Spark fits into the Big Data landscape by Paco Nathan
How Apache Spark fits into the Big Data landscape http://www.meetup.com/Washington-DC-Area-Spark-Interactive/events/217858832/
2014-12-02 in Herndon, VA and sponsored by Raytheon, Tetra Concepts, and MetiStream
How Apache Spark fits into the Big Data landscape by Paco Nathan
Boulder/Denver Spark Meetup, 2014-10-02 @ Datalogix
http://www.meetup.com/Boulder-Denver-Spark-Meetup/events/207581832/
Apache Spark is intended as a general purpose engine that supports combinations of Batch, Streaming, SQL, ML, Graph, etc., for apps written in Scala, Java, Python, Clojure, R, etc.
This talk provides an introduction to Spark — how it provides so much better performance, and why — and then explores how Spark fits into the Big Data landscape — e.g., other systems with which Spark pairs nicely — and why Spark is needed for the work ahead.
The document provides an overview of Apache Spark, including its history and key capabilities. It discusses how Spark was developed in 2009 at UC Berkeley and later open sourced, and how it has since become a major open-source project for big data. The document summarizes that Spark provides in-memory performance for ETL, storage, exploration, analytics and more on Hadoop clusters, and supports machine learning, graph analysis, and SQL queries.
Apache Spark and Hadoop are frameworks for distributed data processing. Spark can be used for batch processing, streaming, and machine learning. It improves on MapReduce by keeping data in memory between jobs. The document provides an overview of Spark and its components, use cases like streaming data analysis and machine learning, and how it compares to Hadoop MapReduce. Real-world examples of Spark usage at companies like Uber and Pinterest are also discussed.
Apache Hadoop and Spark: Introduction and Use Cases for Data Analysis by Trieu Nguyen
This document provides an introduction to Apache Hadoop and Spark for data analysis. It discusses the growth of big data from sources like the internet, science, and IoT. Hadoop is introduced as providing scalability on commodity hardware to handle large, diverse data types with fault tolerance. Key Hadoop components are HDFS for storage, MapReduce for processing, and HBase for non-relational databases. Spark is presented as improving on MapReduce by using in-memory computing for iterative jobs like machine learning. Real-world use cases of Spark at companies like Uber, Pinterest, and Netflix are briefly described.
Here are the steps to complete the assignment:
1. Create RDDs to filter each file for lines containing "Spark":
val readme = sc.textFile("README.md").filter(_.contains("Spark"))
val changes = sc.textFile("CHANGES.txt").filter(_.contains("Spark"))
2. Perform WordCount on each:
val readmeCounts = readme.flatMap(_.split(" ")).map((_,1)).reduceByKey(_ + _)
val changesCounts = changes.flatMap(_.split(" ")).map((_,1)).reduceByKey(_ + _)
3. Join the two RDDs:
val joined = readmeCounts.join(changesCounts)
Big Data Everywhere Chicago: Apache Spark Plus Many Other Frameworks -- How S... by BigDataEverywhere
Paco Nathan, Director of Community Evangelism at Databricks
Apache Spark is intended as a fast and powerful general purpose engine for processing Hadoop data. Spark supports combinations of batch processing, streaming, SQL, ML, Graph, etc., for applications written in Scala, Java, Python, Clojure, and R, among others. In this talk, I'll explore how Spark fits into the Big Data landscape. In addition, I'll describe other systems with which Spark pairs nicely, and will also explain why Spark is needed for the work ahead.
Microservices, Containers, and Machine Learning by Paco Nathan
Session talk for Data Day Texas 2015, showing GraphX and SparkSQL for text analytics and graph analytics of an Apache developer email list -- including an implementation of TextRank in Spark.
This document provides an introduction to Apache Spark, including its history and key concepts. It discusses how Spark grew out of big data processing needs and how it builds upon earlier systems like Google's MapReduce. The document then covers Spark's core abstractions like RDDs and DataFrames/Datasets and common transformations and actions. It also provides an overview of Spark SQL and how to deploy Spark applications on a cluster.
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms by DataStax Academy
Apache Spark has grown to be one of the largest open source communities in big data, with over 190 developers and dozens of companies contributing. The latest 1.0 release alone includes contributions from 117 people. A clean API, interactive shell, distributed in-memory computation, stream processing, interactive SQL, and libraries delivering everything from machine learning to graph processing make it an excellent unified platform to solve a number of problems. Apache Spark works very well with a growing number of big data solutions, including Cassandra and Hadoop. Come learn about Apache Spark and see how easy it is for you to get started using Spark to build your own high performance big data applications today.
This presentation is an introduction to Apache Spark. It covers the basic API, some advanced features and describes how Spark physically executes its jobs.
Apache Spark - Sneha Challa - Google Pittsburgh - Aug 25th by Sneha Challa
The document is a presentation about Apache Spark given on August 25th, 2015 in Pittsburgh by Sneha Challa. It introduces Spark as a fast and general cluster computing engine for large-scale data processing. It discusses Spark's Resilient Distributed Datasets (RDDs) and transformations/actions. It provides examples of Spark APIs like map and reduce, and explains running Spark on standalone, Mesos, YARN, or EC2 clusters. It also covers Spark libraries like MLlib and running machine learning algorithms like k-means clustering and logistic regression.
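As an illustration of the MLlib usage mentioned above, here is a hedged k-means sketch using the RDD-based API; the input path, k=3, and 20 iterations are placeholder assumptions:

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

val points = sc.textFile("/data/points.csv")
               .map(line => Vectors.dense(line.split(",").map(_.toDouble)))
               .cache()                                       // k-means is iterative, so keep the data in memory

val model = KMeans.train(points, 3, 20)                       // k clusters, max iterations
model.clusterCenters.foreach(println)
println(s"within-set sum of squared errors: ${model.computeCost(points)}")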
This document provides an introduction and overview of Apache Spark. It discusses why in-memory computing is important for speed, compares Spark and Ignite, describes what Spark is and how it works using Resilient Distributed Datasets (RDDs) and a directed acyclic graph (DAG) model. It also provides examples of Spark operations on RDDs and shows a word count example in Java, Scala and Python.
Sparkcamp @ Strata CA: Intro to Apache Spark with Hands-on Tutorials by Databricks
The document provides an outline for the Spark Camp @ Strata CA tutorial. The morning session will cover introductions and getting started with Spark, an introduction to MLlib, and exercises on working with Spark on a cluster and notebooks. The afternoon session will cover Spark SQL, visualizations, Spark streaming, building Scala applications, and GraphX examples. The tutorial will be led by several instructors from Databricks and include hands-on coding exercises.
Author: Stefan Papp, Data Architect at “The unbelievable Machine Company”. An overview of Big Data processing engines with a focus on Apache Spark and Apache Flink, given at a Vienna Data Science Group meeting on 26 January 2017. The following questions are addressed:
• What are big data processing paradigms and how do Spark 1.x/Spark 2.x and Apache Flink solve them?
• When to use batch and when stream processing?
• What is a Lambda-Architecture and a Kappa Architecture?
• What are the best practices for your project?
This document summarizes Spark, a fast and general engine for large-scale data processing. Spark addresses limitations of MapReduce by supporting efficient sharing of data across parallel operations in memory. Resilient distributed datasets (RDDs) allow data to persist across jobs for faster iterative algorithms and interactive queries. Spark provides APIs in Scala and Java for programming RDDs and a scheduler to optimize jobs. It integrates with existing Hadoop clusters and scales to petabytes of data.
Human in the loop: a design pattern for managing teams working with ML by Paco Nathan
Strata CA 2018-03-08
https://conferences.oreilly.com/strata/strata-ca/public/schedule/detail/64223
Although it has long been used for use cases like simulation, training, and UX mockups, human-in-the-loop (HITL) has emerged as a key design pattern for managing teams where people and machines collaborate. One approach, active learning (a special case of semi-supervised learning), employs mostly automated processes based on machine learning models, but exceptions are referred to human experts, whose decisions help improve new iterations of the models.
Human-in-the-loop: a design pattern for managing teams that leverage ML by Paco Nathan
Strata Singapore 2017 session talk 2017-12-06
https://conferences.oreilly.com/strata/strata-sg/public/schedule/detail/65611
Human-in-the-loop is an approach which has been used for simulation, training, UX mockups, etc. A more recent design pattern is emerging for human-in-the-loop (HITL) as a way to manage teams working with machine learning (ML). A variant of semi-supervised learning called active learning allows for mostly automated processes based on ML, where exceptions get referred to human experts. Those human judgements in turn help improve new iterations of the ML models.
This talk reviews key case studies about active learning, plus other approaches for human-in-the-loop which are emerging among AI applications. We’ll consider some of the technical aspects — including available open source projects — as well as management perspectives for how to apply HITL:
* When is HITL indicated vs. when isn’t it applicable?
* How do HITL approaches compare/contrast with more “typical” use of Big Data?
* What’s the relationship between use of HITL and preparing an organization to leverage Deep Learning?
* Experiences training and managing a team which uses HITL at scale
* Caveats to know ahead of time:
* In what ways do the humans involved learn from the machines?
* In particular, we’ll examine use cases at O’Reilly Media where ML pipelines for categorizing content are trained by subject matter experts providing examples, based on HITL and leveraging open source [Project Jupyter](https://jupyter.org/) for implementation.
Human-in-a-loop: a design pattern for managing teams which leverage ML by Paco Nathan
Human-in-a-loop: a design pattern for managing teams which leverage ML
Big Data Spain, 2017-11-16
https://www.bigdataspain.org/2017/talk/human-in-the-loop-a-design-pattern-for-managing-teams-which-leverage-ml
Human-in-the-loop is an approach which has been used for simulation, training, UX mockups, etc. A more recent design pattern is emerging for human-in-the-loop (HITL) as a way to manage teams working with machine learning (ML). A variant of semi-supervised learning called _active learning_ allows for mostly automated processes based on ML, where exceptions get referred to human experts. Those human judgements in turn help improve new iterations of the ML models.
This talk reviews key case studies about active learning, plus other approaches for human-in-the-loop which are emerging among AI applications. We'll consider some of the technical aspects -- including available open source projects -- as well as management perspectives for how to apply HITL:
* When is HITL indicated vs. when isn't it applicable?
* How do HITL approaches compare/contrast with more "typical" use of Big Data?
* What's the relationship between use of HITL and preparing an organization to leverage Deep Learning?
* Experiences training and managing a team which uses HITL at scale
* Caveats to know ahead of time
* In what ways do the humans involved learn from the machines?
In particular, we'll examine use cases at O'Reilly Media where ML pipelines for categorizing content are trained by subject matter experts providing examples, based on HITL and leveraging open source [Project Jupyter](https://jupyter.org/) for implementation.
Humans in a loop: Jupyter notebooks as a front-end for AI by Paco Nathan
JupyterCon NY 2017-08-24
https://www.safaribooksonline.com/library/view/jupytercon-2017-/9781491985311/video313210.html
Paco Nathan reviews use cases where Jupyter provides a front-end to AI as the means for keeping "humans in the loop". This talk introduces *active learning* and the "human-in-the-loop" design pattern for managing how people and machines collaborate in AI workflows, including several case studies.
The talk also explores how O'Reilly Media leverages AI in Media, and in particular some of our use cases for active learning such as disambiguation in content discovery. We're using Jupyter as a way to manage active learning ML pipelines, where the machines generally run automated until they hit an edge case and refer the judgement back to human experts. In turn, the experts train the ML pipelines purely through examples, not feature engineering, model parameters, etc.
Jupyter notebooks serve as one part configuration file, one part data sample, one part structured log, one part data visualization tool. O'Reilly has released an open source project on GitHub called `nbtransom` which builds atop `nbformat` and `pandas` for our active learning use cases.
This work anticipates upcoming work on collaborative documents in JupyterLab, based on Google Drive. In other words, where the machines and people are collaborators on shared documents.
Humans in the loop: AI in open source and industry by Paco Nathan
Nike Tech Talk, Portland, 2017-08-10
https://niketechtalks-aug2017.splashthat.com/
O'Reilly Media gets to see the forefront of trends in artificial intelligence: what the leading teams are working on, which use cases are getting the most traction, previews of advances before they get announced on stage. Through conferences, publishing, and training programs, we've been assembling resources for anyone who wants to learn. An excellent recent example: Generative Adversarial Networks for Beginners, by Jon Bruner.
This talk covers current trends in AI, industry use cases, and recent highlights from the AI Conf series presented by O'Reilly and Intel, plus related materials from Safari learning platform, Strata Data, Data Show, and the upcoming JupyterCon.
Along with reporting, we're leveraging AI in Media. This talk dives into O'Reilly uses of deep learning -- combined with ontology, graph algorithms, probabilistic data structures, and even some evolutionary software -- to help editors and customers alike accomplish more of what they need to do.
In particular, we'll show two open source projects in Python from O'Reilly's AI team:
• pytextrank built atop spaCy, NetworkX, datasketch, providing graph algorithms for advanced NLP and text analytics
• nbtransom leveraging Project Jupyter for a human-in-the-loop design pattern approach to AI work: people and machines collaborating on content annotation
Lessons learned from 3 (going on 4) generations of Jupyter use cases at O'Reilly Media. In particular, about "Oriole" tutorials which combine video with Jupyter notebooks, Docker containers, backed by services managed on a cluster by Marathon, Mesos, Redis, and Nginx.
https://conferences.oreilly.com/fluent/fl-ca/public/schedule/detail/62859
https://conferences.oreilly.com/velocity/vl-ca/public/schedule/detail/62858
O'Reilly Media has experimented with different uses of Jupyter notebooks in their publications and learning platforms. Their latest approach embeds notebooks with video narratives in online "Oriole" tutorials, allowing authors to create interactive, computable content. This new medium blends code, data, text, and video into narrated learning experiences that run in isolated Docker containers for higher engagement. Some best practices for using notebooks in teaching include focusing on concise concepts, chunking content, and alternating between text, code, and outputs to keep explanations clear and linear.
See 2020 update: https://derwen.ai/s/h88s
SF Python Meetup, 2017-02-08
https://www.meetup.com/sfpython/events/237153246/
PyTextRank is a pure Python open source implementation of *TextRank*, based on the [Mihalcea 2004 paper](http://web.eecs.umich.edu/~mihalcea/papers/mihalcea.emnlp04.pdf) -- a graph algorithm which produces ranked keyphrases from texts. Keyphrases are generally more useful than simple keyword extraction. PyTextRank integrates use of `TextBlob` and `SpaCy` for NLP analysis of texts, including full parse, named entity extraction, etc. It also produces auto-summarization of texts, making use of an approximation algorithm, `MinHash`, for better performance at scale. Overall, the package is intended to complement machine learning approaches -- specifically deep learning used for custom search and recommendations -- by developing better feature vectors from raw texts. This package is in production use at O'Reilly Media for text analytics.
Use of standards and related issues in predictive analytics by Paco Nathan
My presentation at KDD 2016 in SF, in the "Special Session on Standards in Predictive Analytics In the Era of Big and Fast Data" morning track about PMML and PFA http://dmg.org/kdd2016.html
The document discusses how data science may reinvent learning and education. It begins with background on the author's experience in data teams and teaching. It then questions what an "Uber for education" may look like and discusses definitions of learning, education, and schools. The author argues interactive notebooks like Project Jupyter and flipped classrooms can improve learning at scale compared to traditional lectures or MOOCs. Content toolchains combining Jupyter, Thebe, Atlas and Docker are proposed for authoring and sharing computational narratives and code-as-media.
Jupyter for Education: Beyond Gutenberg and Erasmus by Paco Nathan
O'Reilly Learning is focusing on evolving learning experiences using Jupyter notebooks. Jupyter notebooks allow combining code, outputs, and explanations in a single document. O'Reilly is using Jupyter notebooks as a new authoring environment and is exploring features like computational narratives, code as a medium for teaching, and interactive online learning environments. The goal is to provide a better learning architecture and content workflow that leverages the capabilities of Jupyter notebooks.
GalvanizeU Seattle: Eleven Almost-Truisms About Data by Paco Nathan
http://www.meetup.com/Seattle-Data-Science/events/223445403/
Almost a dozen almost-truisms about Data that almost everyone should consider carefully as they embark on a journey into Data Science. There are a number of preconceptions about working with data at scale where the realities beg to differ. This talk estimates that number to be at least eleven, though probably much larger. At least that number has a great line from a movie. Let's consider some of the less-intuitive directions in which this field is heading, along with likely consequences and corollaries -- especially for those who are just now beginning to study the technologies, the processes, and the people involved.
Microservices, containers, and machine learning by Paco Nathan
http://www.oscon.com/open-source-2015/public/schedule/detail/41579
In this presentation, an open source developer community considers itself algorithmically. This shows how to surface data insights from the developer email forums for just about any Apache open source project. It leverages advanced techniques for natural language processing, machine learning, graph algorithms, time series analysis, etc. As an example, we use data from the Apache Spark email list archives to help understand its community better; however, the code can be applied to many other communities.
Exsto is an open source project that demonstrates Apache Spark workflow examples for SQL-based ETL (Spark SQL), machine learning (MLlib), and graph algorithms (GraphX). It surfaces insights about developer communities from their email forums. Natural language processing services in Python (based on NLTK, TextBlob, WordNet, etc.) get containerized and used to crawl and parse email archives. These produce JSON data sets, which are then run through machine learning on a Spark cluster to surface insights such as:
* What are the trending topic summaries?
* Who are the leaders in the community for various topics?
* Who discusses most frequently with whom?
This talk shows how to use cloud-based notebooks for organizing and running the analytics and visualizations. It reviews the background for how and why the graph analytics and machine learning algorithms generalize patterns within the data — based on open source implementations for two advanced approaches, Word2Vec and TextRank. The talk also illustrates best practices for leveraging functional programming for big data.
GraphX: Graph analytics for insights about developer communities by Paco Nathan
The document provides an overview of Graph Analytics in Spark. It discusses Spark components and key distinctions from MapReduce. It also covers GraphX terminology and examples of composing node and edge RDDs into a graph. The document provides examples of simple traversals and routing problems on graphs. It discusses using GraphX for topic modeling with LDA and provides further reading resources on GraphX, algebraic graph theory, and graph analysis tools and frameworks.
Graph analytics can be used to analyze a social graph constructed from email messages on the Spark user mailing list. Key metrics like PageRank, in-degrees, and strongly connected components can be computed using the GraphX API in Spark. For example, PageRank was computed on the 4Q2014 email graph, identifying the top contributors to the mailing list.
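A hedged GraphX sketch of those metrics follows; the edge-list path and the tolerance/iteration values are assumptions about the setup rather than details from the document:

import org.apache.spark.graphx.GraphLoader

val graph = GraphLoader.edgeListFile(sc, "/data/spark-user-list-4q2014.edges")

val ranks = graph.pageRank(0.0001).vertices                   // PageRank per participant
val top   = ranks.sortBy(_._2, ascending = false).take(10)    // top contributors by rank
top.foreach { case (id, rank) => println(f"$id%d -> $rank%.4f") }

val inDegrees = graph.inDegrees                               // in-degree per participant
val scc       = graph.stronglyConnectedComponents(5).vertices // strongly connected components (5 iterations)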
Apache Spark and the Emerging Technology Landscape for Big Data by Paco Nathan
The document discusses Apache Spark and its role in big data and emerging technologies for big data. It provides background on MapReduce and the emergence of specialized systems. It then discusses how Spark provides a unified engine for batch processing, iterative jobs, SQL queries, streaming, and more. It can simplify programming by using a functional approach. The document also discusses Spark's architecture and performance advantages over other frameworks.
QCon São Paulo: Real-Time Analytics with Spark Streaming by Paco Nathan
The document provides an overview of real-time analytics using Spark Streaming. It discusses Spark Streaming's micro-batch approach of treating streaming data as a series of small batch jobs. This allows for low-latency analysis while integrating streaming and batch processing. The document also covers Spark Streaming's fault tolerance mechanisms and provides several examples of companies like Pearson, Guavus, and Sharethrough using Spark Streaming for real-time analytics in production environments.
Strata 2015 Data Preview: Spark, Data Visualization, YARN, and More by Paco Nathan
Spark and Databricks component of the O'Reilly Media webcast "2015 Data Preview: Spark, Data Visualization, YARN, and More", as a preview of the 2015 Strata + Hadoop World conference in San Jose http://www.oreilly.com/pub/e/3289
A New Year in Data Science: ML Unpaused by Paco Nathan
This document summarizes Paco Nathan's presentation at Data Day Texas in 2015. Some key points:
- Paco Nathan discussed observations and trends from the past year in machine learning, data science, big data, and open source technologies.
- He argued that the definitions of data science and statistics are flawed and ignore important areas like development, visualization, and modeling real-world business problems.
- The presentation covered topics like functional programming approaches, streaming approximations, and the importance of an interdisciplinary approach combining computer science, statistics, and other fields like physics.
- Paco Nathan advocated for newer probabilistic techniques for analyzing large datasets that provide approximations using less resources compared to traditional batch processing approaches.
CloudGenesis Cloud Workshop, GDG on Campus MITA by siyaldhande02
Step into the future of cloud computing with CloudGenesis, a power-packed workshop curated by GDG on Campus MITA, designed to equip students and aspiring cloud professionals with hands-on experience in Google Cloud Platform (GCP), Microsoft Azure, and Azure AI services.
This workshop offers a rare opportunity to explore real-world multi-cloud strategies, dive deep into cloud deployment practices, and harness the potential of AI-powered cloud solutions. Through guided labs and live demonstrations, participants will gain valuable exposure to both platforms, enabling them to think beyond silos and embrace a cross-cloud approach to development and innovation.
Annual (33 years) study of the Israeli enterprise / public IT market. Covering sections on the Israeli economy, IT trends 2026-28, several surveys (AI, CDOs, OCIO, CTO, staffing, cyber, operations and infra) plus rankings of 760 vendors on 160 markets (market sizes and trends) and comparison of products according to support and market penetration.
European Accessibility Act & Integrated Accessibility Testing by Julia Undeutsch
Emma Dawson will guide you through two important topics in this session.
Firstly, she will prepare you for the European Accessibility Act (EAA), which comes into effect on 28 June 2025, and show you how development teams can prepare for it.
In the second part of the webinar, Emma Dawson will explore with you various integrated testing methods and tools that will help you improve accessibility during the development cycle, such as Linters, Storybook, Playwright, just to name a few.
Focus: European Accessibility Act, Integrated Testing tools and methods (e.g. Linters, Storybook, Playwright)
Target audience: Everyone, Developers, Testers
Unlock your organization’s full potential with the 2025 Digital Adoption Blueprint. Discover proven strategies to streamline software onboarding, boost productivity, and drive enterprise-wide digital transformation.
Marko.js - Unsung Hero of Scalable Web Frameworks (DevDays 2025) by Eugene Fidelin
Marko.js is an open-source JavaScript framework created by eBay back in 2014. It offers super-efficient server-side rendering, making it ideal for big e-commerce sites and other multi-page apps where speed and SEO really matter. After over 10 years of development, Marko has some standout features that make it an interesting choice. In this talk, I’ll dive into these unique features and showcase some of Marko's innovative solutions. You might not use Marko.js at your company, but there’s still a lot you can learn from it to bring to your next project.
Cyber Security Legal Framework in Nepal.pptx by Ghimire B.R.
The presentation reviews the existing legal framework on cyber security in Nepal, highlighting the strengths and weaknesses of the major acts and policies so far. It further highlights the need for a data protection act.
Contributing to WordPress With & Without Code.pptx by Patrick Lumumba
Contributing to WordPress: Making an Impact on the Test Team—With or Without Coding Skills
WordPress survives on collaboration, and the Test Team plays a very important role in ensuring the CMS is stable, user-friendly, and accessible to everyone.
This talk aims to deconstruct the myth that one has to be a developer to contribute to WordPress. In this session, I will share with the audience how to get involved with the WordPress Team, whether a coder or not.
We’ll explore practical ways to contribute, from testing new features, and patches, to reporting bugs. By the end of this talk, the audience will have the tools and confidence to make a meaningful impact on WordPress—no matter the skill set.
"AI in the browser: predicting user actions in real time with TensorflowJS", ...Fwdays
With AI becoming increasingly present in our everyday lives, the latest advancements in the field now make it easier than ever to integrate it into our software projects. In this session, we’ll explore how machine learning models can be embedded directly into front-end applications. We'll walk through practical examples, including running basic models such as linear regression and random forest classifiers, all within the browser environment.
Once we grasp the fundamentals of running ML models on the client side, we’ll dive into real-world use cases for web applications—ranging from real-time data classification and interpolation to object tracking in the browser. We'll also introduce a novel approach: dynamically optimizing web applications by predicting user behavior in real time using a machine learning model. This opens the door to smarter, more adaptive user experiences and can significantly improve both performance and engagement.
In addition to the technical insights, we’ll also touch on best practices, potential challenges, and the tools that make browser-based machine learning development more accessible. Whether you're a developer looking to experiment with ML or someone aiming to bring more intelligence into your web apps, this session will offer practical takeaways and inspiration for your next project.
New Ways to Reduce Database Costs with ScyllaDB by ScyllaDB
How ScyllaDB’s latest capabilities can reduce your infrastructure costs
ScyllaDB has been obsessed with price-performance from day 1. Our core database is architected with low-level engineering optimizations that squeeze every ounce of power from the underlying infrastructure. And we just completed a multi-year effort to introduce a set of new capabilities for additional savings.
Join this webinar to learn about these new capabilities: the underlying challenges we wanted to address, the workloads that will benefit most from each, and how to get started. We’ll cover ways to:
- Avoid overprovisioning with “just-in-time” scaling
- Safely operate at up to ~90% storage utilization
- Cut network costs with new compression strategies and file-based streaming
We’ll also highlight a “hidden gem” capability that lets you safely balance multiple workloads in a single cluster. To conclude, we will share the efficiency-focused capabilities on our short-term and long-term roadmaps.
Multistream in SIP and NoSIP @ OpenSIPS Summit 2025 by Lorenzo Miniero
Slides for my "Multistream support in the Janus SIP and NoSIP plugins" presentation at the OpenSIPS Summit 2025 event.
They describe my efforts refactoring the Janus SIP and NoSIP plugins to allow for the gatewaying of an arbitrary number of audio/video streams per call (thus breaking the current 1-audio/1-video limitation), plus some additional considerations on what this could mean when dealing with application protocols negotiated via SIP as well.
AI Emotional Actors: “When Machines Learn to Feel and Perform” by AkashKumar809858
Welcome to the era of AI Emotional Actors.
The entertainment landscape is undergoing a seismic transformation. What started as motion capture and CGI enhancements has evolved into a full-blown revolution: synthetic beings not only perform but express, emote, and adapt in real time.
For reading further follow this link -
https://akash97.gumroad.com/l/meioex
Agentic AI - The New Era of Intelligence by Muzammil Shah
This presentation is specifically designed to introduce final-year university students to the foundational principles of Agentic Artificial Intelligence (AI). It aims to provide a clear understanding of how Agentic AI systems function, their key components, and the underlying technologies that empower them. By exploring real-world applications and emerging trends, the session will equip students with essential knowledge to engage with this rapidly evolving area of AI, preparing them for further study or professional work in the field.
With Claude 4, Anthropic redefines AI capabilities, effectively unleashing a ...SOFTTECHHUB
With the introduction of Claude Opus 4 and Sonnet 4, Anthropic's newest generation of AI models is not just an incremental step but a pivotal moment, fundamentally reshaping what's possible in software development, complex problem-solving, and intelligent business automation.
Supercharge Your AI Development with Local LLMsFrancesco Corti
In today's AI development landscape, developers face significant challenges when building applications that leverage powerful large language models (LLMs) through SaaS platforms like ChatGPT, Gemini, and others. While these services offer impressive capabilities, they come with substantial costs that can quickly escalate especially during the development lifecycle. Additionally, the inherent latency of web-based APIs creates frustrating bottlenecks during the critical testing and iteration phases of development, slowing down innovation and frustrating developers.
This talk will introduce the transformative approach of integrating local LLMs directly into their development environments. By bringing these models closer to where the code lives, developers can dramatically accelerate development lifecycles while maintaining complete control over model selection and configuration. This methodology effectively reduces costs to zero by eliminating dependency on pay-per-use SaaS services, while opening new possibilities for comprehensive integration testing, rapid prototyping, and specialized use cases.
AI in Java - MCP in Action, Langchain4J-CDI, SmallRye-LLM, Spring AIBuhake Sindi
This is the presentation I gave with regards to AI in Java, and the work that I have been working on. I've showcased Model Context Protocol (MCP) in Java, creating server-side MCP server in Java. I've also introduced Langchain4J-CDI, previously known as SmallRye-LLM, a CDI managed too to inject AI services in enterprise Java applications. Also, honourable mention: Spring AI.
Measuring Microsoft 365 Copilot and Gen AI SuccessNikki Chapple
Session | Measuring Microsoft 365 Copilot and Gen AI Success with Viva Insights and Purview
Presenter | Nikki Chapple 2 x MVP and Principal Cloud Architect at CloudWay
Event | European Collaboration Conference 2025
Format | In person Germany
Date | 28 May 2025
📊 Measuring Copilot and Gen AI Success with Viva Insights and Purview
Presented by Nikki Chapple – Microsoft 365 MVP & Principal Cloud Architect, CloudWay
How do you measure the success—and manage the risks—of Microsoft 365 Copilot and Generative AI (Gen AI)? In this ECS 2025 session, Microsoft MVP and Principal Cloud Architect Nikki Chapple explores how to go beyond basic usage metrics to gain full-spectrum visibility into AI adoption, business impact, user sentiment, and data security.
🎯 Key Topics Covered:
Microsoft 365 Copilot usage and adoption metrics
Viva Insights Copilot Analytics and Dashboard
Microsoft Purview Data Security Posture Management (DSPM) for AI
Measuring AI readiness, impact, and sentiment
Identifying and mitigating risks from third-party Gen AI tools
Shadow IT, oversharing, and compliance risks
Microsoft 365 Admin Center reports and Copilot Readiness
Power BI-based Copilot Business Impact Report (Preview)
📊 Why AI Measurement Matters: Without meaningful measurement, organizations risk operating in the dark—unable to prove ROI, identify friction points, or detect compliance violations. Nikki presents a unified framework combining quantitative metrics, qualitative insights, and risk monitoring to help organizations:
Prove ROI on AI investments
Drive responsible adoption
Protect sensitive data
Ensure compliance and governance
🔍 Tools and Reports Highlighted:
Microsoft 365 Admin Center: Copilot Overview, Usage, Readiness, Agents, Chat, and Adoption Score
Viva Insights Copilot Dashboard: Readiness, Adoption, Impact, Sentiment
Copilot Business Impact Report: Power BI integration for business outcome mapping
Microsoft Purview DSPM for AI: Discover and govern Copilot and third-party Gen AI usage
🔐 Security and Compliance Insights: Learn how to detect unsanctioned Gen AI tools like ChatGPT, Gemini, and Claude, track oversharing, and apply eDLP and Insider Risk Management (IRM) policies. Understand how to use Microsoft Purview—even without E5 Compliance—to monitor Copilot usage and protect sensitive data.
📈 Who Should Watch: This session is ideal for IT leaders, security professionals, compliance officers, and Microsoft 365 admins looking to:
Maximize the value of Microsoft Copilot
Build a secure, measurable AI strategy
Align AI usage with business goals and compliance requirements
🔗 Read the blog https://nikkichapple.com/measuring-copilot-gen-ai/
3. What is Spark?
Developed in 2009 at UC Berkeley AMPLab, then
open sourced in 2010, Spark has since become
one of the largest OSS communities in big data,
with over 200 contributors in 50+ organizations
spark.apache.org
“Organizations that are looking at big data challenges –
including collection, ETL, storage, exploration and analytics –
should consider Spark for its in-memory performance and
the breadth of its model. It supports advanced analytics
solutions on Hadoop clusters, including the iterative model
required for machine learning and graph analysis.”
Gartner, Advanced Analytics and Data Science (2014)
5. What is Spark?
Spark Core is the general execution engine for the
Spark platform that other functionality is built atop:
• in-memory computing capabilities deliver speed
• general execution model supports wide variety
of use cases
• ease of development – native APIs in Java, Scala,
Python (+ SQL, Clojure, R)
6. What is Spark?
WordCount in 3 lines of Spark
WordCount in 50+ lines of Java MR
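For reference, the three-line Spark version alluded to here looks roughly like the following in the Scala shell (a minimal sketch; the HDFS paths are placeholders, not from the slides):
// read text, split it into words, count each word, and save the result
val text = sc.textFile("hdfs://...")
val counts = text.flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://...")
The equivalent Java MapReduce job needs a mapper class, a reducer class, and driver boilerplate, which is where the 50+ lines come from.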
7. What is Spark?
Sustained exponential growth, as one of the most
active Apache projects (ohloh.net/orgs/apache)
9. A Brief History: Functional Programming for Big Data
Theory, Eight Decades Ago:
what can be computed?
Haskell Curry
haskell.org
Alonzo Church
wikipedia.org
Praxis, Four Decades Ago:
algebra for applicative systems
John Backus
acm.org
David Turner
wikipedia.org
Reality, Two Decades Ago:
machine data from web apps
Pattie Maes
MIT Media Lab
10. A Brief History: Functional Programming for Big Data
The Big Data Problem –
A single machine can no longer
process or even store all the data!
The most feasible approach is to
distribute over large clusters…
12. A Brief History: Functional Programming for Big Data
circa 2002:
mitigate risk of large distributed workloads lost
due to disk failures on commodity hardware…
Google File System
Sanjay Ghemawat, Howard Gobioff, Shun-Tak Leung
research.google.com/archive/gfs.html
MapReduce: Simplified Data Processing on Large Clusters
Jeffrey Dean, Sanjay Ghemawat
research.google.com/archive/mapreduce.html
13. A Brief History: Functional Programming for Big Data
2002: MapReduce @ Google
2004: MapReduce paper
2006: Hadoop @ Yahoo!
2008: Hadoop Summit
2010: Spark paper
2014: Apache Spark top-level
14. A Brief History: Functional Programming for Big Data
General Batch Processing: MapReduce
Specialized Systems (iterative, interactive, streaming, graph, etc.):
Pregel, Giraph, Dremel, Drill, Impala, S4, Storm, F1, MillWheel, GraphLab, Tez
MR doesn’t compose well for large applications,
and so specialized systems emerged as workarounds
15. A Brief History: Functional Programming for Big Data
circa 2010:
a unified engine for enterprise data workflows,
based on the commodity hardware of a decade later…
Spark: Cluster Computing with Working Sets
Matei Zaharia, Mosharaf Chowdhury,
Michael Franklin, Scott Shenker, Ion Stoica
people.csail.mit.edu/matei/papers/2010/hotcloud_spark.pdf
Resilient Distributed Datasets: A Fault-Tolerant Abstraction for
In-Memory Cluster Computing
Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave,
Justin Ma, Murphy McCauley, Michael Franklin, Scott Shenker, Ion Stoica
usenix.org/system/files/conference/nsdi12/nsdi12-final138.pdf
16. A Brief History: Functional Programming for Big Data
In addition to simple map and reduce operations,
Spark supports SQL queries, streaming data, and
complex analytics such as machine learning and
graph algorithms out-of-the-box.
Better yet, combine these capabilities seamlessly
into one integrated workflow…
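As a rough illustration of such a combined workflow (a hypothetical sketch, not taken from the deck; it assumes an existing SparkContext sc, as in the other examples), a single program can train a model with MLlib and then score a live stream with Spark Streaming:
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// train a toy k-means model on a static RDD (placeholder data)
val model = KMeans.train(sc.parallelize(Seq(Vectors.dense(0.0), Vectors.dense(10.0))), 2, 10)

// apply the model to a live stream of comma-separated feature lines
val ssc = new StreamingContext(sc, Seconds(10))
val events = ssc.socketTextStream("localhost", 9999)
val scored = events.map(line => Vectors.dense(line.split(",").map(_.toDouble)))
                   .map(v => (v, model.predict(v)))
scored.print()
ssc.start()
ssc.awaitTermination()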
17. TL;DR: Generational trade-offs for handling Big Compute
Cheap Memory → recompute (RDD)
Cheap Storage → replicate (DFS)
Cheap Network → reference (URI)
18. TL;DR: Applicative Systems and Functional Programming – RDDs
[diagram: a chain of RDDs built by transformations, with an action returning a value to the driver]
// transformed RDDs
val errors = lines.filter(_.startsWith("ERROR"))
val messages = errors.map(_.split("\t")).map(r => r(1))
messages.cache()

// action 1
messages.filter(_.contains("mysql")).count()
19. A Brief History: Smashing The Previous Petabyte Sort Record
databricks.com/blog/2014/10/10/spark-petabyte-sort.html
21. Spark Deconstructed: Log Mining Example
// load error messages from a log into memory,
// then interactively search for various patterns
// https://gist.github.com/ceteri/8ae5b9509a08c08a1132

// base RDD
val lines = sc.textFile("hdfs://...")

// transformed RDDs
val errors = lines.filter(_.startsWith("ERROR"))
val messages = errors.map(_.split("\t")).map(r => r(1))
messages.cache()

// action 1
messages.filter(_.contains("mysql")).count()

// action 2
messages.filter(_.contains("php")).count()
22. Spark Deconstructed: Log Mining Example
[diagram: Driver coordinating three Worker nodes]
We start with Spark running on a cluster…
submitting code to be evaluated on it:
23. Spark Deconstructed: Log Mining Example
// base RDD
val lines = sc.textFile("hdfs://...")

// transformed RDDs
val errors = lines.filter(_.startsWith("ERROR"))
val messages = errors.map(_.split("\t")).map(r => r(1))
messages.cache()

// action 1
messages.filter(_.contains("mysql")).count()

// action 2 (the other part, discussed later)
messages.filter(_.contains("php")).count()
24. Spark Deconstructed: Log Mining Example
At this point, take a look at the transformed
RDD operator graph:
scala> messages.toDebugString
res5: String =
MappedRDD[4] at map at <console>:16 (3 partitions)
  MappedRDD[3] at map at <console>:16 (3 partitions)
    FilteredRDD[2] at filter at <console>:14 (3 partitions)
      MappedRDD[1] at textFile at <console>:12 (3 partitions)
        HadoopRDD[0] at textFile at <console>:12 (3 partitions)
25. Spark Deconstructed: Log Mining Example
[diagram: Driver coordinating three Worker nodes]
// base RDD
val lines = sc.textFile("hdfs://...")

// transformed RDDs
val errors = lines.filter(_.startsWith("ERROR"))
val messages = errors.map(_.split("\t")).map(r => r(1))
messages.cache()

// action 1
messages.filter(_.contains("mysql")).count()

// action 2 (the other part, discussed later)
messages.filter(_.contains("php")).count()
26. Spark Deconstructed: Log Mining Example
[diagram: Driver and three Workers, each holding one HDFS block (block 1, block 2, block 3)]
// base RDD
val lines = sc.textFile("hdfs://...")

// transformed RDDs
val errors = lines.filter(_.startsWith("ERROR"))
val messages = errors.map(_.split("\t")).map(r => r(1))
messages.cache()

// action 1
messages.filter(_.contains("mysql")).count()

// action 2 (the other part, discussed later)
messages.filter(_.contains("php")).count()
27. Spark Deconstructed: Log Mining Example
[diagram: Driver and three Workers, each holding one HDFS block (block 1, block 2, block 3)]
// base RDD
val lines = sc.textFile("hdfs://...")

// transformed RDDs
val errors = lines.filter(_.startsWith("ERROR"))
val messages = errors.map(_.split("\t")).map(r => r(1))
messages.cache()

// action 1
messages.filter(_.contains("mysql")).count()

// action 2 (the other part, discussed later)
messages.filter(_.contains("php")).count()
28. Spark Deconstructed: Log Mining Example
[diagram: each Worker reads its HDFS block (block 1, block 2, block 3)]
// base RDD
val lines = sc.textFile("hdfs://...")

// transformed RDDs
val errors = lines.filter(_.startsWith("ERROR"))
val messages = errors.map(_.split("\t")).map(r => r(1))
messages.cache()

// action 1
messages.filter(_.contains("mysql")).count()

// action 2 (the other part, discussed later)
messages.filter(_.contains("php")).count()
29. Spark Deconstructed: Log Mining Example
[diagram: each Worker processes its block and caches the data (cache 1, cache 2, cache 3)]
// base RDD
val lines = sc.textFile("hdfs://...")

// transformed RDDs
val errors = lines.filter(_.startsWith("ERROR"))
val messages = errors.map(_.split("\t")).map(r => r(1))
messages.cache()

// action 1
messages.filter(_.contains("mysql")).count()

// action 2 (the other part, discussed later)
messages.filter(_.contains("php")).count()
30. Spark Deconstructed: Log Mining Example
[diagram: Driver and three Workers, each with its HDFS block and cached data]
// base RDD
val lines = sc.textFile("hdfs://...")

// transformed RDDs
val errors = lines.filter(_.startsWith("ERROR"))
val messages = errors.map(_.split("\t")).map(r => r(1))
messages.cache()

// action 1
messages.filter(_.contains("mysql")).count()

// action 2 (the other part, discussed later)
messages.filter(_.contains("php")).count()
31. Spark Deconstructed: Log Mining Example (discussing the other part)
[diagram: Driver and three Workers, each with its HDFS block and cached data]
// base RDD
val lines = sc.textFile("hdfs://...")

// transformed RDDs
val errors = lines.filter(_.startsWith("ERROR"))
val messages = errors.map(_.split("\t")).map(r => r(1))
messages.cache()

// action 1
messages.filter(_.contains("mysql")).count()

// action 2
messages.filter(_.contains("php")).count()
32. Spark Deconstructed: Log Mining Example (discussing the other part)
[diagram: each Worker serves action 2 by processing from its cached data]
// base RDD
val lines = sc.textFile("hdfs://...")

// transformed RDDs
val errors = lines.filter(_.startsWith("ERROR"))
val messages = errors.map(_.split("\t")).map(r => r(1))
messages.cache()

// action 1
messages.filter(_.contains("mysql")).count()

// action 2
messages.filter(_.contains("php")).count()
35. Unifying the Pieces: Spark SQL
// http://spark.apache.org/docs/latest/sql-programming-guide.html

val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext._

// define the schema using a case class
case class Person(name: String, age: Int)

// create an RDD of Person objects and register it as a table
val people = sc.textFile("examples/src/main/resources/people.txt")
  .map(_.split(",")).map(p => Person(p(0), p(1).trim.toInt))

people.registerAsTable("people")

// SQL statements can be run using the SQL methods provided by sqlContext
val teenagers = sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")

// results of SQL queries are SchemaRDDs and support all the
// normal RDD operations…
// columns of a row in the result can be accessed by ordinal
teenagers.map(t => "Name: " + t(0)).collect().foreach(println)
36. Unifying the Pieces: Spark Streaming
// http://spark.apache.org/docs/latest/streaming-programming-guide.html

import org.apache.spark.streaming._
import org.apache.spark.streaming.StreamingContext._

// create a StreamingContext with a SparkConf configuration
val ssc = new StreamingContext(sparkConf, Seconds(10))

// create a DStream that will connect to serverIP:serverPort
val lines = ssc.socketTextStream(serverIP, serverPort)

// split each line into words
val words = lines.flatMap(_.split(" "))

// count each word in each batch
val pairs = words.map(word => (word, 1))
val wordCounts = pairs.reduceByKey(_ + _)

// print a few of the counts to the console
wordCounts.print()

ssc.start() // start the computation
ssc.awaitTermination() // wait for the computation to terminate
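The snippet above assumes that sparkConf, serverIP, and serverPort are already defined; a minimal setup might look like the following (the app name, master, host, and port are placeholders, not part of the slide):
import org.apache.spark.SparkConf

// local[2]: one thread for receiving, one for processing (assumed local run)
val sparkConf = new SparkConf().setAppName("NetworkWordCount").setMaster("local[2]")
val serverIP = "localhost"
val serverPort = 9999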
37. MLI: An API for Distributed Machine Learning
Evan Sparks, Ameet Talwalkar, et al.
International Conference on Data Mining (2013)
http://arxiv.org/abs/1310.5426
Unifying the Pieces: MLlib
// http://spark.apache.org/docs/latest/mllib-guide.html

val train_data = // RDD of Vector
val model = KMeans.train(train_data, k=10)

// evaluate the model
val test_data = // RDD of Vector
test_data.map(t => model.predict(t)).collect().foreach(println)
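The slide keeps the training and test RDDs schematic; a fuller sketch might look like the following (the input paths, parsing format, and iteration count are assumptions, with KMeans.train called using explicit k and maxIterations arguments):
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// parse space-separated feature values into dense vectors
val train_data = sc.textFile("hdfs://.../kmeans_train.txt")
  .map(line => Vectors.dense(line.split(" ").map(_.toDouble))).cache()

// train with k = 10 clusters and 20 iterations
val model = KMeans.train(train_data, 10, 20)

// evaluate the model on held-out data
val test_data = sc.textFile("hdfs://.../kmeans_test.txt")
  .map(line => Vectors.dense(line.split(" ").map(_.toDouble)))
test_data.map(t => model.predict(t)).collect().foreach(println)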
40. community:
spark.apache.org/community.html
video+slide archives: spark-summit.org
local events: Spark Meetups Worldwide
global events: goo.gl/2YqJZK
resources: databricks.com/spark-training-resources
workshops: databricks.com/spark-training
41. books:
Fast Data Processing with Spark
Holden Karau
Packt (2013)
shop.oreilly.com/product/9781782167068.do
Spark in Action
Chris Fregly
Manning (2015*)
sparkinaction.com/
Learning Spark
Holden Karau, Andy Konwinski, Matei Zaharia
O’Reilly (2015*)
shop.oreilly.com/product/0636920028512.do
42. certification:
Apache Spark developer certificate program
• http://oreilly.com/go/sparkcert
To prepare for the Spark certification exam, we recommend that you:
• are comfortable coding the advanced exercises in Spark Camp
or related training
• have mastered the material released so far in the O'Reilly book,
Learning Spark
• have some hands-on experience developing Spark apps in
production already
The test includes questions in Scala, Python, Java, and SQL. However,
deep proficiency in any of those languages is not required, since the
questions focus on Spark and its model of computation.
43. events:
Strata EU
Barcelona, Nov 19-21
strataconf.com/strataeu2014
Data Day Texas
Austin, Jan 10
datadaytexas.com
Strata CA
San Jose, Feb 18-20
strataconf.com/strata2015
Spark Summit East
NYC, Mar 18-19
spark-summit.org/east
Spark Summit 2015
SF, Jun 15-17
spark-summit.org