The document provides an overview of Graph Analytics in Spark. It discusses Spark components and key distinctions from MapReduce. It also covers GraphX terminology and examples of composing node and edge RDDs into a graph. The document provides examples of simple traversals and routing problems on graphs. It discusses using GraphX for topic modeling with LDA and provides further reading resources on GraphX, algebraic graph theory, and graph analysis tools and frameworks.
The document discusses using Apache Spark's GraphX library to analyze large graph datasets. It provides an overview of graph data structures and PageRank, describes how GraphX implements graph algorithms like PageRank using a Pregel-like approach, and demonstrates analyzing large street network graphs from OpenStreetMap data to compare cities based on normalized PageRank distributions.
GraphX is Apache Spark's library for graph analytics. It allows users to analyze large graphs in parallel across a cluster. Some key capabilities include calculating centrality metrics like PageRank to identify important nodes, finding shortest and longest paths between nodes, and breaking large graphs into smaller subgraphs for individual analysis. The library represents graphs as vertices connected by edges and can be used to model many real-world networks from social networks to citation networks to computer architectures.
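For a concrete picture of that API, here is a minimal GraphX sketch in Scala; the toy vertices, edges, and tolerance value are illustrative assumptions, not taken from any of the talks below:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph, VertexId}
import org.apache.spark.rdd.RDD

val sc = new SparkContext(new SparkConf().setAppName("pagerank-sketch").setMaster("local[*]"))

// Hypothetical toy graph: vertices carry a name, edges a link count
val vertices: RDD[(VertexId, String)] =
  sc.parallelize(Seq((1L, "alice"), (2L, "bob"), (3L, "carol")))
val edges: RDD[Edge[Int]] =
  sc.parallelize(Seq(Edge(1L, 2L, 1), Edge(2L, 3L, 1), Edge(3L, 1L, 1), Edge(1L, 3L, 1)))

val graph = Graph(vertices, edges)

// Run PageRank to a convergence tolerance, then list the most "important" vertices
val ranks = graph.pageRank(0.0001).vertices
ranks.join(vertices)
  .sortBy({ case (_, (rank, _)) => rank }, ascending = false)
  .take(3)
  .foreach(println)
```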
This document discusses exploring time-varying electrical networks using graph analytics frameworks Spark GraphX and Titan. It provides context on modeling electrical networks as graphs and describes challenges in querying time-varying graphs. It then summarizes the architectures of Spark GraphX and Titan and demonstrates how to query time-varying graphs using techniques like "Trickling" in both systems. Scaling challenges are discussed when applying these techniques to a large real-world electrical network dataset.
Graph analytics can be used to analyze a social graph constructed from email messages on the Spark user mailing list. Key metrics like PageRank, in-degrees, and strongly connected components can be computed using the GraphX API in Spark. For example, PageRank was computed on the 4Q2014 email graph, identifying the top contributors to the mailing list.
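A sketch of those metrics on a toy reply graph, reusing the SparkContext `sc` from the previous example; the vertex IDs and edge weights are invented, not the actual 4Q2014 data:

```scala
import org.apache.spark.graphx.{Edge, Graph}

// Hypothetical reply edges: (senderId -> repliedToId), weight = message count
val replies = sc.parallelize(Seq(
  Edge(1L, 2L, 3), Edge(3L, 2L, 1), Edge(2L, 1L, 2), Edge(4L, 2L, 1)))
val emailGraph = Graph.fromEdges(replies, defaultValue = 0)

// In-degree: who receives the most replies on the list
val topReplied = emailGraph.inDegrees.sortBy(_._2, ascending = false).take(10)

// Strongly connected components: clusters of people who all reply to each other
val scc = emailGraph.stronglyConnectedComponents(numIter = 10).vertices
```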
GraphX: Graph Analytics in Apache Spark (AMPCamp 5, 2014-11-20), by Ankur Dave
GraphX is a graph processing framework built into Apache Spark. This talk introduces GraphX, describes key features of its API, and gives an update on its status.
GraphFrames: Graph Queries in Spark SQL, by Ankur Dave (Spark Summit)
GraphFrames provides a unified API for graph queries and algorithms in Spark SQL. It translates graph patterns and algorithms to relational operations optimized by the Spark SQL query optimizer. Materializing the right views, such as the triplet view for GraphX algorithms or user-defined views for queries, can improve performance. An evaluation shows GraphFrames outperforms Neo4j for unanchored queries and approaches GraphX performance for graph algorithms using Spark SQL. Future work includes automatically suggesting optimal views and exploiting attribute-based partitioning.
Graphs are everywhere! Distributed graph computing with Spark GraphX, by Andrea Iacono
This document discusses GraphX, a graph processing system built on Apache Spark. It defines what graphs are, including vertices and edges. It explains that GraphX uses Resilient Distributed Datasets (RDDs) to keep data in memory for iterative graph algorithms. GraphX implements the Pregel computational model where each vertex can modify its state, receive and send messages to neighbors each superstep until halting. The document provides examples of graph algorithms and notes when GraphX is well-suited versus a graph database.
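The canonical Pregel example from the GraphX programming guide, single-source shortest paths, shows the vertex-program/send/merge structure of each superstep (the toy graph is invented):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx._

val sc = new SparkContext(new SparkConf().setAppName("pregel-sssp").setMaster("local[*]"))

// Toy weighted graph (edge attribute = distance)
val graph: Graph[Int, Double] = Graph.fromEdges(
  sc.parallelize(Seq(Edge(1L, 2L, 1.0), Edge(2L, 3L, 2.0), Edge(1L, 3L, 5.0))),
  defaultValue = 0)

val sourceId: VertexId = 1L
val init = graph.mapVertices((id, _) => if (id == sourceId) 0.0 else Double.PositiveInfinity)

val sssp = init.pregel(Double.PositiveInfinity)(
  (id, dist, newDist) => math.min(dist, newDist),            // vertex program: keep best distance
  t => if (t.srcAttr + t.attr < t.dstAttr)                   // send improved distances to neighbors
         Iterator((t.dstId, t.srcAttr + t.attr))
       else Iterator.empty,
  (a, b) => math.min(a, b))                                  // merge candidate distances

sssp.vertices.collect.foreach(println)   // (3,3.0): 1 -> 2 -> 3 beats the direct 5.0 edge
```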
GraphFrames: Graph Queries in Spark SQL (Spark Summit)
GraphFrames provides a unified API for graph queries and algorithms in Apache Spark SQL. It translates graph patterns and algorithms to relational operations optimized by the Spark SQL query optimizer. Materialized views can greatly improve performance of graph queries by enabling efficient join elimination and reordering. An evaluation found GraphFrames outperforms Neo4j for unanchored queries and approaches performance of GraphX for graph algorithms using whole-stage code generation in Spark SQL. Future work includes automatically suggesting optimal views and exploiting data partitioning.
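As a flavor of the query side, a small motif search in the GraphFrames Scala API; the vertex and edge data are made up, and GraphFrames is a separate package, not bundled with Spark:

```scala
import org.apache.spark.sql.SparkSession
import org.graphframes.GraphFrame

val spark = SparkSession.builder.appName("motifs").master("local[*]").getOrCreate()
import spark.implicits._

val v = Seq(("a", "Alice"), ("b", "Bob"), ("c", "Carol")).toDF("id", "name")
val e = Seq(("a", "b"), ("b", "a"), ("b", "c")).toDF("src", "dst")

val g = GraphFrame(v, e)

// Graph pattern ("motif") query: pairs who follow each other; GraphFrames
// translates this into joins that the Spark SQL optimizer can then rewrite.
g.find("(x)-[]->(y); (y)-[]->(x)")
 .filter("x.id < y.id")
 .show()
```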
Apache Spark GraphX & GraphFrame Synthetic ID Fraud Use Case, by Mo Patel
This document summarizes a presentation about analyzing graphs using Apache Spark's GraphFrames and GraphX libraries. It begins with an introduction of the speaker and their interests. It then discusses what graphs are and provides examples of graph analytics like node scoring and community detection. It introduces GraphX and GraphFrames, how they allow working with property graphs and integrating graph operations with DataFrames. It also provides an example of how financial institutions can use graph analytics to detect synthetic identity fraud by analyzing relationships between customer addresses.
GraphX is the blue ocean for Scala engineers @ Scala Matsuri 2014, by 鉄平 土佐 (Teppei Tosa)
This document discusses GraphX, a component of Spark that allows for graph-parallel and data-parallel computation using a single API. It notes that GraphX provides an advantage for Scala engineers by allowing graph data to be handled similarly to Scala's collection API. While GraphX is still young, it represents a good solution for graph-parallel problems and an opportunity for Scala developers to work with graph-structured data.
This document provides an introduction to GraphX, which is an Apache Spark component for graphs and graph-parallel computations. It describes different types of graphs like regular graphs, directed graphs, and property graphs. It shows how to create a property graph in GraphX by defining vertex and edge RDDs. It also demonstrates various graph operators that can be used to perform operations on graphs, such as finding the number of vertices/edges, degrees, longest paths, and top vertices by degree. The goal is to introduce the basics of representing and analyzing graph data with GraphX.
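A compact version of that construction, close to the property-graph example in the GraphX programming guide (assuming a SparkContext `sc`):

```scala
import org.apache.spark.graphx.{Edge, Graph, VertexId}
import org.apache.spark.rdd.RDD

// Vertex property: (username, role); edge property: relationship type
val users: RDD[(VertexId, (String, String))] = sc.parallelize(Seq(
  (3L, ("rxin", "student")), (7L, ("jgonzal", "postdoc")),
  (5L, ("franklin", "prof")), (2L, ("istoica", "prof"))))

val relationships: RDD[Edge[String]] = sc.parallelize(Seq(
  Edge(3L, 7L, "collab"), Edge(5L, 3L, "advisor"),
  Edge(2L, 5L, "colleague"), Edge(5L, 7L, "pi")))

val graph = Graph(users, relationships)

println(graph.numVertices)   // 4
println(graph.numEdges)      // 4

// Top vertices by total degree
graph.degrees
  .top(2)(Ordering.by(_._2))
  .foreach(println)
```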
Apache Spark and the Emerging Technology Landscape for Big Data, by Paco Nathan
The document discusses Apache Spark and its role in big data and emerging technologies for big data. It provides background on MapReduce and the emergence of specialized systems. It then discusses how Spark provides a unified engine for batch processing, iterative jobs, SQL queries, streaming, and more. It can simplify programming by using a functional approach. The document also discusses Spark's architecture and performance advantages over other frameworks.
- NASA has a large database of documents and lessons learned from past programs and projects dating back to the 1950s.
- Graph databases can be used to connect related information across different topics, enabling more efficient search and pattern recognition compared to isolated data silos.
- Natural language processing techniques like named entity recognition, parsing, and keyword extraction can be applied to NASA's text data and combined with a graph database to create a knowledge graph for exploring relationships in the data.
See 2020 update: https://derwen.ai/s/h88s
SF Python Meetup, 2017-02-08
https://www.meetup.com/sfpython/events/237153246/
PyTextRank is a pure Python open source implementation of *TextRank*, based on the [Mihalcea 2004 paper](http://web.eecs.umich.edu/~mihalcea/papers/mihalcea.emnlp04.pdf) -- a graph algorithm which produces ranked keyphrases from texts. Keyphrases are generally more useful than simple keyword extraction. PyTextRank integrates `TextBlob` and `SpaCy` for NLP analysis of texts, including full parses, named entity extraction, etc. It also produces auto-summarization of texts, using an approximation algorithm, `MinHash`, for better performance at scale. Overall, the package is intended to complement machine learning approaches -- specifically deep learning used for custom search and recommendations -- by developing better feature vectors from raw texts. This package is in production use at O'Reilly Media for text analytics.
Congressional PageRank: Graph Analytics of US Congress With Neo4jWilliam Lyon
The document discusses modeling the US Congress as a graph database using Neo4j and analyzing relationships between legislators to identify influential members. It describes loading congressional data into Neo4j and querying relationships between legislators and the states they represent. Methods for identifying influential legislators include degree centrality, betweenness centrality, and PageRank computed on a bill co-sponsorship graph using both Neo4j and Apache Spark with GraphX.
Strata 2015 Data Preview: Spark, Data Visualization, YARN, and More, by Paco Nathan
Spark and Databricks component of the O'Reilly Media webcast "2015 Data Preview: Spark, Data Visualization, YARN, and More", as a preview of the 2015 Strata + Hadoop World conference in San Jose: http://www.oreilly.com/pub/e/3289
Big Graph Analytics on Neo4j with Apache Spark, by Kenny Bastani
In this talk I will introduce you to a Docker container that provides an easy way to do distributed graph processing using Apache Spark GraphX and a Neo4j graph database. You'll learn how to analyze big data graphs that are exported from Neo4j and subsequently updated with the results of a Spark GraphX analysis. The types of analysis I will be talking about are PageRank, connected components, triangle counting, and community detection.
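A sketch of what those analyses look like on the Spark side, reusing a GraphX `graph` like the ones constructed earlier; the tolerance and partition strategy are illustrative choices, not the talk's settings:

```scala
import org.apache.spark.graphx.PartitionStrategy

// Triangle counting expects a partitioned graph in canonical edge orientation
val triangles = graph
  .partitionBy(PartitionStrategy.RandomVertexCut)
  .triangleCount()
  .vertices                    // (vertexId, number of triangles it participates in)

// Connected components label each vertex with the smallest id in its component
val components = graph.connectedComponents().vertices

// PageRank, as in the first analysis the talk lists
val ranks = graph.pageRank(0.0001).vertices
```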
Database technologies have evolved to store big data, but remain largely inflexible. For complex graph data models stored in a relational database, large-scale analysis may require tedious transformations and shuffling of data.
Fast and scalable analysis of big data has become a critical competitive advantage for companies. There are open source tools like Apache Hadoop and Apache Spark that are providing opportunities for companies to solve these big data problems in a scalable way. Platforms like these have become the foundation of the big data analysis movement.
Magellan-Spark as a Geospatial Analytics Engine, by Ram Sriharsha (Spark Summit)
This document discusses geospatial analytics using Apache Spark and introduces Magellan, a library for performing geospatial queries and analysis on Spark. It provides an overview of geospatial analytics tasks, challenges with existing approaches, and how Magellan addresses these challenges by leveraging Spark SQL and Catalyst. Magellan allows querying geospatial data in formats like Shapefiles and GeoJSON, performs operations like spatial joins and filters, and supports optimizations like geohashing to improve query performance at scale. The document outlines the current status and features of Magellan and describes plans for further improvements in future versions.
The document discusses Spark GraphX and Pregel algorithms for graph processing. It introduces GraphX and how it represents graphs as RDDs. It then covers the feedback vertex set algorithm for finding cycles in a graph and how it can be parallelized in GraphX. Finally, it discusses Pregel and how it allows for large-scale distributed graph computations through message passing between vertices.
Neo4j-Databridge: Enterprise-scale ETL for Neo4j, by GraphAware
Neo4j - London User Group Meetup - 28th March, 2018
If your data ingestion requirements have grown beyond importing occasional CSV files then this talk is for you. Neo4j-Databridge from GraphAware is a comprehensive ETL tool specifically built for Neo4j. It has been designed for usability, expressive power and high performance to address the most common issues faced when importing data into Neo4j: multiple data sources and types, very large data sets, bespoke data conversions, non-tabular formats, filtering, merging and de-duplication, as well as bulk imports and incremental updates.
In this talk, we'll take a quick tour of some of the main features, loading data from Kafka, Redis, JDBC and various other data sources along the way, to understand how Neo4j-Databridge solves these problems and how it can help you import your data quickly and easily into Neo4j.
Vince Bickers is a Principal Consultant at GraphAware and the main author of Spring Data Neo4j (v4). He has been writing software and leading software development teams for over 30 years at organisations like Vodafone, Deutsche Bank, HSBC, Network Rail, UBS, VMWare, ConocoPhillips, Aviva and British Gas.
Extending Spark Graph for the Enterprise with Morpheus and Neo4j, by Databricks
Spark 3.0 introduces a new module: Spark Graph. Spark Graph adds the popular query language Cypher, its accompanying Property Graph Model and graph algorithms to the data science toolbox. Graphs have a plethora of useful applications in recommendation, fraud detection and research.
Morpheus is an open-source library that is API compatible with Spark Graph and extends its functionality by:
A Property Graph catalog to manage multiple Property Graphs and Views
Property Graph Data Sources that connect Spark Graph to Neo4j and SQL databases
Extended Cypher capabilities including multiple graph support and graph construction
Built-in support for the Neo4j Graph Algorithms library

In this talk, we will walk you through the new Spark Graph module and demonstrate how we extend it with Morpheus to help enterprise users integrate Spark Graph into their existing Spark and Neo4j installations.
We will demonstrate how to explore data in Spark, use Morpheus to transform data into a Property Graph, and then build a Graph Solution in Neo4j.
Analyzing Flight Delays with Apache Spark, DataFrames, GraphFrames, and MapR-DB, by Carol McDonald
Apache Spark GraphX made it possible to run graph algorithms within Spark; GraphFrames integrates GraphX with DataFrames and makes it possible to perform graph pattern queries without moving data to a specialized graph database.
This presentation will help you get started using Apache Spark GraphFrames graph algorithms and graph queries with the MapR-DB JSON document database.
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ..., by Jose Quesada
The machine learning libraries in Apache Spark are an impressive piece of software engineering, and are maturing rapidly. What advantages does Spark.ml offer over scikit-learn? At Data Science Retreat we've taken a real-world dataset and worked through the stages of building a predictive model -- exploration, data cleaning, feature engineering, and model fitting. Which would you use in production?
The machine learning libraries in Apache Spark are an impressive piece of software engineering, and are maturing rapidly. What advantages does Spark.ml offer over scikit-learn?
At Data Science Retreat we've taken a real-world dataset and worked through the stages of building a predictive model -- exploration, data cleaning, feature engineering, and model fitting -- in several different frameworks. We'll show what it's like to work with native Spark.ml, and compare it to scikit-learn along several dimensions: ease of use, productivity, feature set, and performance.
In some ways Spark.ml is still rather immature, but it also conveys new superpowers to those who know how to use it.
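To ground the comparison, here is a minimal Spark.ml `Pipeline` sketch in Scala covering feature engineering plus model fitting; the column names and the `training`/`test` DataFrames are placeholders, not the talk's actual dataset:

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{StringIndexer, VectorAssembler}

// Hypothetical input: DataFrames `training` and `test` with columns
// "category" (string), "amount" (double), and "label" (double)
val indexer = new StringIndexer()
  .setInputCol("category").setOutputCol("categoryIdx")

val assembler = new VectorAssembler()
  .setInputCols(Array("categoryIdx", "amount")).setOutputCol("features")

val lr = new LogisticRegression().setMaxIter(10).setRegParam(0.01)

// The Pipeline chains feature engineering and model fitting into one estimator
val pipeline = new Pipeline().setStages(Array(indexer, assembler, lr))
val model = pipeline.fit(training)
val predictions = model.transform(test)
```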
Microservices, containers, and machine learning, by Paco Nathan
http://www.oscon.com/open-source-2015/public/schedule/detail/41579
In this presentation, an open source developer community considers itself algorithmically. This shows how to surface data insights from the developer email forums for just about any Apache open source project. It leverages advanced techniques for natural language processing, machine learning, graph algorithms, time series analysis, etc. As an example, we use data from the Apache Spark email list archives to help understand its community better; however, the code can be applied to many other communities.
Exsto is an open source project that demonstrates Apache Spark workflow examples for SQL-based ETL (Spark SQL), machine learning (MLlib), and graph algorithms (GraphX). It surfaces insights about developer communities from their email forums. Natural language processing services in Python (based on NLTK, TextBlob, WordNet, etc.) get containerized and used to crawl and parse email archives. These produce JSON data sets; we then run machine learning on a Spark cluster to find out insights such as:
* What are the trending topic summaries?
* Who are the leaders in the community for various topics?
* Who discusses most frequently with whom?
This talk shows how to use cloud-based notebooks for organizing and running the analytics and visualizations. It reviews the background for how and why the graph analytics and machine learning algorithms generalize patterns within the data, based on open source implementations for two advanced approaches, Word2Vec and TextRank. The talk also illustrates best practices for leveraging functional programming for big data.
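A hypothetical Spark SQL fragment for the last of those questions, assuming the crawler's JSON records carry `sender` and `replyTo` fields; the field names and path are illustrative, not Exsto's actual schema:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("exsto-sketch").master("local[*]").getOrCreate()

// Each JSON record is one parsed message from the email archive crawl
val msgs = spark.read.json("archives/spark-user/*.json")
msgs.createOrReplaceTempView("msgs")

// Who discusses most frequently with whom: count (sender, replyTo) pairs
spark.sql("""
  SELECT sender, replyTo, COUNT(*) AS n
  FROM msgs
  WHERE replyTo IS NOT NULL
  GROUP BY sender, replyTo
  ORDER BY n DESC
""").show(10)
```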
The document discusses how data science may reinvent learning and education. It begins with background on the author's experience in data teams and teaching. It then questions what an "Uber for education" may look like and discusses definitions of learning, education, and schools. The author argues interactive notebooks like Project Jupyter and flipped classrooms can improve learning at scale compared to traditional lectures or MOOCs. Content toolchains combining Jupyter, Thebe, Atlas and Docker are proposed for authoring and sharing computational narratives and code-as-media.
Use of standards and related issues in predictive analytics, by Paco Nathan
My presentation at KDD 2016 in SF, in the "Special Session on Standards in Predictive Analytics In the Era of Big and Fast Data" morning track about PMML and PFA: http://dmg.org/kdd2016.html
How Apache Spark fits into the Big Data landscape, by Paco Nathan
http://www.meetup.com/Washington-DC-Area-Spark-Interactive/events/217858832/
2014-12-02 in Herndon, VA and sponsored by Raytheon, Tetra Concepts, and MetiStream
GalvanizeU Seattle: Eleven Almost-Truisms About Data, by Paco Nathan
https://www.meetup.com/Seattle-Data-Science/events/223445403/
Almost a dozen almost-truisms about Data that almost everyone should consider carefully as they embark on a journey into Data Science. There are a number of preconceptions about working with data at scale where the realities beg to differ. This talk estimates that number to be at least eleven, though probably much larger. At least that number has a great line from a movie. Let's consider some of the less-intuitive directions in which this field is heading, along with likely consequences and corollaries -- especially for those who are just now beginning to study the technologies, the processes, and the people involved.
Databricks Meetup @ Los Angeles Apache Spark User Group, by Paco Nathan
This document summarizes a presentation on Apache Spark and Spark Streaming. It provides an overview of Spark, describing it as an in-memory cluster computing framework. It then discusses Spark Streaming, explaining that it runs streaming computations as small batch jobs to provide low latency processing. Several use cases for Spark Streaming are presented, including from companies like Stratio, Pearson, Ooyala, and Sharethrough. The presentation concludes with a demonstration of Python Spark Streaming code.
Microservices, Containers, and Machine Learning, by Paco Nathan
Session talk for Data Day Texas 2015, showing GraphX and SparkSQL for text analytics and graph analytics of an Apache developer email list -- including an implementation of TextRank in Spark.
A New Year in Data Science: ML Unpaused, by Paco Nathan
This document summarizes Paco Nathan's presentation at Data Day Texas in 2015. Some key points:
- Paco Nathan discussed observations and trends from the past year in machine learning, data science, big data, and open source technologies.
- He argued that the definitions of data science and statistics are flawed and ignore important areas like development, visualization, and modeling real-world business problems.
- The presentation covered topics like functional programming approaches, streaming approximations, and the importance of an interdisciplinary approach combining computer science, statistics, and other fields like physics.
- Paco Nathan advocated for newer probabilistic techniques for analyzing large datasets that provide approximations using less resources compared to traditional batch processing approaches.
QCon São Paulo: Real-Time Analytics with Spark Streaming, by Paco Nathan
The document provides an overview of real-time analytics using Spark Streaming. It discusses Spark Streaming's micro-batch approach of treating streaming data as a series of small batch jobs. This allows for low-latency analysis while integrating streaming and batch processing. The document also covers Spark Streaming's fault tolerance mechanisms and provides several examples of companies like Pearson, Guavus, and Sharethrough using Spark Streaming for real-time analytics in production environments.
The document provides an overview of Apache Spark, including its history and key capabilities. It discusses how Spark was developed in 2009 at UC Berkeley and later open sourced, and how it has since become a major open-source project for big data. The document summarizes that Spark provides in-memory performance for ETL, storage, exploration, analytics and more on Hadoop clusters, and supports machine learning, graph analysis, and SQL queries.
Jupyter for Education: Beyond Gutenberg and Erasmus, by Paco Nathan
O'Reilly Learning is focusing on evolving learning experiences using Jupyter notebooks. Jupyter notebooks allow combining code, outputs, and explanations in a single document. O'Reilly is using Jupyter notebooks as a new authoring environment and is exploring features like computational narratives, code as a medium for teaching, and interactive online learning environments. The goal is to provide a better learning architecture and content workflow that leverages the capabilities of Jupyter notebooks.
This document provides an overview of Apache Spark modules including Spark SQL, GraphX, and Spark Streaming. Spark SQL allows querying structured data using SQL, GraphX provides APIs for graph processing, and Spark Streaming enables scalable stream processing. The document discusses Resilient Distributed Datasets (RDDs), SchemaRDDs, querying data with SQLContext, GraphX property graphs and algorithms, StreamingContext, and input/output operations in Spark Streaming.
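In the Spark 1.x idiom the document describes, querying structured data looks roughly like this; the case class and table name are invented, and `registerTempTable` was later superseded by temp views:

```scala
import org.apache.spark.sql.SQLContext

case class Person(name: String, age: Int)

val sqlContext = new SQLContext(sc)   // Spark 1.x style, matching the SchemaRDD era
import sqlContext.implicits._

val people = sc.parallelize(Seq(Person("alice", 29), Person("bob", 34))).toDF()
people.registerTempTable("people")

sqlContext.sql("SELECT name FROM people WHERE age > 30").show()
```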
Building a Graph of all US Businesses Using Spark Technologies, by Alexis Roos (Spark Summit)
This document discusses building a graph of U.S. businesses using Spark technologies. It describes how Radius Intelligence builds a comprehensive business graph from multiple data sources by acquiring and preparing raw data, clustering records, and constructing the graph by linking business and location vertices and attributes through techniques like connected components analysis. Key lessons learned include that GraphX scales well, graph construction and updates are easy using RDD operations, and connected components analysis is an expensive graph operation.
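A toy version of that linking step, assuming records have been keyed on a shared attribute such as a normalized phone number; all data is invented and `sc` is an existing SparkContext:

```scala
import org.apache.spark.graphx.{Edge, Graph, VertexId}
import org.apache.spark.rdd.RDD

// Hypothetical (recordId, normalizedPhone) pairs after data preparation
val records: RDD[(VertexId, String)] = sc.parallelize(Seq(
  (1L, "415-555-0100"), (2L, "415-555-0100"), (3L, "212-555-0199"), (4L, "212-555-0199")))

// Chain an edge through each group of records sharing a key
val matchEdges: RDD[Edge[Int]] = records
  .map { case (id, key) => (key, id) }
  .groupByKey()
  .flatMap { case (_, ids) =>
    val sorted = ids.toSeq.sorted
    sorted.zip(sorted.drop(1)).map { case (a, b) => Edge(a, b, 1) }
  }

// Connected components: every record in a component is treated as one business
val clusters = Graph.fromEdges(matchEdges, defaultValue = 0)
  .connectedComponents()
  .vertices   // (recordId, canonicalClusterId)
```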
Implementing Near-Realtime Datacenter Health Analytics using Model-driven Ver... (Spark Summit)
This document discusses using Spark Streaming and GraphX to perform near-realtime analytics on large distributed systems. The authors present a model-driven approach to implement Pregel-style graph processing to handle heterogeneous graphs. They were able to achieve over 100,000 messages per second on a 4 node cluster by using sufficient batch sizes. Implementation challenges included scaling graph processing across nodes, dealing with graph heterogeneity, and hidden memory costs from intermediate RDDs. Lessons learned include the importance of partitioning, testing high availability, and addressing memory sinks.
The document discusses the future of data science, including increased use of functional programming, cloud notebooks, and probabilistic modeling of large and diverse datasets from IoT devices, drones, and satellites. It also predicts data scientists will displace traditional product managers as data becomes more important for decision making. Overall, the future involves analyzing exponentially larger volumes of diverse data using scalable cloud tools and probabilistic algorithms.
ScalaTo July 2019 - No more struggles with Apache Spark workloads in production, by Chetan Khatri
Scala Toronto July 2019 event at 500px.
Pure Functional API Integration
Apache Spark Internals tuning
Performance tuning
Query execution plan optimisation
Cats Effect for switching the execution model at runtime.
Discovery / experience with Monix, Scala Future.
HKOSCon18 - Chetan Khatri - Scaling TB's of Data with Apache Spark and Scala ...
This document summarizes a presentation about scaling terabytes of data with Apache Spark and Scala. The key points are:
1) The presenter discusses how to use Apache Spark and Scala to process large scale data in a distributed manner across clusters. Spark operations like RDDs, DataFrames and Datasets are covered.
2) A case study is presented about reengineering a data processing platform for a retail business to improve performance. Changes included parallelizing jobs, tuning Spark hyperparameters, and building a fast data architecture using Spark, Kafka and data lakes.
3) Performance was improved through techniques like dynamic resource allocation in YARN, reducing memory and cores per executor to better utilize cluster resources, and processing data
Tiny Batches, in the wine: Shiny New Bits in Spark Streaming, by Paco Nathan
London Spark Meetup 2014-11-11 @Skimlinks
http://www.meetup.com/Spark-London/events/217362972/
To paraphrase the immortal crooner Don Ho: "Tiny Batches, in the wine, make me happy, make me feel fine." http://youtu.be/mlCiDEXuxxA
Apache Spark provides support for streaming use cases, such as real-time analytics on log files, by leveraging a model called discretized streams (D-Streams). These "micro batch" computations operate on small time intervals, generally from 500 milliseconds up. One major innovation of Spark Streaming is that it leverages a unified engine. In other words, the same business logic can be used across multiple use cases: streaming, but also interactive, iterative, machine learning, etc.
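The classic D-Streams word count, with the micro-batch interval set explicitly; the socket source host and port are placeholders, and `sc` is an existing SparkContext:

```scala
import org.apache.spark.streaming.{Milliseconds, StreamingContext}

// Each batch covers 500 ms of the stream; the same RDD operations
// used in batch jobs apply to every micro batch.
val ssc = new StreamingContext(sc, Milliseconds(500))

val lines = ssc.socketTextStream("localhost", 9999)
val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
counts.print()

ssc.start()
ssc.awaitTermination()
```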
This talk will compare case studies for production deployments of Spark Streaming, emerging design patterns for integration with popular complementary OSS frameworks, plus some of the more advanced features such as approximation algorithms, and take a look at what's ahead — including the new Python support for Spark Streaming that will be in the upcoming 1.2 release.
Also, let's chat a bit about the new Databricks + O'Reilly developer certification for Apache Spark…
Strata NYC 2015 - What's coming for the Spark community, by Databricks
In the last year Spark has seen substantial growth in adoption as well as the pace and scope of development. This talk will look forward and discuss both technical initiatives and the evolution of the Spark community.
On the technical side, I’ll discuss two key initiatives ahead for Spark. The first is a tighter integration of Spark’s libraries through shared primitives such as the data frame API. The second is across-the-board performance optimizations that exploit schema information embedded in Spark’s newer APIs. These initiatives are both designed to make Spark applications easier to write and faster to run.
On the community side, this talk will focus on the growing ecosystem of extensions, tools, and integrations evolving around Spark. I’ll survey popular language bindings, data sources, notebooks, visualization libraries, statistics libraries, and other community projects. Extensions will be a major point of growth in the future, and this talk will discuss how we can position the upstream project to help encourage and foster this growth.
Spark Streaming allows processing of live data streams in Spark. It integrates streaming data and batch processing within the same Spark application. Spark SQL provides a programming abstraction called DataFrames and can be used to query structured data in Spark. Structured Streaming in Spark 2.0 provides a high-level API for building streaming applications on top of Spark SQL's engine. It allows running the same queries on streaming data as on batch data and unifies streaming, interactive, and batch processing.
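A minimal Structured Streaming sketch of that unification: the `groupBy`/`count` below is the same DataFrame query you would run on a static table (socket host and port are placeholders):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("structured-sketch").master("local[*]").getOrCreate()
import spark.implicits._

// Unbounded input table: one row per line arriving on the socket
val lines = spark.readStream
  .format("socket").option("host", "localhost").option("port", 9999)
  .load()

// Identical query shape to a batch DataFrame aggregation
val wordCounts = lines.as[String]
  .flatMap(_.split(" "))
  .groupBy("value")
  .count()

wordCounts.writeStream
  .outputMode("complete")
  .format("console")
  .start()
  .awaitTermination()
```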
This document provides an agenda and summaries for a meetup on introducing DataFrames and R on Apache Spark. The agenda includes overviews of Apache Spark 1.3, DataFrames, R on Spark, and large scale machine learning on Spark. There will also be discussions on news items, contributions so far, what's new in Spark 1.3, more data source APIs, what DataFrames are, writing DataFrames, and DataFrames with RDDs and Parquet. Presentations will cover Spark components, an introduction to SparkR, and Spark machine learning experiences.
The document provides an introduction to Apache Spark and Scala. It discusses that Apache Spark is a fast and general-purpose cluster computing system that provides high-level APIs for Scala, Java, Python and R. It supports structured data processing using Spark SQL, graph processing with GraphX, and machine learning using MLlib. Scala is a modern programming language that is object-oriented, functional, and type-safe. The document then discusses Resilient Distributed Datasets (RDDs), DataFrames, and Datasets in Spark and how they provide different levels of abstraction and functionality. It also covers Spark operations and transformations, and how the Spark logical query plan is optimized into a physical execution plan.
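A small sketch contrasting the three abstractions and printing the plans Catalyst produces (the data is invented):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("abstractions").master("local[*]").getOrCreate()
import spark.implicits._

case class Sale(item: String, amount: Double)

val rdd = spark.sparkContext.parallelize(Seq(Sale("a", 1.0), Sale("b", 2.5), Sale("a", 4.0)))
val ds  = rdd.toDS()    // typed Dataset[Sale]
val df  = ds.toDF()     // untyped DataFrame (Dataset[Row])

val query = df.groupBy("item").sum("amount")

// Prints the logical plan, the Catalyst-optimized plan, and the physical plan
query.explain(true)
```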
Apache Spark - Intro to Large-scale recommendations with Apache Spark and Python, by Christian Perone
This document provides an introduction to Apache Spark and collaborative filtering. It discusses big data and the limitations of MapReduce, then introduces Apache Spark including Resilient Distributed Datasets (RDDs), transformations, actions, and DataFrames. It also covers Spark Machine Learning (ML) libraries and algorithms such as classification, regression, clustering, and collaborative filtering.
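A collaborative filtering sketch using `ALS` from spark.ml, written in Scala to keep one language across the examples here (the ratings are invented):

```scala
import org.apache.spark.ml.recommendation.ALS
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("als-sketch").master("local[*]").getOrCreate()
import spark.implicits._

// Hypothetical explicit ratings: (userId, itemId, rating)
val ratings = Seq(
  (0, 10, 4.0f), (0, 11, 1.0f), (1, 10, 5.0f), (1, 12, 3.0f), (2, 11, 2.0f)
).toDF("user", "item", "rating")

val als = new ALS()
  .setUserCol("user").setItemCol("item").setRatingCol("rating")
  .setRank(8).setMaxIter(5).setRegParam(0.1)

val model = als.fit(ratings)
model.recommendForAllUsers(2).show(false)   // top-2 item recommendations per user
```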
Unified Big Data Processing with Apache Spark (QCON 2014), by Databricks
This document discusses Apache Spark, a fast and general engine for big data processing. It describes how Spark generalizes the MapReduce model through its Resilient Distributed Datasets (RDDs) abstraction, which allows efficient sharing of data across parallel operations. This unified approach allows Spark to support multiple types of processing, like SQL queries, streaming, and machine learning, within a single framework. The document also outlines ongoing developments like Spark SQL and improved machine learning capabilities.
The document provides an agenda for a DevOps advanced class on Spark being held in June 2015. The class will cover topics such as RDD fundamentals, Spark runtime architecture, memory and persistence, Spark SQL, PySpark, and Spark Streaming. It will include labs on DevOps 101 and 102. The instructor has over 5 years of experience providing Big Data consulting and training, including over 100 classes taught.
Multiplatform Solution for Graph Datasources, by Stratio
One of the top banks in Europe needed a system that provides better performance, scales almost linearly with the increase in information to be analyzed, and allows the processes currently executed on the host to be moved to a Big Data infrastructure. Over the course of a year we built a system that provides greater agility, flexibility and simplicity for the user to view information when profiling, and that can analyze the structure of profile data. It offers a powerful way to make online queries to a graph database, integrated with Apache Spark and different graph libraries. Basically, we get all the necessary information through Cypher queries sent to a Neo4j database.
Using recent Big Data technologies like Spark DataFrames, HDFS, Stratio Intelligence and Stratio Crossdata, we have developed a solution that can obtain critical information from multiple data sources, such as text files or graph databases.
Composable Parallel Processing in Apache Spark and Weld, by Databricks
The main reason people are productive writing software is composability -- engineers can take libraries and functions written by other developers and easily combine them into a program. However, composability has taken a back seat in early parallel processing APIs. For example, composing MapReduce jobs required writing the output of every job to a file, which is both slow and error-prone. Apache Spark helped simplify cluster programming largely because it enabled efficient composition of parallel functions, leading to a large standard library and high-level APIs in various languages. In this talk, I'll explain how composability has evolved in Spark's newer APIs, and also present a new research project I'm leading at Stanford called Weld to enable much more efficient composition of software on emerging parallel hardware (multicores, GPUs, etc).
Speaker: Matei Zaharia
Big Data Everywhere Chicago: Apache Spark Plus Many Other Frameworks -- How S..., by BigDataEverywhere
Paco Nathan, Director of Community Evangelism at Databricks
Apache Spark is intended as a fast and powerful general purpose engine for processing Hadoop data. Spark supports combinations of batch processing, streaming, SQL, ML, Graph, etc., for applications written in Scala, Java, Python, Clojure, and R, among others. In this talk, I'll explore how Spark fits into the Big Data landscape. In addition, I'll describe other systems with which Spark pairs nicely, and will also explain why Spark is needed for the work ahead.
This document discusses tools for distributed data analysis including Apache Spark. It is divided into three parts:
1) An introduction to cluster computing architectures like batch processing and stream processing.
2) The Python data analysis library stack including NumPy, Matplotlib, Scikit-image, Scikit-learn, Rasterio, Fiona, Pandas, and Jupyter.
3) The Apache Spark cluster computing framework and examples of its use including contexts, HDFS, telemetry, MLlib, streaming, and deployment on AWS.
A look under the hood at Apache Spark's API and engine evolutions, by Databricks
Spark has evolved its APIs and engine over the last 6 years to combine the best aspects of previous systems like databases, MapReduce, and data frames. Its latest structured APIs like DataFrames provide a declarative interface inspired by data frames in R/Python for ease of use, along with optimizations from databases for performance and future-proofing. This unified approach allows Spark to scale massively like MapReduce while retaining flexibility.
Unified Big Data Processing with Apache Spark, by C4Media
Video and slides synchronized, mp3 and slide download available at http://bit.ly/1yNuLGF.
Matei Zaharia talks about the latest developments in Spark and shows examples of how it can combine processing algorithms to build rich data pipelines in just a few lines of code. Filmed at qconsf.com.
Matei Zaharia is an assistant professor of computer science at MIT, and CTO of Databricks, the company commercializing Apache Spark.
Building a modern Application with DataFrames, by Databricks
The document discusses a meetup about building modern applications with DataFrames in Spark. It provides an agenda for the meetup that includes an introduction to Spark and DataFrames, a discussion of the Catalyst internals, and a demo. The document also provides background on Spark, noting its open source nature and large-scale usage by many organizations.
Human in the loop: a design pattern for managing teams working with ML, by Paco Nathan
Strata CA 2018-03-08
https://conferences.oreilly.com/strata/strata-ca/public/schedule/detail/64223
Although it has long been used for use cases like simulation, training, and UX mockups, human-in-the-loop (HITL) has emerged as a key design pattern for managing teams where people and machines collaborate. One approach, active learning (a special case of semi-supervised learning), employs mostly automated processes based on machine learning models, but exceptions are referred to human experts, whose decisions help improve new iterations of the models.
Human-in-the-loop: a design pattern for managing teams that leverage ML, by Paco Nathan
Strata Singapore 2017 session talk, 2017-12-06
https://conferences.oreilly.com/strata/strata-sg/public/schedule/detail/65611
Human-in-the-loop is an approach which has been used for simulation, training, UX mockups, etc. A more recent design pattern is emerging for human-in-the-loop (HITL) as a way to manage teams working with machine learning (ML). A variant of semi-supervised learning called active learning allows for mostly automated processes based on ML, where exceptions get referred to human experts. Those human judgements in turn help improve new iterations of the ML models.
This talk reviews key case studies about active learning, plus other approaches for human-in-the-loop which are emerging among AI applications. We’ll consider some of the technical aspects — including available open source projects — as well as management perspectives for how to apply HITL:
* When is HITL indicated vs. when isn’t it applicable?
* How do HITL approaches compare/contrast with more “typical” use of Big Data?
* What’s the relationship between use of HITL and preparing an organization to leverage Deep Learning?
* Experiences training and managing a team which uses HITL at scale
* Caveats to know ahead of time:
* In what ways do the humans involved learn from the machines?
* In particular, we’ll examine use cases at O’Reilly Media where ML pipelines for categorizing content are trained by subject matter experts providing examples, based on HITL and leveraging open source [Project Jupyter](https://jupyter.org/ for implementation).
Human-in-a-loop: a design pattern for managing teams which leverage ML, by Paco Nathan
Human-in-a-loop: a design pattern for managing teams which leverage ML
Big Data Spain, 2017-11-16
https://www.bigdataspain.org/2017/talk/human-in-the-loop-a-design-pattern-for-managing-teams-which-leverage-ml
Human-in-the-loop is an approach which has been used for simulation, training, UX mockups, etc. A more recent design pattern is emerging for human-in-the-loop (HITL) as a way to manage teams working with machine learning (ML). A variant of semi-supervised learning called _active learning_ allows for mostly automated processes based on ML, where exceptions get referred to human experts. Those human judgements in turn help improve new iterations of the ML models.
This talk reviews key case studies about active learning, plus other approaches for human-in-the-loop which are emerging among AI applications. We'll consider some of the technical aspects -- including available open source projects -- as well as management perspectives for how to apply HITL:
* When is HITL indicated vs. when isn't it applicable?
* How do HITL approaches compare/contrast with more "typical" use of Big Data?
* What's the relationship between use of HITL and preparing an organization to leverage Deep Learning?
* Experiences training and managing a team which uses HITL at scale
* Caveats to know ahead of time
* In what ways do the humans involved learn from the machines?
In particular, we'll examine use cases at O'Reilly Media where ML pipelines for categorizing content are trained by subject matter experts providing examples, based on HITL and leveraging open source [Project Jupyter](https://jupyter.org/) for implementation.
Humans in a loop: Jupyter notebooks as a front-end for AI, by Paco Nathan
JupyterCon NY 2017-08-24
https://www.safaribooksonline.com/library/view/jupytercon-2017-/9781491985311/video313210.html
Paco Nathan reviews use cases where Jupyter provides a front-end to AI as the means for keeping "humans in the loop". This talk introduces *active learning* and the "human-in-the-loop" design pattern for managing how people and machines collaborate in AI workflows, including several case studies.
The talk also explores how O'Reilly Media leverages AI in Media, and in particular some of our use cases for active learning, such as disambiguation in content discovery. We're using Jupyter as a way to manage active learning ML pipelines, where the machines generally run automated until they hit an edge case and refer the judgement back to human experts. In turn, the experts train the ML pipelines purely through examples, not feature engineering, model parameters, etc.
Jupyter notebooks serve as one part configuration file, one part data sample, one part structured log, one part data visualization tool. O'Reilly has released an open source project on GitHub called `nbtransom` which builds atop `nbformat` and `pandas` for our active learning use cases.
This work anticipates upcoming work on collaborative documents in JupyterLab, based on Google Drive. In other words, where the machines and people are collaborators on shared documents.
Humans in the loop: AI in open source and industry, by Paco Nathan
Nike Tech Talk, Portland, 2017-08-10
https://niketechtalks-aug2017.splashthat.com/
O'Reilly Media gets to see the forefront of trends in artificial intelligence: what the leading teams are working on, which use cases are getting the most traction, previews of advances before they get announced on stage. Through conferences, publishing, and training programs, we've been assembling resources for anyone who wants to learn. An excellent recent example: Generative Adversarial Networks for Beginners, by Jon Bruner.
This talk covers current trends in AI, industry use cases, and recent highlights from the AI Conf series presented by O'Reilly and Intel, plus related materials from Safari learning platform, Strata Data, Data Show, and the upcoming JupyterCon.
Along with reporting, we're leveraging AI in Media. This talk dives into O'Reilly uses of deep learning -- combined with ontology, graph algorithms, probabilistic data structures, and even some evolutionary software -- to help editors and customers alike accomplish more of what they need to do.
In particular, we'll show two open source projects in Python from O'Reilly's AI team:
• pytextrank built atop spaCy, NetworkX, datasketch, providing graph algorithms for advanced NLP and text analytics
• nbtransom leveraging Project Jupyter for a human-in-the-loop design pattern approach to AI work: people and machines collaborating on content annotation
Lessons learned from 3 (going on 4) generations of Jupyter use cases at O'Reilly Media. In particular, it covers "Oriole" tutorials, which combine video with Jupyter notebooks and Docker containers, backed by services managed on a cluster by Marathon, Mesos, Redis, and Nginx.
https://conferences.oreilly.com/fluent/fl-ca/public/schedule/detail/62859
https://conferences.oreilly.com/velocity/vl-ca/public/schedule/detail/62858
O'Reilly Media has experimented with different uses of Jupyter notebooks in their publications and learning platforms. Their latest approach embeds notebooks with video narratives in online "Oriole" tutorials, allowing authors to create interactive, computable content. This new medium blends code, data, text, and video into narrated learning experiences that run in isolated Docker containers for higher engagement. Some best practices for using notebooks in teaching include focusing on concise concepts, chunking content, and alternating between text, code, and outputs to keep explanations clear and linear.
Big Data is changing abruptly, and where it is likely heading, by Paco Nathan
Big Data technologies are changing rapidly due to shifts in hardware, data types, and software frameworks. Incumbent Big Data technologies do not fully leverage newer hardware like multicore processors and large memory spaces, while newer open source projects like Spark have emerged to better utilize these resources. Containers, clouds, functional programming, databases, approximations, and notebooks represent significant trends in how Big Data is managed and analyzed at large scale.
How Apache Spark fits into the Big Data landscape, by Paco Nathan
Boulder/Denver Spark Meetup, 2014-10-02 @ Datalogix
http://www.meetup.com/Boulder-Denver-Spark-Meetup/events/207581832/
Apache Spark is intended as a general purpose engine that supports combinations of Batch, Streaming, SQL, ML, Graph, etc., for apps written in Scala, Java, Python, Clojure, R, etc.
This talk provides an introduction to Spark — how it provides so much better performance, and why — and then explores how Spark fits into the Big Data landscape — e.g., other systems with which Spark pairs nicely — and why Spark is needed for the work ahead.
Supercharge Your AI Development with Local LLMs, by Francesco Corti
In today's AI development landscape, developers face significant challenges when building applications that leverage powerful large language models (LLMs) through SaaS platforms like ChatGPT, Gemini, and others. While these services offer impressive capabilities, they come with substantial costs that can quickly escalate, especially during the development lifecycle. Additionally, the inherent latency of web-based APIs creates frustrating bottlenecks during the critical testing and iteration phases of development, slowing down innovation and frustrating developers.
This talk introduces the approach of integrating local LLMs directly into developers' environments. By bringing these models closer to where the code lives, developers can dramatically accelerate development lifecycles while maintaining complete control over model selection and configuration. This approach effectively reduces model-usage costs to zero by eliminating dependency on pay-per-use SaaS services, while opening new possibilities for comprehensive integration testing, rapid prototyping, and specialized use cases.
What's New in Web3 Development: Trends to Watch in 2025, by Lisa Ward
Emerging Web3 development trends in 2025 include AI integration, enhanced scalability, decentralized identity, and increased enterprise adoption of blockchain technologies.
MuleSoft RTF & Flex Gateway on AKS – Setup, Insights & Real-World Tips, by Patryk Bandurski
This presentation was delivered during the Warsaw MuleSoft Meetup in April 2025.
Paulina Uhman (PwC Polska) shared her hands-on experience running MuleSoft Runtime Fabric (RTF) and Flex Gateway on Azure Kubernetes Service (AKS).
The deck covers:
What happens after installation (pods, services, and artifacts demystified)
Shared responsibility model: MuleSoft vs Kubernetes
Real-world tips for configuring connectivity
Key Kubernetes commands for troubleshooting
Lessons learned from practical use cases
🎙️ Hosted by: Patryk Bandurski, MuleSoft Ambassador & Meetup Leader
💡 Presented by: Paulina Uhman, Integration Specialist @ PwC Polska
Optimize IBM i with Consulting Services Help, by Alice Gray
This deck offers a comprehensive overview of legacy system modernization, integration, and support services. It highlights key challenges businesses face with IBM i systems and presents tailored solutions such as modernization strategies, application development, and managed services. Ideal for IT leaders and enterprises relying on AS400, the deck includes real-world case studies, engagement models, and the benefits of expert consulting, useful for showcasing capabilities to clients or internal stakeholders.
DePIN = Real-World Infra + Blockchain
DePIN stands for Decentralized Physical Infrastructure Networks. It connects physical devices to Web3 using token incentives.
How does it work? Individuals contribute to infrastructure such as:
- Wireless networks (e.g., Helium)
- Storage (e.g., Filecoin)
- Sensors, compute, and energy
They earn tokens for their participation.
European Accessibility Act & Integrated Accessibility Testing, by Julia Undeutsch
Emma Dawson will guide you through two important topics in this session.
Firstly, she will prepare you for the European Accessibility Act (EAA), which comes into effect on 28 June 2025, and show you how development teams can prepare for it.
In the second part of the webinar, Emma Dawson will explore with you various integrated testing methods and tools that will help you improve accessibility during the development cycle, such as Linters, Storybook, Playwright, just to name a few.
Focus: European Accessibility Act, Integrated Testing tools and methods (e.g. Linters, Storybook, Playwright)
Target audience: Everyone, Developers, Testers
Adtran’s SDG 9000 Series brings high-performance, cloud-managed Wi-Fi 7 to homes, businesses and public spaces. Built on a unified SmartOS platform, the portfolio includes outdoor access points, ceiling-mount APs and a 10G PoE router. Intellifi and Mosaic One simplify deployment, deliver AI-driven insights and unlock powerful new revenue streams for service providers.
Multistream in SIP and NoSIP @ OpenSIPS Summit 2025, by Lorenzo Miniero
Slides for my "Multistream support in the Janus SIP and NoSIP plugins" presentation at the OpenSIPS Summit 2025 event.
They describe my efforts refactoring the Janus SIP and NoSIP plugins to allow for the gatewaying of an arbitrary number of audio/video streams per call (thus breaking the current 1-audio/1-video limitation), plus some additional considerations on what this could mean when dealing with application protocols negotiated via SIP as well.
UiPath Community Berlin: Studio Tips & Tricks and UiPath Insights, by UiPathCommunity
Join the UiPath Community Berlin (Virtual) meetup on May 27 to discover handy Studio Tips & Tricks and get introduced to UiPath Insights. Learn how to boost your development workflow, improve efficiency, and gain visibility into your automation performance.
📕 Agenda:
- Welcome & Introductions
- UiPath Studio Tips & Tricks for Efficient Development
- Best Practices for Workflow Design
- Introduction to UiPath Insights
- Creating Dashboards & Tracking KPIs (Demo)
- Q&A and Open Discussion
Perfect for developers, analysts, and automation enthusiasts!
This session streamed live on May 27, 18:00 CET.
Check out all our upcoming UiPath Community sessions at:
👉 https://community.uipath.com/events/
Join our UiPath Community Berlin chapter:
👉 https://community.uipath.com/berlin/
Big Data Analytics Quick Research Guide, by Arthur Morgan
This is a Quick Research Guide (QRG).
QRGs include the following:
- A brief, high-level overview of the QRG topic.
- A milestone timeline for the QRG topic.
- Links to various free online resource materials to provide a deeper dive into the QRG topic.
- Conclusion and a recommendation for at least two books available in the SJPL system on the QRG topic.
QRGs planned for the series:
- Artificial Intelligence QRG
- Quantum Computing QRG
- Big Data Analytics QRG
- Spacecraft Guidance, Navigation & Control QRG (coming 2026)
- UK Home Computing & The Birth of ARM QRG (coming 2027)
Any questions or comments?
- Please contact Arthur Morgan at [email protected].
100% human made.
Droidal: AI Agents Revolutionizing Healthcare, by Droidal LLC
Droidal’s AI Agents are transforming healthcare by bringing intelligence, speed, and efficiency to key areas such as Revenue Cycle Management (RCM), clinical operations, and patient engagement. Built specifically for the needs of U.S. hospitals and clinics, Droidal's solutions are designed to improve outcomes and reduce administrative burden.
Through simple visuals and clear examples, the presentation explains how AI Agents can support medical coding, streamline claims processing, manage denials, ensure compliance, and enhance communication between providers and patients. By integrating seamlessly with existing systems, these agents act as digital coworkers that deliver faster reimbursements, reduce errors, and enable teams to focus more on patient care.
Droidal's AI technology is more than just automation — it's a shift toward intelligent healthcare operations that are scalable, secure, and cost-effective. The presentation also offers insights into future developments in AI-driven healthcare, including how continuous learning and agent autonomy will redefine daily workflows.
Whether you're a healthcare administrator, a tech leader, or a provider looking for smarter solutions, this presentation offers a compelling overview of how Droidal’s AI Agents can help your organization achieve operational excellence and better patient outcomes.
A free demo trial is available for those interested in experiencing Droidal’s AI Agents firsthand. Our team will walk you through a live demo tailored to your specific workflows, helping you understand the immediate value and long-term impact of adopting AI in your healthcare environment.
To request a free trial or learn more:
https://droidal.com/
UiPath Community Zurich: Release Management and Build Pipelines, by UiPathCommunity
Ensuring robust, reliable, and repeatable delivery processes is more critical than ever - it's a success factor for your automations and for automation programmes as a whole. In this session, we’ll dive into modern best practices for release management and explore how tools like the UiPathCLI can streamline your CI/CD pipelines. Whether you’re just starting with automation or scaling enterprise-grade deployments, our event promises to deliver helpful insights to you. This topic is relevant for both on-premise and cloud users - as well as for automation developers and software testers alike.
📕 Agenda:
- Best Practices for Release Management
- What it is and why it matters
- UiPath Build Pipelines Deep Dive
- Exploring CI/CD workflows, the UiPathCLI and showcasing scenarios for both on-premise and cloud
- Discussion, Q&A
👨🏫 Speakers
Roman Tobler, CEO @ Routinuum
Johans Brink, CTO @ MvR Digital Workforce
We look forward to bringing best practices and showcasing build pipelines to you - and to having interesting discussions on this important topic!
If you have any questions or inputs prior to the event, don't hesitate to reach out to us.
This event streamed live on May 27, 16:00 CET.
Check out all our upcoming UiPath Community sessions at:
👉 https://community.uipath.com/events/
Join UiPath Community Zurich chapter:
👉 https://community.uipath.com/zurich/
Content and eLearning Standards: Finding the Best Fit for Your Training, by Rustici Software
Tammy Rutherford, Managing Director of Rustici Software, walks through the pros and cons of different standards to better understand which standard is best for your content and chosen technologies.
GraphX: Graph analytics for insights about developer communities
1. Graph Analytics in Spark
2015-06-08 • Scala Days • Amsterdam
Paco Nathan, @pacoid
Licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License
4. • generalized patterns
unified engine for many use cases
• lazy evaluation of the lineage graph
reduces wait states, better pipelining (see the sketch after this list)
• generational differences in hardware
off-heap use of large memory spaces
• functional programming / ease of use
reduction in cost to maintain large apps
• lower overhead for starting jobs
• less expensive shuffles
Spark Overview: Key Distinctions vs. MapReduce
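To make the lazy-evaluation point concrete, here is a minimal sketch (the input path is a placeholder):

// transformations only extend the lineage graph; nothing executes yet
val lines = sc.textFile("data.txt")      // placeholder path
val lengths = lines.map(_.length)
println(lengths.toDebugString)           // inspect the lineage
// an action forces evaluation of the whole pipeline
val total = lengths.reduce(_ + _)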
12. PowerGraph: Distributed Graph-Parallel Computation on Natural Graphs
J. Gonzalez, Y. Low, H. Gu, D. Bickson, C. Guestrin
graphlab.org/files/osdi2012-gonzalez-low-gu-bickson-guestrin.pdf
Pregel: Large-scale graph computing at Google
Grzegorz Czajkowski, et al.
googleresearch.blogspot.com/2009/06/large-scale-graph-computing-at-google.html
GraphX: Graph Analytics in Spark
Ankur Dave, Databricks
spark-summit.org/east-2015/talk/graphx-graph-analytics-in-spark
Topic modeling with LDA: MLlib meets GraphX
Joseph Bradley, Databricks
databricks.com/blog/2015/03/25/topic-modeling-with-lda-mllib-meets-graphx.html
GraphX: Further Reading…
13. GraphX: Compose Node + Edge RDDs into a Graph
// compose a property graph from a vertex RDD plus an edge RDD
// (ND and ED are placeholders for the node and edge attribute types)
val nodeRDD: RDD[(Long, ND)] = sc.parallelize(…)
val edgeRDD: RDD[Edge[ED]] = sc.parallelize(…)
val g: Graph[ND, ED] = Graph(nodeRDD, edgeRDD)
14. // http://spark.apache.org/docs/latest/graphx-programming-guide.html
import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD

case class Peep(name: String, age: Int)

// vertices: (id, attribute) pairs
val nodeArray = Array(
  (1L, Peep("Kim", 23)), (2L, Peep("Pat", 31)),
  (3L, Peep("Chris", 52)), (4L, Peep("Kelly", 39)),
  (5L, Peep("Leslie", 45))
)

// edges: (srcId, dstId, attribute) triples
val edgeArray = Array(
  Edge(2L, 1L, 7), Edge(2L, 4L, 2),
  Edge(3L, 2L, 4), Edge(3L, 5L, 3),
  Edge(4L, 1L, 1), Edge(5L, 3L, 9)
)

val nodeRDD: RDD[(Long, Peep)] = sc.parallelize(nodeArray)
val edgeRDD: RDD[Edge[Int]] = sc.parallelize(edgeArray)
val g: Graph[Peep, Int] = Graph(nodeRDD, edgeRDD)

// traverse the triplets, keeping only edges with weight > 7
val results = g.triplets.filter(t => t.attr > 7)

for (triplet <- results.collect) {
  println(s"${triplet.srcAttr.name} loves ${triplet.dstAttr.name}")
}
GraphX: Example – simple traversals
15. GraphX: Example – routing problems
[Figure: a directed graph with nodes 0–3; edges carry costs 4, 3, 1, 2, 1]
What is the cost to reach node 0 from any other node in the graph? This is a common use case for graph algorithms, e.g., Dijkstra.
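As a sketch of how such a routing problem looks with the GraphX Pregel API: single-source shortest paths, assuming a graph g whose edge attributes are numeric costs (running on g.reverse so distances measure the cost to reach node 0):

import org.apache.spark.graphx._

val sourceId: VertexId = 0L
// start the source at distance 0.0, all other vertices at infinity
val initialGraph = g.reverse.mapVertices((id, _) =>
  if (id == sourceId) 0.0 else Double.PositiveInfinity)

val sssp = initialGraph.pregel(Double.PositiveInfinity)(
  (id, dist, newDist) => math.min(dist, newDist),  // keep the best distance seen
  triplet => {                                     // relax each edge
    if (triplet.srcAttr + triplet.attr < triplet.dstAttr)
      Iterator((triplet.dstId, triplet.srcAttr + triplet.attr))
    else
      Iterator.empty
  },
  (a, b) => math.min(a, b)                         // merge incoming messages
)
println(sssp.vertices.collect.mkString("\n"))

This mirrors the shortest-paths example in the GraphX programming guide; Pregel iterates until no vertex receives a message.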
18. Graph Analytics: terminology
• many real-world problems are often represented as graphs
• graphs can generally be converted into sparse matrices (a bridge to linear algebra)
• eigenvectors find the stable points in a system defined by matrices – which may be more efficient to compute
• beyond simpler graphs, complex data may require work with tensors
19. Suppose we have a graph as shown below:
[Figure: an undirected graph on vertices u, v, w, x]
We call x a vertex (sometimes called a node). An edge (sometimes called an arc) is any line connecting two vertices.
Graph Analytics: example
20. We can represent this kind of graph as an adjacency matrix:
• label the rows and columns based on the vertices
• entries get a 1 if an edge connects the corresponding vertices, or 0 otherwise
Graph Analytics: representation
  u v w x
u 0 1 0 1
v 1 0 1 1
w 0 1 0 1
x 1 1 1 0
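For instance, a small sketch that derives this matrix from an edge list (vertex names and edge set transcribed from the figure):

// undirected edges from the figure above
val vs = Vector("u", "v", "w", "x")
val es = Set(("u","v"), ("u","x"), ("v","w"), ("v","x"), ("w","x"))
// entry is 1 when an edge connects the two vertices, else 0
val adj = vs.map(r => vs.map(c => if (es(r -> c) || es(c -> r)) 1 else 0))
adj.foreach(row => println(row.mkString(" ")))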
21. An adjacency matrix always has certain properties:
• it is symmetric, i.e., A = Aᵀ
• it has real eigenvalues
Therefore algebraic graph theory bridges between linear algebra and graph theory
Graph Analytics: algebraic graph theory
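To make the eigenvector point concrete, a tiny power-iteration sketch in plain Scala, approximating the dominant eigenvector of the adjacency matrix above (the stable point the slide refers to):

// the 4×4 adjacency matrix for vertices u, v, w, x
val A = Array(
  Array(0.0, 1.0, 0.0, 1.0),
  Array(1.0, 0.0, 1.0, 1.0),
  Array(0.0, 1.0, 0.0, 1.0),
  Array(1.0, 1.0, 1.0, 0.0)
)
def normalize(v: Array[Double]): Array[Double] = {
  val n = math.sqrt(v.map(x => x * x).sum)
  v.map(_ / n)
}
var v = Array.fill(4)(1.0)
for (_ <- 1 to 50) {
  // multiply A * v, then rescale to unit length
  v = normalize(A.map(row => row.zip(v).map { case (a, b) => a * b }.sum))
}
println(v.mkString(", "))  // approximate dominant eigenvector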
22. Sparse Matrix Collection… for when you really need a wide variety of sparse matrix examples, e.g., to evaluate new ML algorithms
University of Florida Sparse Matrix Collection
cise.ufl.edu/research/sparse/matrices/
Graph Analytics: beauty in sparsity
23. Algebraic Graph Theory
Norman Biggs
Cambridge (1974)
amazon.com/dp/0521458978
Graph Analysis and Visualization
Richard Brath, David Jonker
Wiley (2015)
shop.oreilly.com/product/9781118845844.do
See also examples in: Just Enough Math
Graph Analytics: resources
24. Although tensor factorization is considered a hard problem, it may provide more general-case solutions, and some work leverages Spark:
The Tensor Renaissance in Data Science
Anima Anandkumar @UC Irvine
radar.oreilly.com/2015/05/the-tensor-renaissance-in-data-science.html
Spacey Random Walks and Higher Order Markov Chains
David Gleich @Purdue
slideshare.net/dgleich/spacey-random-walks-and-higher-order-markov-chains
Graph Analytics: tensor solutions emerging
26. Data Prep: Exsto Project Overview
• insights about dev communities, via data mining their email forums
• works with any Apache project email archive
• applies NLP and ML techniques to analyze message threads
• graph analytics surface themes and interactions
• results provide feedback for communities, e.g., leaderboards
27. Data Prep: Exsto Project Overview – four links
https://github.com/ceteri/spark-exercises/tree/master/exsto/dbc
http://web.eecs.umich.edu/~mihalcea/papers/mihalcea.emnlp04.pdf
http://mail-archives.apache.org/mod_mbox/spark-user/
https://class01.cloud.databricks.com/#notebook/67011
28. Data Prep: Scraper pipeline
github.com/ceteri/spark-exercises/tree/master/exsto/dbc
29. Data Prep: Scraper pipeline
Typical data rates, e.g., for [email protected]:
• ~2K msgs/month
• ~18 MB/month parsed in JSON
Six months’ list activity represents a graph of:
• 1,882 senders
• 1,762,113 nodes
• 3,232,174 edges
A large graph?! In any case, it satisfies the definition of a graph-parallel system – lots of data locality to leverage
30. Data Prep: Microservices meet Parallel Processing
[Architecture diagram: email archives → scraper/parser service (NLTK) → Spark SQL data prep → features (unique word IDs; TextRank, Word2Vec, etc.) → explore → community insights and community leaderboards]
not so big data… relatively big compute…
31. Data Prep: Scraper pipeline
[Pipeline diagram: Apache email list archive → urllib2 crawl of monthly lists by date → Python filter of quoted content → Python segmentation into paragraphs → message JSON]
32. Data Prep: Scraper pipeline
[Same pipeline diagram as above; each crawled message lands as a JSON record:]
{
  "date": "2014-10-01T00:16:08+00:00",
  "id": "CA+B-+fyrBU1yGZAYJM_u=gnBVtzB=sXoBHkhmS-6L1n8K5Hhbw",
  "next_thread": "CALEj8eP5hpQDM=p2xryL-JT-x_VhkRcD59Q+9Qr9LJ9sYLeLVg",
  "next_url": "http://mail-archives.apache.org/mod_mbox/spark-user/201410.mbox/%3cCALEj8eP5hpQ
  "prev_thread": "",
  "sender": "Debasish Das <[email protected]>",
  "subject": "Re: memory vs data_size",
  "text": "\nOnly fit the data in memory where you want to run the iterative\nalgorithm....\n"
}
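A hedged sketch of loading these JSON records into Spark for the downstream stages (the path is hypothetical; read.json follows the Spark 1.4-era SQLContext API):

// parse the scraped message JSON into a DataFrame
val msgs = sqlContext.read.json("exsto/messages/*.json")  // hypothetical path
msgs.printSchema()
// peek at a few parsed messages
msgs.select("sender", "subject", "date").show(5)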
37. TextRank: original paper
TextRank: Bringing Order into Texts
Rada Mihalcea, Paul Tarau
Conference on Empirical Methods in Natural Language Processing (July 2004)
https://goo.gl/AJnA76
http://web.eecs.umich.edu/~mihalcea/papers.html
http://www.cse.unt.edu/~tarau/
38. TextRank: other implementations
Jeff Kubina (Perl / English):
http://search.cpan.org/~kubina/Text-Categorize-Textrank-0.51/lib/Text/Categorize/Textrank/En.pm
Paco Nathan (Hadoop / English+Spanish):
https://github.com/ceteri/textrank/
Karin Christiasen (Java / Icelandic):
https://github.com/karchr/icetextsum
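At its core, TextRank runs PageRank over a word co-occurrence graph. A toy GraphX sketch under assumed tokenization (illustrative only, not any of the implementations above):

import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD

// toy token stream; a real pipeline would tokenize, filter stopwords, etc.
val tokens = Array("spark", "graphx", "pregel", "spark", "rdd", "graphx")
val ids = tokens.distinct.zipWithIndex.map { case (w, i) => (w, i.toLong) }.toMap

val nodes: RDD[(VertexId, String)] =
  sc.parallelize(ids.toSeq.map { case (w, i) => (i, w) })

// link words that co-occur within a sliding window of two tokens
val edges: RDD[Edge[Int]] = sc.parallelize(
  tokens.sliding(2).map { case Array(a, b) => Edge(ids(a), ids(b), 1) }.toSeq)

val ranks = Graph(nodes, edges).pageRank(0.0001).vertices
ranks.join(nodes).sortBy(_._2._1, ascending = false).collect.foreach(println)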
46. Social Graph: use GraphX to run graph analytics
// run graph analytics
val g: Graph[String, Int] = Graph(nodes, edges)
val r = g.pageRank(0.0001).vertices
r.join(nodes).sortBy(_._2._1, ascending = false).foreach(println)
// define a reduce operation to compute the highest degree vertex
def max(a: (VertexId, Int), b: (VertexId, Int)): (VertexId, Int) = {
  if (a._2 > b._2) a else b
}
// compute the max degrees
val maxInDegree: (VertexId, Int) = g.inDegrees.reduce(max)
val maxOutDegree: (VertexId, Int) = g.outDegrees.reduce(max)
val maxDegrees: (VertexId, Int) = g.degrees.reduce(max)
// strongly connected components (10 iterations)
val scc = g.stronglyConnectedComponents(10).vertices
nodes.join(scc).foreach(println)
51. Spark Developer Certification
• go.databricks.com/spark-certified-developer
• defined by Spark experts @Databricks
• assessed by O’Reilly Media
• establishes the bar for Spark expertise
52. • 40 multiple-choice questions, 90 minutes
• mostly structured as choices among code blocks
• expect some Python, Java, Scala, SQL
• understand theory of operation
• identify best practices
• recognize code that is more parallel, less memory constrained
Overall, you need to write Spark apps in practice
Developer Certification: Overview
54. MOOCs:
Anthony Joseph
UC Berkeley
early June 2015
edx.org/course/uc-berkeleyx/uc-berkeleyx-cs100-1x-introduction-big-6181
Ameet Talwalkar
UCLA
late June 2015
edx.org/course/uc-berkeleyx/uc-berkeleyx-cs190-1x-scalable-machine-6066
55. Resources: Spark Packages
Looking for other libraries and features? There are a variety of third-party packages available at:
http://spark-packages.org/