Highlighting the progress in Neo4j 3.3 and 3.4, especially Neo4j Desktop, Graph Algorithms, NLP, Date-Time, Geospatial, and performance.
Also featuring the new visualization tool Neo4j Bloom.
APOC Pearls - Whirlwind Tour Through the Neo4j APOC Procedures Library - jexp
APOC has become the de-facto standard utility library for Neo4j. In this talk, I will demonstrate some of the lesser known but very useful components of APOC that will save you a lot of work. You will also learn how to combine individual functions into powerful constructs to achieve impressive feats.
This will be a fast-paced demo/live-coding talk.
Video: https://neo4j.com/graphconnect-2018/session/neo4j-utility-library-apoc-pearls
Unicorn images by TeeTurtle.com (Unstable Unicorns is a fun game & cool t-shirts)
We recently released the Neo4j graph algorithms library.
You can use these graph algorithms on your connected data to gain new insights more easily within Neo4j. You can use these graph analytics to improve results from your graph data, for example by focusing on particular communities or favoring popular entities.
We developed this library as part of our effort to make it easier to use Neo4j for a wider variety of applications. Many users expressed interest in running graph algorithms directly on Neo4j without having to employ a secondary system.
We also tuned these algorithms to be as efficient as possible with regard to resource utilization, as well as streamlined for later management and debugging.
In this session we'll look at some of these graph algorithms and the types of problems that you can use them for in your applications.
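To make this concrete, here is a minimal sketch (not from the talk) of streaming one algorithm's results from the Python driver; it assumes the pre-GDS Graph Algorithms plugin is installed, and the connection details, node label and relationship type are placeholders.

from neo4j import GraphDatabase

# Placeholder connection details for a local Neo4j instance.
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "secret"))

# algo.pageRank.stream comes from the Graph Algorithms plugin (predecessor of GDS);
# 'Page' and 'LINKS' are hypothetical label and relationship type names.
query = """
CALL algo.pageRank.stream('Page', 'LINKS', {iterations: 20, dampingFactor: 0.85})
YIELD nodeId, score
RETURN algo.asNode(nodeId).name AS page, score
ORDER BY score DESC LIMIT 10
"""

with driver.session() as session:
    for record in session.run(query):
        print(record["page"], record["score"])

driver.close()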
Making Nested Columns as First Citizen in Apache Spark SQL - Databricks
Apple Siri is the world's largest virtual assistant service, powering every iPhone, iPad, Mac, Apple TV, Apple Watch, and HomePod. We use large amounts of data to provide our users the best possible personalized experience. Our raw event data is cleaned and pre-joined into a unified dataset for our data consumers to use. To keep the rich hierarchical structure of the data, our data schemas are very deeply nested structures. In this talk, we will discuss how Spark handles nested structures in Spark 2.4, and we'll show the fundamental design issues in reading nested fields which were not well considered when Spark SQL was designed. This results in Spark SQL reading unnecessary data in many operations. Given that Siri's data is super nested and humongous, this soon becomes a bottleneck in our pipelines. Then we will talk about the various approaches we have taken to tackle this problem. By making nested columns first-class citizens in Spark SQL, we can achieve dramatic performance gains. In some of our production queries, the speed-up is 20x in wall clock time with 8x less data being read. All of our work will be open source, and some has already been merged into upstream.
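As a hedged sketch of the underlying issue (not code from the talk): Spark 2.4 has an experimental flag that enables nested schema pruning for Parquet, so selecting a single nested field no longer reads the whole struct. The path and field names below are made up.

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = (SparkSession.builder
         .appName("nested-columns")
         # Experimental in Spark 2.4: prune unused struct fields on Parquet reads.
         .config("spark.sql.optimizer.nestedSchemaPruning.enabled", "true")
         .getOrCreate())

events = spark.read.parquet("/data/events")   # hypothetical, deeply nested schema

# With pruning enabled, only event.user.id should appear in the ReadSchema of the
# FileScan node; without it, the whole event struct is read.
events.select(col("event.user.id")).explain()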
We will describe and demonstrate all the options for loading data into Neo4j and for getting it back out, all using Kettle (Pentaho Data Integration).
Among the topics covered will be:
high performance data loading
streaming data integration into Neo4j
metadata driven data extraction
automatic Kettle execution lineage and path finding using Neo4j
roadmap update
Q&A
Introducing Arc: A Common Intermediate Language for Unified Batch and Stream... - Flink Forward
Today's end-to-end data pipelines need to combine many diverse workloads such as machine learning, relational operations, stream dataflows, tensor transformations, and graphs. For each of these workload types, several frontends exist (e.g., DataFrames/SQL, Beam, Keras) based on different programming languages, as well as different runtimes (e.g., Spark, Flink, Tensorflow) that target a particular frontend and possibly a hardware architecture (e.g., GPUs). Putting all the pieces of a data pipeline together simply leads to excessive data materialisation, type conversions and hardware utilisation, as well as mismatches of processing guarantees.
Our research group at RISE and KTH in Sweden has developed Arc, an intermediate language that bridges the gap between any frontend and a dataflow runtime (e.g., Flink) through a set of fundamental building blocks for expressing data pipelines. Arc incorporates Flink- and Beam-inspired stream semantics such as windows, state and out-of-order processing, as well as concepts found in batch computation models. With Arc, we can cross-compile and optimise diverse tasks written in any programming language into a unified dataflow program. Arc programs can run efficiently on various hardware backends, as well as allowing seamless, distributed execution on dataflow runtimes. To that end, we showcase Arcon, a concept runtime built in Rust that can execute Arc programs natively, and present a minimal set of extensions to make Flink an Arc-ready runtime.
The common use cases of Spark SQL include ad hoc analysis, logical warehouse, query federation, and ETL processing. Spark SQL also powers the other Spark libraries, including Structured Streaming for stream processing, MLlib for machine learning, and GraphFrame for graph-parallel computation. To boost the speed of your Spark applications, you can perform optimization efforts on the queries before deploying them to production systems. Spark query plans and Spark UIs give you insight into the performance of your queries. This talk discloses how to read and tune the query plans for enhanced performance. It will also cover the major related features in the recent and upcoming releases of Apache Spark.
Omid: scalable and highly available transaction processing for Apache Phoenix - DataWorks Summit
Apache Phoenix is an OLTP and operational analytics engine for Hadoop. To ensure operational correctness, Phoenix requires a transaction processor that guarantees all data accesses satisfy the ACID properties. Traditionally, Apache Phoenix has used the Apache Tephra transaction processing technology. Recently, we introduced into Phoenix support for Apache Omid, an open source transaction processor for HBase that is used at Yahoo at large scale.
A single Omid instance sustains hundreds of thousands of transactions per second and provides high availability at zero cost for mainstream processing. Both Omid and Tephra are now configurable choices for the Phoenix transaction processing backend, enabled by the newly introduced Transaction Abstraction Layer (TAL) API. The integration required introducing many new features and operations to Omid and will become generally available in early 2018.
In this talk, we walk through the challenges of the project, focusing on the new use cases introduced by Phoenix and how we address them in Omid.
Speakers:
Ohad Shacham, Yahoo Research, Oath, Senior Research Scientist
James Taylor
A Deep Dive into Query Execution Engine of Spark SQL - Databricks
Spark SQL enables Spark to perform efficient and fault-tolerant relational query processing with analytics database technologies. The relational queries are compiled to executable physical plans consisting of transformations and actions on RDDs, with generated Java code. The code is compiled to Java bytecode, executed by the JVM and optimized by the JIT compiler to native machine code at runtime. This talk will take a deep dive into the Spark SQL execution engine. The talk covers pipelined execution, whole-stage code generation, UDF execution, memory management, vectorized readers, and lineage-based RDD transformations and actions.
Optimizing Spark jobs through a true understanding of Spark core. Learn: What is a partition? What is the difference between read/shuffle/write partitions? How to increase parallelism and decrease output files? Where does shuffle data go between stages? What is the "right" size for your Spark partitions and files? Why does a job slow down with only a few tasks left and never finish? Why doesn't adding nodes decrease my compute time?
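A short PySpark sketch of the knobs those questions point at (paths and numbers are illustrative, not recommendations):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partition-tuning").getOrCreate()

# Read partitions: driven by the input splits of the source files.
df = spark.read.parquet("/data/events")
print("read partitions:", df.rdd.getNumPartitions())

# Shuffle partitions: defaults to 200; tune it to your data volume and cluster size.
spark.conf.set("spark.sql.shuffle.partitions", "400")
per_country = df.groupBy("country").count()

# Write partitions: control the number of output files explicitly.
per_country.coalesce(8).write.mode("overwrite").parquet("/data/events_by_country")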
Self-Service Data Ingestion Using NiFi, StreamSets & Kafka - Guido Schmutz
Many Big Data and IoT use cases are based on combining data from multiple data sources and making it available on a Big Data platform for analysis. The data sources are often very heterogeneous, from simple files and databases to high-volume event streams from sensors (IoT devices). It's important to retrieve this data in a secure and reliable manner and integrate it with the Big Data platform so that it is available for analysis in real time (stream processing) as well as in batch (typical big data processing). In the past few years, new tools have emerged which are especially capable of handling the process of integrating data from outside, often called Data Ingestion. From an outside perspective, they are very similar to traditional Enterprise Service Bus infrastructures, which larger organizations often use to handle message-driven and service-oriented systems. But there are also important differences: they are typically easier to scale horizontally, offer a more distributed setup, are capable of handling high volumes of data/messages, provide very detailed monitoring at the message level, and integrate very well with the Hadoop ecosystem. This session will present and compare Apache Flume, Apache NiFi, StreamSets and the Kafka Ecosystem and show how they handle data ingestion in a Big Data solution architecture.
A Deep Dive into Spark SQL's Catalyst Optimizer with Yin Huai - Databricks
Catalyst is becoming one of the most important components of Apache Spark, as it underpins all the major new APIs in Spark 2.0 and later versions, from DataFrames and Datasets to Streaming. At its core, Catalyst is a general library for manipulating trees.
In this talk, Yin explores a modular compiler frontend for Spark based on this library that includes a query analyzer, optimizer, and an execution planner. Yin offers a deep dive into Spark SQL’s Catalyst optimizer, introducing the core concepts of Catalyst and demonstrating how developers can extend it. You’ll leave with a deeper understanding of how Spark analyzes, optimizes, and plans a user’s query.
In Spark SQL the physical plan provides the fundamental information about the execution of the query. The objective of this talk is to convey an understanding of and familiarity with query plans in Spark SQL, and to use that knowledge to achieve better performance of Apache Spark queries. We will walk you through the most common operators you might find in the query plan and explain relevant information that helps you understand the details of the execution. If you understand the query plan, you can look for the weak spot and try to rewrite the query to achieve a more optimal plan that leads to more efficient execution.
The main content of this talk is based on Spark source code but it will reflect some real-life queries that we run while processing data. We will show some examples of query plans and explain how to interpret them and what information can be taken from them. We will also describe what is happening under the hood when the plan is generated focusing mainly on the phase of physical planning. In general, in this talk we want to share what we have learned from both Spark source code and real-life queries that we run in our daily data processing.
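One way to start that exploration on your own is Dataset.explain(), which prints the physical plan the talk walks through; the query below is synthetic, not one of the talk's examples.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("plan-reading").getOrCreate()

users = spark.range(1_000_000).withColumnRenamed("id", "user_id")
orders = spark.range(1_000_000).selectExpr("id AS user_id", "id % 100 AS amount")

# Look for operators such as Exchange (shuffle), SortMergeJoin/BroadcastHashJoin,
# HashAggregate, and FileScan with its PushedFilters/ReadSchema details.
users.join(orders, "user_id").where("amount > 50").groupBy("amount").count().explain()

# The same plan is available in SQL via: EXPLAIN SELECT ...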
Spark Summit EU 2015: Spark DataFrames: Simple and Fast Analysis of Structure... - Databricks
A technical overview of Spark’s DataFrame API. First, we’ll review the DataFrame API and show how to create DataFrames from a variety of data sources such as Hive, RDBMS databases, or structured file formats like Avro. We’ll then give example user programs that operate on DataFrames and point out common design patterns. The second half of the talk will focus on the technical implementation of DataFrames, such as the use of Spark SQL’s Catalyst optimizer to intelligently plan user programs, and the use of fast binary data structures in Spark’s core engine to substantially improve performance and memory use for common types of operations.
Speed up UDFs with GPUs using the RAPIDS Accelerator - Databricks
The RAPIDS Accelerator for Apache Spark is a plugin that enables the power of GPUs to be leveraged in Spark DataFrame and SQL queries, improving the performance of ETL pipelines. User-defined functions (UDFs) in the query appear as opaque transforms and can prevent the RAPIDS Accelerator from processing some query operations on the GPU.
This presentation discusses how users can leverage the RAPIDS Accelerator UDF Compiler to automatically translate some simple UDFs to equivalent Catalyst operations that are processed on the GPU. The presentation also covers how users can provide a GPU version of Scala, Java, or Hive UDFs for maximum control and performance. Sample UDFs for each case will be shown along with how the query plans are impacted when the UDFs are processed on the GPU.
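For context, a hedged PySpark sketch of why opaque UDFs matter to the optimizer; note the talk's UDF compiler operates on JVM (Scala/Java) UDF bytecode, so this only illustrates the contrast between a UDF and the equivalent built-in Catalyst expression.

from pyspark.sql import SparkSession
from pyspark.sql.functions import udf, col
from pyspark.sql.types import LongType

spark = SparkSession.builder.appName("udf-contrast").getOrCreate()
df = spark.range(10).toDF("x")

# Opaque UDF: shows up as a black box in the plan and can keep work off the GPU.
add_one = udf(lambda x: x + 1, LongType())
df.select(add_one(col("x")).alias("y")).explain()

# Equivalent built-in expression: fully visible to Catalyst (and to plugins such as
# the RAPIDS Accelerator, which can then keep the projection on the GPU).
df.select((col("x") + 1).alias("y")).explain()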
This workshop will provide a hands-on introduction to Apache Spark and Apache Zeppelin in the cloud.
Format: A short introductory lecture on Apache Spark covering core modules (SQL, Streaming, MLlib, GraphX) followed by a demo, lab exercises and a Q&A session. The lecture will be followed by lab time to work through the lab exercises and ask questions.
Objective: To provide a quick and short hands-on introduction to Apache Spark. This lab will use the following Spark and Apache Hadoop components: Spark, Spark SQL, Apache Hadoop HDFS, Apache Hadoop YARN, Apache ORC, Apache Ambari, and Apache Zeppelin. You will learn how to move data into HDFS using Spark APIs, create Apache Hive tables, explore the data with Spark and Spark SQL, transform the data and then issue some SQL queries.
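A minimal PySpark sketch of that flow (file paths and table names are hypothetical, not the lab's actual dataset):

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("crash-course-lab")
         .enableHiveSupport()
         .getOrCreate())

# Move data into HDFS and register it as a Hive table stored as ORC.
trips = spark.read.option("header", "true").csv("/tmp/trips.csv")
trips.write.format("orc").mode("overwrite").saveAsTable("trips")

# Explore and transform the data with Spark SQL.
spark.sql("""
    SELECT vendor, COUNT(*) AS cnt
    FROM trips
    GROUP BY vendor
    ORDER BY cnt DESC
""").show()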
Lab pre-requisites: Registrants must bring a laptop with a Chrome or Firefox web browser installed (with proxies disabled). Alternatively, they may download and install an HDP Sandbox as long as they have at least 16GB of RAM available (Note that the sandbox is over 10GB in size so we recommend downloading it before the crash course).
Speaker: Robert Hryniewicz
A Practical Approach to Building a Streaming Processing Pipeline for an Onlin... - Databricks
Yelp's ad platform handles millions of ad requests every day. To generate ad metrics and analytics in real time, they built their ad event tracking and analysis pipeline on top of Spark Streaming. It allows Yelp to manage a large number of active ad campaigns and greatly reduce over-delivery. It also enables them to share ad metrics with advertisers in a more timely fashion.
This session will start with an overview of the entire pipeline and then focus on two specific challenges in the event consolidation part of the pipeline that Yelp had to solve. The first challenge is joining multiple data sources together to generate a single stream of ad events that feeds into various downstream systems. That involves solving several problems that are unique to real-time applications, such as windowed processing and handling of event delays. The second challenge concerns state management across code deployments and application restarts. Throughout the session, the speakers will share best practices for the design and development of large-scale Spark Streaming pipelines for production environments.
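As a hedged sketch of the first challenge (Yelp's pipeline was built on DStream-based Spark Streaming; this uses Structured Streaming for brevity, and the topic names, schema and broker address are assumptions): two ad-event streams are joined with watermarks so late click events can still be matched.

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json, expr
from pyspark.sql.types import StructType, StringType, TimestampType

spark = SparkSession.builder.appName("ad-event-consolidation").getOrCreate()

schema = StructType().add("ad_id", StringType()).add("event_time", TimestampType())

def read_topic(topic):
    raw = (spark.readStream.format("kafka")
           .option("kafka.bootstrap.servers", "broker:9092")
           .option("subscribe", topic)
           .load())
    return raw.select(from_json(col("value").cast("string"), schema).alias("e")).select("e.*")

impressions = read_topic("ad_impressions").withWatermark("event_time", "10 minutes").alias("imp")
clicks = (read_topic("ad_clicks").withColumnRenamed("event_time", "click_time")
          .withWatermark("click_time", "20 minutes").alias("clk"))

# The event-time join condition bounds how long Spark keeps state for delayed clicks.
joined = impressions.join(
    clicks,
    expr("imp.ad_id = clk.ad_id AND "
         "click_time BETWEEN event_time AND event_time + interval 30 minutes"))

query = joined.writeStream.format("console").start()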
Optimizing the Catalyst Optimizer for Complex Plans - Databricks
For more than 6 years, Workday has been building various analytics products powered by Apache Spark. At the core of each product offering, customers use our UI to create data prep pipelines, which are then compiled to DataFrames and executed by Spark under the hood. As we built out our products, however, we started to notice places where vanilla Spark is not suitable for our workloads. For example, because our Spark plans are programmatically generated, they tend to be very complex, and often result in tens of thousands of operators. Another common issue is having case statements with thousands of branches, or worse, nested expressions containing such case statements.
With the right combination of these traits, the final DataFrame can easily take Catalyst hours to compile and optimize – that is, if it doesn’t first cause the driver JVM to run out of memory.
In this talk, we discuss how we addressed some of our pain points regarding complex pipelines. Topics covered include memory-efficient plan logging, using common subexpression elimination to remove redundant subplans, rewriting Spark’s constraint propagation mechanism to avoid exponential growth of filter constraints, as well as other performance enhancements made to Catalyst rules.
We then apply these changes to several production pipelines, showcasing the reduction of time spent in Catalyst, and list out ideas for further improvements. Finally, we share tips on how you too can better handle complex Spark plans.
The Catalyst optimizer optimizes queries written in Spark SQL and the DataFrame API so that they run faster. It uses both rule-based and cost-based optimization: rule-based optimization applies rewrite rules to determine query execution, while cost-based optimization generates multiple plans and selects the most efficient one. Catalyst transforms logical plans through four phases: analysis, logical optimization, physical planning, and code generation. It represents queries as trees that can be manipulated using pattern-matching rules to optimize them.
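You can watch those phases yourself with explain(extended=True) on any query; the data below is synthetic.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("catalyst-phases").getOrCreate()

people = spark.createDataFrame([(1, "Ann", 34), (2, "Bob", 29)], ["id", "name", "age"])
people.createOrReplaceTempView("people")

# Prints the parsed and analyzed logical plans (analysis), the optimized logical plan
# (rule-based rewrites such as constant folding of 1 = 1), and the physical plan
# chosen during physical planning; code generation then compiles the selected plan.
spark.sql("SELECT name FROM people WHERE age > 30 AND 1 = 1").explain(extended=True)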
Web-Scale Graph Analytics with Apache® Spark™ - Databricks
Graph analytics has a wide range of applications, from information propagation and network flow optimization to fraud and anomaly detection. The rise of social networks and the Internet of Things has given us complex web-scale graphs with billions of vertices and edges. However, in order to extract the hidden gems within those graphs, you need tools to analyze the graphs easily and efficiently.
At Spark Summit 2016, Databricks introduced GraphFrames, which implemented graph queries and pattern matching on top of Spark SQL to simplify graph analytics. In this talk, you’ll learn about work that has made graph algorithms in GraphFrames faster and more scalable. For example, new implementations like connected components have received algorithm improvements based on recent research, as well as performance improvements from Spark DataFrames. Discover lessons learned from scaling the implementation from millions to billions of nodes; compare its performance with other popular graph libraries, and hear about real-world applications.
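A small, hedged GraphFrames example of the pieces mentioned above (assumes the graphframes package is attached to the session, e.g. via --packages; the tiny graph is made up):

from pyspark.sql import SparkSession
from graphframes import GraphFrame

spark = SparkSession.builder.appName("graphframes-demo").getOrCreate()
# connectedComponents requires a checkpoint directory.
spark.sparkContext.setCheckpointDir("/tmp/graphframes-checkpoints")

vertices = spark.createDataFrame([("a",), ("b",), ("c",), ("d",)], ["id"])
edges = spark.createDataFrame([("a", "b"), ("b", "c")], ["src", "dst"])
g = GraphFrame(vertices, edges)

# Graph algorithm on top of DataFrames.
g.connectedComponents().show()

# Pattern matching (motif finding) compiled down to Spark SQL joins.
g.find("(x)-[]->(y); (y)-[]->(z)").show()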
Mohamed Amine Abdessemed – Real-time Data Integration with Apache Flink & Kafka - Flink Forward
This document summarizes a presentation about Bouygues Telecom's use of Apache Flink for real-time data integration and processing of mobile network event logs. Bouygues Telecom processes over 4 billion logs per day from their network equipment to calculate mobile quality of experience (QoE) indicators within 60 seconds for business intelligence, diagnostics and alerting. They were previously using Hadoop for batch processing but needed a real-time solution. After evaluating Apache Spark and Flink, they chose Flink for its true streaming capabilities, backpressure handling, and high performance on limited resources. Flink helped them process a day's worth of logs in under an hour from 10 Kafka partitions across 10 TaskManagers, each with only
Extending Spark Graph for the Enterprise with Morpheus and Neo4j - Databricks
Spark 3.0 introduces a new module: Spark Graph. Spark Graph adds the popular query language Cypher, its accompanying Property Graph Model and graph algorithms to the data science toolbox. Graphs have a plethora of useful applications in recommendation, fraud detection and research.
Morpheus is an open-source library that is API compatible with Spark Graph and extends its functionality by:
A Property Graph catalog to manage multiple Property Graphs and Views
Property Graph Data Sources that connect Spark Graph to Neo4j and SQL databases
Extended Cypher capabilities including multiple graph support and graph construction
Built-in support for the Neo4j Graph Algorithms library
In this talk, we will walk you through the new Spark Graph module and demonstrate how we extend it with Morpheus, supporting enterprise users who want to integrate Spark Graph into their existing Spark and Neo4j installations.
We will demonstrate how to explore data in Spark, use Morpheus to transform data into a Property Graph, and then build a Graph Solution in Neo4j.
The Nitty Gritty of Advanced Analytics Using Apache Spark in Python - Miklos Christine
Apache Spark is the next big data processing tool for Data Scientists. As seen in a recent StackOverflow analysis, it's the hottest big data technology on their site! In this talk, I'll use the PySpark interface to leverage the speed and performance of Apache Spark. I'll focus on the end-to-end workflow for getting data into a distributed platform, and leverage Spark to process the data for advanced analytics. I'll discuss the popular Spark APIs used for data preparation, SQL analysis, and ML algorithms. I'll explain the performance differences between Scala and Python, and how Spark has bridged the gap in performance. I'll focus on PySpark as the interface to the platform, and walk through a demo to showcase the APIs.
Talk Overview:
Spark's Architecture: What's out now and what's in Spark 2.0
Spark APIs: Most common APIs used by Spark
Common misconceptions and proper techniques for using Spark
Demo:
Walk through ETL of the Reddit dataset
SparkSQL analytics and visualizations of the dataset using Matplotlib
Sentiment analysis on Reddit comments
Discuss the different ways models can be served with MLflow. We will cover both the open source MLflow and Databricks-managed MLflow ways to serve models, and the basic differences between batch scoring and real-time scoring, with special emphasis on the new upcoming Databricks production-ready model serving.
Last year we decided to build an in-house solution for Funnel analysis which should be accessible to our business users through our BI tool. The backend runs on Apache Spark, and since the BI tool can only run SQL queries, the solution is a pure Spark SQL implementation of Funnel analysis. In this talk we will cover various Spark SQL features we have used to optimize query performance and implement various filters which enable end users to get actionable insights. KEY TAKEAWAYS: a single-query approach to Funnel analysis (can be applied to any funnel-like problem); using window functions to ensure ordering of the events in the funnel; examples of higher-order functions to calculate funnel metrics.
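As a hedged sketch of the single-query, window-function idea (the event names and tiny in-memory dataset are made up; the production version described in the talk is of course richer):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("funnel").getOrCreate()

rows = [("u1", "visit", 1), ("u1", "signup", 2), ("u1", "purchase", 3),
        ("u2", "visit", 1), ("u2", "signup", 5)]
spark.createDataFrame(rows, ["user_id", "event", "ts"]).createOrReplaceTempView("events")

spark.sql("""
WITH per_user AS (
  SELECT DISTINCT user_id,
         MIN(CASE WHEN event = 'visit'    THEN ts END) OVER (PARTITION BY user_id) AS visit_ts,
         MIN(CASE WHEN event = 'signup'   THEN ts END) OVER (PARTITION BY user_id) AS signup_ts,
         MIN(CASE WHEN event = 'purchase' THEN ts END) OVER (PARTITION BY user_id) AS purchase_ts
  FROM events
)
SELECT COUNT(visit_ts)                                      AS visited,
       COUNT(CASE WHEN signup_ts   >= visit_ts  THEN 1 END) AS signed_up,
       COUNT(CASE WHEN purchase_ts >= signup_ts THEN 1 END) AS purchased
FROM per_user
""").show()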
Analyzing Blockchain Transactions in Apache Spark with Jiri Kremser - Databricks
Blockchain has become a buzzword: people are excited about distributed ledgers and cryptocurrencies, but these technologies are shrouded in myths, and misunderstanding. This talk will shed some light into how this awesome technology is actually used in practice by using Apache Spark to analyze blockchain transactions.
We’ll start with a brief introduction to blockchain transactions and how we can ETL transaction graph data obtained from the public binary format. Then we will look at how to model graph data in Spark, briefly comparing GraphFrames and GraphX. The majority of the presentation will be a live demo, running on Spark in the cloud, showing how we can run various queries on the transaction graph data, solve graph algorithms such as PageRank for identifying significant BTC addresses, observe network evolution, and more.
All of the work described in this talk is published as open source code, and all of the data and containers are publicly available for community experimentation. You will leave this talk with a better understanding of blockchain technology and graph processing in Spark, and you will have concrete tools to reproduce my research or start answering your own questions.
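A hedged sketch of the PageRank step with GraphFrames (the input path and column names are assumptions about the ETL output, not the talk's actual code):

from pyspark.sql import SparkSession
from graphframes import GraphFrame

spark = SparkSession.builder.appName("btc-pagerank").getOrCreate()

# Assumed ETL output: one row per transaction edge, sender address -> receiver address.
edges = spark.read.parquet("/data/btc/tx_edges").select("src", "dst")
vertices = edges.selectExpr("src AS id").union(edges.selectExpr("dst AS id")).distinct()

g = GraphFrame(vertices, edges)

# The highest-ranked vertices correspond to the "significant BTC addresses" in the talk.
ranks = g.pageRank(resetProbability=0.15, maxIter=10)
ranks.vertices.orderBy("pagerank", ascending=False).show(10)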
Lessons from the Field, Episode II: Applying Best Practices to Your Apache S... - Databricks
Apache Spark is an excellent tool to accelerate your analytics, whether you’re doing ETL, Machine Learning, or Data Warehousing. However, to really make the most of Spark it pays to understand best practices for data storage, file formats, and query optimization.
As a follow-up of last year’s “Lessons From The Field”, this session will review some common anti-patterns I’ve seen in the field that could introduce performance or stability issues to your Spark jobs. We’ll look at ways of better understanding your Spark jobs and identifying solutions to these anti-patterns to help you write better performing and more stable applications.
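One concrete instance of the storage and file-format advice, as a hedged sketch (paths and the partitioning column are placeholders):

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("storage-layout").getOrCreate()

events = spark.read.json("/raw/events")   # many small JSON files: slow to scan repeatedly

# Convert to a columnar format, partitioned by a commonly filtered column, and avoid
# producing thousands of tiny files per partition.
(events.repartition("event_date")
       .write.partitionBy("event_date")
       .mode("overwrite")
       .parquet("/curated/events"))

# Queries filtering on the partition column only read the matching directories;
# check the PartitionFilters entry of the FileScan node.
spark.read.parquet("/curated/events").where(col("event_date") == "2019-06-01").explain()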
Getting Started with Apache Spark on Kubernetes - Databricks
Community adoption of Kubernetes (instead of YARN) as a scheduler for Apache Spark has been accelerating since the major improvements from Spark 3.0 release. Companies choose to run Spark on Kubernetes to use a single cloud-agnostic technology across their entire stack, and to benefit from improved isolation and resource sharing for concurrent workloads. In this talk, the founders of Data Mechanics, a serverless Spark platform powered by Kubernetes, will show how to easily get started with Spark on Kubernetes.
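For orientation, a hedged sketch of the configuration involved (the master URL, namespace and image are placeholders; in practice most jobs are launched through spark-submit or an operator rather than a hand-built session):

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("spark-on-k8s")
         .master("k8s://https://kubernetes.default.svc:443")
         .config("spark.kubernetes.namespace", "spark-jobs")
         .config("spark.kubernetes.container.image", "myrepo/spark-py:3.0.1")
         .config("spark.executor.instances", "4")
         .getOrCreate())

spark.range(1_000_000).selectExpr("sum(id)").show()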
The document describes the Neo4j graph database and platform vision. It discusses key components like index-free adjacency, ACID transactions, clustering, and hardware optimizations. It outlines use cases for graph analytics, transactions, AI, and data integration. It also covers drivers, APIs, visualization, and administration tools. Finally, it previews upcoming innovations in Neo4j 3.4 like geospatial support, native string indexes, and rolling upgrades.
Complex hierarchical relationships between entities are difficult to map in a relational database, and demanding queries are usually quite slow.
Graph databases are optimized for exactly these kinds of relationships and can provide high-performance results even with huge amounts of data. Moreover, not only the entities stored in the database have attributes; their relationships do as well. Queries can look at entities as well as their relationships.
Get to know the basics of graph databases, using Neo4j as an example, and see how it is used in C# projects.
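The same idea in a few lines (shown with the Python driver for brevity; the talk itself uses C#): both the nodes and the relationship carry properties, and the query filters on both. The connection details and data model are made up.

from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "secret"))

cypher = """
MATCH (p:Person)-[r:RATED]->(m:Movie)
WHERE r.stars >= 4 AND m.released > 2000
RETURN p.name AS person, m.title AS movie, r.stars AS stars
"""

with driver.session() as session:
    for row in session.run(cypher):
        print(row["person"], "rated", row["movie"], "with", row["stars"], "stars")

driver.close()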
Extending the Stream/Table Duality into a Trinity, with Graphs (David Allen &... - confluent
The presentation discussed how graphs, streams, and tables work together using a fraud detection use case at a bank. Event data about customers, accounts, and sessions is ingested from various systems into Kafka streams. A Neo4j graph database integrated with Kafka via Neo4j Streams consumes this event stream to build a graph model of entities and their relationships. A GRANDstack application exposes this graph via GraphQL to allow fraud analysts to investigate suspicious patterns and accounts flagged by graph algorithms, and update the graph based on their adjudications.
William Lyon presented on Neo4j 3.0 which introduces a new storage engine allowing unlimited graph size, new language drivers for easier application development, and improved operability for deploying Neo4j in the cloud, containers, and on premises. Key features include the new Bolt binary protocol, Java stored procedures, and an upgraded Cypher query engine with a new cost-based optimizer.
This document provides an overview of the Neo4j graph database platform. It discusses how Neo4j differs from traditional databases by efficiently storing and querying connected data. It outlines the key components of the Neo4j platform, including graph transactions, the Cypher query language, and driver APIs. The document also reviews recent improvements to Neo4j's administration capabilities, multi-cluster support, visualization tools, and graph algorithms library.
Continuum Analytics provides the Anaconda platform for data science. It includes popular Python data science packages like NumPy, SciPy, Pandas, Scikit-learn, and the Jupyter notebook. Continuum was founded by Travis Oliphant, creator of NumPy and Numba, to support the open source Python data science community and make it easier to do data analytics and visualization using Python. The Anaconda platform has over 2 million users and makes it simple to install and work with Python and related packages for data science and machine learning.
Morpheus - SQL and Cypher in Apache Spark - Henning Kropp
Morpheus allows querying graphs stored in Apache Spark using the Cypher query language. It represents property graphs as compositions of DataFrames and supports operations like importing/exporting data between Spark graphs and Neo4j graphs. Morpheus also provides a catalog for managing multiple named graphs from different data sources and allows constructing new graphs using graph views and queries across multiple input graphs.
Morpheus SQL and Cypher® in Apache® Spark - Big Data Meetup Munich - Martin Junghanns
Extending Apache Spark Graph for the Enterprise with Morpheus and Neo4j
The talk covers:
* Neo4j, Property Graph Model and Cypher
* Cypher query execution in Apache Spark
* Neo4j graph algorithms
* Example Code
Neo4j GraphTalk Oslo - Building Intelligent Solutions with Graphs - Neo4j
Neo4j provides professional services to help customers build intelligent solutions using graphs. These services include packaged services, staff augmentation, training, and managed services. Neo4j helps customers at every stage from requirements gathering to production support. Solution accelerators like frameworks and reference architectures help speed delivery. Customers see benefits like reduced complexity, data storage needs, and infrastructure costs when moving from traditional databases to Neo4j.
5th in the AskTOM Office Hours series on graph database technologies. https://devgym.oracle.com/pls/apex/dg/office_hours/3084
PGQL: A Query Language for Graphs
Learn how to query graphs using PGQL, an expressive and intuitive graph query language that's a lot like SQL. With PGQL, it's easy to get going writing graph analysis queries to the database in a very short time. Albert and Oskar show what you can do with PGQL, and how to write and execute PGQL code.
Neo4j Database Overview document discusses:
1. Key components and ingredients of Neo4j including index-free adjacency and ACID foundation.
2. How Neo4j fits into the larger data ecosystem and common integration patterns.
3. Latest innovations in Neo4j 3.3 including performance improvements, security enhancements, and developer productivity features.
Neo4j GraphDay Seattle - Sept19 - In the Enterprise - Neo4j
The document discusses Neo4j's graph database platform and features. It highlights Neo4j's native graph processing capabilities, Cypher query language, and enterprise editions that provide high availability, causal clustering, and multi-data center support. The document also discusses Neo4j's performance advantages over relational and other NoSQL databases for connected data through its index-free adjacency and in-memory architecture.
Neo4j: What's Under the Hood & How Knowing This Can Help You - Neo4j
Neo4j provides a concise summary of how graph databases have evolved and their advantages over traditional databases. Specifically, graph databases can handle billions of connections between data points and enable queries that can traverse thousands of relationships between nodes, providing answers in milliseconds rather than minutes. This level of connected data insight allows for real-time fraud detection, recommendations, knowledge graphs, and other applications that require understanding relationships in large, dynamic datasets.
A Hands-on Intro to Data Science and R Presentation.ppt - Sanket Shikhar
Using popular data science tools such as Python and R, the book offers many examples of real-life applications, with practice ranging from small to big data.
Neo4j Morpheus: Interweaving Table and Graph Data with SQL and Cypher in Apac... - Databricks
This document discusses property graphs and how they are represented and queried using Morpheus, a graph query engine for Apache Spark.
Morpheus allows querying property graphs using Cypher and represents property graphs using DataFrames, with node and relationship data stored in tables. It integrates with various data sources and supports federated queries across multiple property graphs. The document provides examples of loading property graph data from sources like JSON, SQL databases and Neo4j, creating graph projections, running analytical queries, and recommending businesses based on graph algorithms.
The document describes a data science project conducted on streaming log data from Cloudera Movies, an online streaming video service. The goals of the project were to understand which user accounts are used most by younger viewers, segment user sessions to improve site usability, and build a recommendation engine. Key steps included exploring and cleaning the data, classifying users as children or adults using a SimRank approach, clustering user sessions to identify behavior patterns, and predicting user ratings through user-user and item-item similarity models to build a recommendation system. Accuracy of 99.64% was achieved in classifying users.
The document discusses building a custom data source for Grafana using Vert.x to retrieve time series data stored in MongoDB. It describes how to connect Grafana and MongoDB through a three-tier architecture using a Vert.x microservice. The microservice would handle splitting requests, querying the database in parallel, aggregating results, and performing additional processing before returning data to Grafana. Vert.x is well-suited for this due to its asynchronous, reactive, and scalable nature. Sample code is available on GitHub to demonstrate the approach.
Neo4j GraphDay Seattle - Sept19 - Connected Data Imperative - Neo4j
The document outlines an agenda for a Neo4j Graph Day event including sessions on connected data, graphs and artificial intelligence, a lunch break, Neo4j training, and a reception. Key topics include Neo4j in production environments, its role in boosting artificial intelligence, and training opportunities.
Looming Marvelous - Virtual Threads in Java Javaland.pdf - jexp
Nowadays we have 2 options for concurrency in Java:
* simple, synchronous, blocking code with limited scalability that can be followed linearly at runtime, or
* complex, asynchronous libraries with high scalability that are harder to handle.
Project Loom aims to bring together the best aspects of these two approaches and make them available to developers.
In the talk, I'll briefly cover the history and challenges of concurrency in Java before we dive into Loom's approach and look at some of its implementation behind the scenes. Managing so many threads sensibly needs some structure; for this there are proposals for "Structured Concurrency", which we will also look at. Some examples and comparisons to put Loom to the test will round off the talk.
Project Loom is included in Java 19 and 20 as a preview feature, so it can already be tested to see how well it works with our applications and libraries.
Spoiler: Pretty good.
Easing the daily grind with the awesome JDK command line tools - jexp
Included in the JDK installation are a lot of handy tools for Java developers, from java, jshell and jcmd to jfr and jdeprscan. These allow you to analyze a running JVM, generate JREs, run Java source code and much more. In this talk I would like to present a number of these tools with practical examples and thus expand the toolbox of the participants. With the command line tools, many tasks can be automated and executed more efficiently, leaving more time for the exciting things in developer life.
Today, we have 2 options for concurrency in Java:
Simple, synchronous, blocking code with limited scalability that can be followed linearly at runtime, or
complex, asynchronous libraries with high scalability, which are harder to handle
Project Loom aims to bring together the best aspects of these two approaches and make them available to developers.
In the talk, I'll briefly discuss the history and challenges of concurrency in Java before we dive into Loom's approaches and look a bit behind the scenes.
Project Loom has been included since Java 19 as a preview feature, so it can already be tested to see how well it works with our applications and libraries. Spoiler: Pretty good.
GraphConnect 2022 - Top 10 Cypher Tuning Tips & Tricks.pptx - jexp
I was there when Cypher was invented in 2012 and have been using it ever since. The language is extremely powerful and easy to learn. But to truly master it, you need to understand how it works internally and how the database executes your queries. In this session, you'll learn to look behind the scenes at execution plans with PROFILE and EXPLAIN and which specific clauses, expressions, structures, and operations help you minimize Cypher and database operations. After this talk, you should be able to speed up your Cypher statements quite a bit.
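A tiny illustration of that workflow (Python driver, hypothetical movie-graph data): EXPLAIN shows the planner's estimates without running the statement, while PROFILE executes it and reports db hits per operator.

from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "secret"))

with driver.session() as session:
    result = session.run(
        "PROFILE MATCH (p:Person {name: $name})-[:ACTED_IN]->(m) RETURN m.title",
        name="Tom Hanks")
    summary = result.consume()   # the profiled operator tree comes back in the result summary
    print(summary.profile)       # inspect rows and db hits per operator

driver.close()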
The newly released Neo4j Connector for Apache Spark can be used to read and write data between the two systems.
In this demo I show how to use the investigative data from the FinCEN files to get a full pipeline up and running.
Notebook is in https://github.com/jexp/fincen
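For reference, a hedged sketch of reading Neo4j nodes into a Spark DataFrame with the connector (the format string and option names follow the connector's documentation; the URL, credentials, label and column name are placeholders rather than the notebook's actual code):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("fincen-graph").getOrCreate()

filings = (spark.read
    .format("org.neo4j.spark.DataSource")
    .option("url", "bolt://localhost:7687")
    .option("authentication.basic.username", "neo4j")
    .option("authentication.basic.password", "secret")
    .option("labels", "Filing")
    .load())

filings.groupBy("originator_bank_country").count().orderBy("count", ascending=False).show()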
How Graphs Help Investigative Journalists to Connect the Dots - jexp
Investigative journalists use graphs and graph databases like Neo4j to connect disparate pieces of data and uncover hidden relationships. The Panama Papers investigation involved loading over 2.6 TB of leaked data into Neo4j to allow over 370 journalists from 80 countries to collaborate and find connections between entities, addresses, intermediaries and officers. Visualizing the data in Neo4j helped journalists tell the full story and have a global impact, exposing offshore dealings of world leaders and others.
Who doesn't know the office hero who sat in the office late into the evening and fixed production? That another colleague may have been sitting on the sofa at home and contributed just as much to that success is unfortunately not valued as highly in most company cultures. But why is that? Because we are not used to working from home? Because we assume people are less productive at home? Because there is family, a garden or other distractions at home? Michael has worked for distributed companies for a long time, but has also spent many years in offices. He will take you on his journey through different working environments and tell you what worked well for him.
This document provides a high-level summary of GraalVM and its capabilities for running applications and languages on the Java Virtual Machine. Specifically, it discusses how GraalVM allows running JavaScript, Python, Ruby, R, Java and C/C++ efficiently on the JVM through projects like Truffle and Substrate. It also summarizes GraalVM's polyglot capabilities for interoperability between languages and ahead-of-time compilation of Java into native binaries.
Neo4j Graph Streaming Services with Apache Kafkajexp
This document discusses Neo4j Streams, which enables real-time streaming of Neo4j database changes to Apache Kafka. It includes a change data capture plugin that streams transaction events from Neo4j to Kafka, a sink plugin that ingests data from Kafka into Neo4j based on custom rules, and procedures to consume and produce data directly from Cypher. The presenters demonstrate how Neo4j Streams can be used to build real-time data pipelines and streaming applications integrated with Neo4j. They encourage attendees to try the integration and provide feedback.
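As a minimal sketch of the Cypher-facing side (assuming the streams.publish and streams.consume procedures shipped with the Neo4j Streams plugin, a hypothetical 'tweets' topic, and illustrative field names):
// Publish a payload from Cypher to a Kafka topic
CALL streams.publish('tweets', {user: 'alice', text: 'hello graph'});
// Poll the same topic and ingest events into the graph
CALL streams.consume('tweets', {timeout: 5000}) YIELD event
MERGE (u:User {name: event.data.user});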
How Graph Databases efficiently store, manage and query connected data at s...jexp
Graph Databases try to make it easy for developers to leverage huge amounts of connected information for everything from routing to recommendations. Doing that poses a number of challenges on the implementation side. In this talk we want to look at the different storage, query and consistency approaches that are used behind the scenes. We’ll check out current and future solutions used in Neo4j and other graph databases for addressing global consistency, query and storage optimization, indexing and more and see which papers and research database developers take inspirations from.
This document discusses refactoring and summarizes the key points from Martin Fowler's book Refactoring. It covers what refactoring is, when it should be used by recognizing code smells, and how it should be done through small, incremental changes backed by thorough testing. The benefits of refactoring include improving code quality by reducing bugs and technical debt, while making the code easier to understand and modify. Tools now make refactoring easier by providing code analysis, refactoring suggestions, and quick fixes.
GraphQL - The new "Lingua Franca" for API-Developmentjexp
Three years ago, with the release of the GraphQL specification, Facebook took a fresh stab at the topic of "API design between remote services and applications." The key aspects of GraphQL provide a common, schema-based, domain-specific language and flexible, dynamic queries at interface boundaries.
In the talk, I'd like to compare GraphQL and REST and showcase benefits for developers and architects using a concrete example in application and API development, data source and system integration.
This document provides an overview of GraphDB and Neo4j. It discusses why graphs are useful for modeling connected data and common use cases. It also summarizes Neo4j's transactional graph database capabilities, performance advantages, and deployment options. Key topics covered include causal clustering, query planning, and driver and tooling support for developers.
Despite the “Graph” in the name, GraphQL is mostly used to query relational databases, object models or APIs. But it is really easy to support GraphQL endpoints from graph databases too. In this talk, I’ll demonstrate how we implemented a GraphQL extension for the Neo4j graph database. It uses the GraphQL schema definition to map arbitrary GraphQL queries into single graph queries and runs them against the data in the graph database. Using directives in the schema, we added some cool features that are transparent to the end user, like computed fields and auto-generated mutations and query types. That allows you to create GraphQL APIs of some complexity without writing a single line of code.
I will show how to use the Neo4j-GraphQL extension, by creating an endpoint for the Game of Thrones dataset, and how we then can use our well-known tools (GraphiQL, apollo-client, graphql-cli, voyager) to interact with it.
Despite the “Graph” in the name, GraphQL is mostly used to query relational databases or object models. But it is really well suited to querying graph databases too. In this talk, I’ll demonstrate how I implemented a GraphQL endpoint for the Neo4j graph database and how you would use it in your app.
The world around us is full of connected information. Neo4j was originally developed to solve two complex "network" problems in a document management system, as it was too hard to manage rich connection information efficiently in traditional and new "NOSQL" databases. During this meetup, we will talk about the technology, and about the journey that a couple of technologists from Malmö took. You will learn how Neo Technology grew from just the three founders into a global database company with use-cases in every domain imaginable, and how focusing on customer and community feedback allows us to provide a solution for managing connected data to everyone, not just the large internet companies.
Of course we will also introduce the graph model, its whiteboard friendliness and how you get started with Neo4j and its easy and powerful query language Cypher. We'll also compare the graph and relational data models to see how they differ in shape and capabilities. Finally we discuss the foundations that enable graph databases to provide higher join performance, faster development processes and more inclusive software for all stakeholders. With use-cases from gaming, dating and finance we'll see how to apply graph capabilities to these domains to realize new functionality or opportunities that were not possible before.
Finally, if there's a question you've always wanted to ask/discuss, we'll have plenty of time for that at the end of Michael's presentation.
This document provides an overview of graph databases and Neo4j. It begins with an introduction to graph databases and their advantages over relational databases for modeling connected data. Examples of real-world use cases that are well-suited for graph databases are given. The document then describes the core components of the graph data model including nodes, relationships, properties, and labels. It provides examples of how to model data as a graph and query graphs using Cypher, the query language for Neo4j. The document concludes by discussing Neo4j as an example of a graph database and its key features and capabilities.
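As a small illustration of that model and of Cypher (the labels, relationship types and data here are made up for the example):
// Two nodes with a typed, directed relationship carrying a property
CREATE (a:Person {name: 'Alice'})-[:KNOWS {since: 2015}]->(b:Person {name: 'Bob'});
// Friends-of-friends query: follow KNOWS two hops out from Alice
MATCH (alice:Person {name: 'Alice'})-[:KNOWS]->()-[:KNOWS]->(fof)
RETURN DISTINCT fof.name;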
Each file or class of a project's source code represents a tree (AST). Looking at dependencies to other classes beyond inheritance, however, creates a graph: field types and method parameters are also implicit dependencies. Storing this information in a graph database like Neo4j allows for interesting queries and insights. Class-Graph provides that and is available as an open-source GitHub project.
In this talk, Michael Hunger is going to shed some light on the new High Availability architecture for the popular Neo4j Graph Database. We are going to look at the different variants of the Paxos protocol, master failover strategies and cluster management state handling. This piece of infrastructure poses non-trivial challenges to distributed consensus-finding, making this an interesting session for anyone into scalable systems.
Graphs are everywhere. From websites adding social capabilities to Telcos providing personalized customer services, to innovative bioinformatics research, organizations are adopting graph databases as the best way to model and query connected data. If you can whiteboard, you can model your domain in a graph database.
In this session Emil Eifrem provides a close look at the graph model and offers best use cases for effective, cost-efficient data storage and accessibility.
Take Aways: Understand the model of a graph database and how it compares to document and relational databases. Understand why graph databases are best suited for the storage, mapping and querying of connected data.
Emil's presentation will be followed by a Hands-on Guide to Spring Data Neo4j. Spring Data Neo4j provides straightforward object persistence into the Neo4j graph database. Conceived by Rod Johnson and Neo Technology CEO Emil Eifrem, it is the founding project of the Spring Data effort. The library leverages a tight integration with the Spring Framework and the Spring Data infrastructure. Besides the easy to use object graph mapping it offers the powerful graph manipulation and query capabilities of Neo4j with a convenient API.
The talk introduces the different aspects of Spring Data Neo4j and shows applications in several example domains.
During the session we walk through the creation of an engaging sample application, starting with the setup and annotating the domain objects. We see the usage of Neo4jTemplate and the powerful repository abstraction. After deploying the application to a cloud PaaS we execute some interesting query use-cases on the collected data.
How can one start with crypto wallet development.pptxlaravinson24
This presentation is a beginner-friendly guide to developing a crypto wallet from scratch. It covers essential concepts such as wallet types, blockchain integration, key management, and security best practices. Ideal for developers and tech enthusiasts looking to enter the world of Web3 and decentralized finance.
Not So Common Memory Leaks in Java WebinarTier1 app
This SlideShare presentation is from our May webinar, “Not So Common Memory Leaks & How to Fix Them?”, where we explored lesser-known memory leak patterns in Java applications. Unlike typical leaks, subtle issues such as thread local misuse, inner class references, uncached collections, and misbehaving frameworks often go undetected and gradually degrade performance. This deck provides in-depth insights into identifying these hidden leaks using advanced heap analysis and profiling techniques, along with real-world case studies and practical solutions. Ideal for developers and performance engineers aiming to deepen their understanding of Java memory management and improve application stability.
🌱 Green Grafana 🌱 Essentials_ Data, Visualizations and Plugins.pdfImma Valls Bernaus
Ready to harness the power of Grafana for your HackUPC project? This session provides a rapid introduction to the core concepts you need to get started. We'll cover Grafana fundamentals and guide you through the initial steps of building both compelling dashboards and your very first Grafana app. Equip yourself with the essential tools to visualize your data and bring your innovative ideas to life!
Why Orangescrum Is a Game Changer for Construction Companies in 2025Orangescrum
Orangescrum revolutionizes construction project management in 2025 with real-time collaboration, resource planning, task tracking, and workflow automation, boosting efficiency, transparency, and on-time project delivery.
Exceptional Behaviors: How Frequently Are They Tested? (AST 2025)Andre Hora
Exceptions allow developers to handle error cases expected to occur infrequently. Ideally, good test suites should test both normal and exceptional behaviors to catch more bugs and avoid regressions. While current research analyzes exceptions that propagate to tests, it does not explore other exceptions that do not reach the tests. In this paper, we provide an empirical study to explore how frequently exceptional behaviors are tested in real-world systems. We consider both exceptions that propagate to tests and the ones that do not reach the tests. For this purpose, we run an instrumented version of test suites, monitor their execution, and collect information about the exceptions raised at runtime. We analyze the test suites of 25 Python systems, covering 5,372 executed methods, 17.9M calls, and 1.4M raised exceptions. We find that 21.4% of the executed methods do raise exceptions at runtime. In methods that raise exceptions, on the median, 1 in 10 calls exercise exceptional behaviors. Close to 80% of the methods that raise exceptions do so infrequently, but about 20% raise exceptions more frequently. Finally, we provide implications for researchers and practitioners. We suggest developing novel tools to support exercising exceptional behaviors and refactoring expensive try/except blocks. We also call attention to the fact that exception-raising behaviors are not necessarily “abnormal” or rare.
Societal challenges of AI: biases, multilinguism and sustainabilityJordi Cabot
Towards a fairer, inclusive and sustainable AI that works for everybody.
Reviewing the state of the art on these challenges and what we're doing at LIST to test current LLMs and help you select the one that works best for you
Mastering Fluent Bit: Ultimate Guide to Integrating Telemetry Pipelines with ...Eric D. Schabell
It's time you stopped letting your telemetry data pressure your budgets and get in the way of solving issues with agility! No more I say! Take back control of your telemetry data as we guide you through the open source project Fluent Bit. Learn how to manage your telemetry data from source to destination using the pipeline phases covering collection, parsing, aggregation, transformation, and forwarding from any source to any destination. Buckle up for a fun ride as you learn by exploring how telemetry pipelines work, how to set up your first pipeline, and exploring several common use cases that Fluent Bit helps solve. All this backed by a self-paced, hands-on workshop that attendees can pursue at home after this session (https://o11y-workshops.gitlab.io/workshop-fluentbit).
Discover why Wi-Fi 7 is set to transform wireless networking and how Router Architects is leading the way with next-gen router designs built for speed, reliability, and innovation.
Landscape of Requirements Engineering for/by AI through Literature ReviewHironori Washizaki
Hironori Washizaki, "Landscape of Requirements Engineering for/by AI through Literature Review," RAISE 2025: Workshop on Requirements engineering for AI-powered SoftwarE, 2025.
Scaling GraphRAG: Efficient Knowledge Retrieval for Enterprise AIdanshalev
If we were building a GenAI stack today, we'd start with one question: Can your retrieval system handle multi-hop logic?
Trick question, because most can't. They treat retrieval as nearest-neighbor search.
Today, we discussed scaling #GraphRAG at AWS DevOps Day, and the takeaway is clear: VectorRAG is naive, lacks domain awareness, and can’t handle full dataset retrieval.
GraphRAG builds a knowledge graph from source documents, allowing for a deeper understanding of the data + higher accuracy.
6. Review: The Neo4j Graph Platform, Fall 2017
Platform areas: Development & Administration, Analytics Tooling, Graph Analytics, Graph Transactions, Data Integration, Discovery & Visualization, Drivers & APIs, AI
Neo4j Database 3.3
• 50% faster writes
• Real-time transactions and traversal applications
Neo4j Desktop, the developers’ mission control console
• Free, registered local license of Enterprise Edition
• APOC library installer
• Algorithm library installer
Data Integration
• Neo4j ETL reveals RDBMS hidden relationships upon importing to graph
• Data Importer for fast data ingestion
Graph Analytics
• Graph Algorithms support Community Detection, Centrality and Path Finding
• Cypher for Apache Spark from openCypher.org supports graph composition (sub-graphs) and algorithm chaining
Discovery & Visualization
• Integration with popular visualization vendors
• Neo4j Browser and custom visualizations allow graph exploration
Bolt with GraphQL and more
• Secure, Causal Clustering
• High-speed analytic processing
• On-prem, Docker & cloud delivery
7. Neo4j 3.3 Performance Improvements
Areas covered: Neo4j Admin & Config, Storage & Indexing, Memory Management, Kernel & Transactions, Cypher Engine, Drivers & Bolt Protocol
• “Least Connected” load balancing
• Faster & more memory efficient runtime
• Batch generation of IDs
• Schema operations now take local locks
• Page cache metadata moved off heap
• Native GB+ Tree numeric indexes
• Bulk importer paging & memory improvements
• Dynamically reload config settings without restarting
10. Neo4j Desktop 1.0
• Mission control for developers
• Connect to both local and remote Neo4j servers
• Free with registration
• Includes development license for Neo4j Enterprise Edition
• Graph Apps
• Keeps you up to date with latest versions, plugins, etc.
• https://neo4j.com/download
11. Data Science Algorithms
Path Finding: finds the optimal path or evaluates route availability and quality
• Single Source Shortest Path
• All-Nodes SSP
• Parallel paths
Community Detection: evaluates how a group is clustered or partitioned
• Label Propagation
• Union Find
• Strongly Connected Components
• Louvain
• Triangle-Count
Centrality: determines the importance of distinct nodes in the network
• PageRank
• Betweenness
• Closeness
• Degree
12. Project Goals
• high performance graph algorithms
• user friendly (procedures)
• support graph projections
• augment OLTP with OLAP
• integrate efficiently with live Neo4j database (read, write, Cypher)
• common Graph API to write your own
17. Usage
1. Call as Cypher procedure
2. Pass in specification (Label, Prop, Query) and configuration
3. ~.stream variant returns (a lot of) results
CALL algo.<name>.stream('Label','TYPE',{conf})
YIELD nodeId, score
4. non-stream variant writes results to graph and returns statistics
CALL algo.<name>('Label','TYPE',{conf});
5. Cypher projection: pass in Cypher for node and relationship lists
CALL algo.<name>(
  'MATCH ... RETURN id(n)',
  'MATCH (n)-->(m) RETURN id(n), id(m)',
  { graph:'cypher' }
)
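To make this concrete, a minimal sketch of a stream call (assuming a graph with :Page nodes connected by LINKS relationships; the label, relationship type and property names are illustrative):
// Stream PageRank scores without writing anything back to the graph
CALL algo.pageRank.stream('Page', 'LINKS', {iterations: 20, dampingFactor: 0.85})
YIELD nodeId, score
// Resolve the internal node id back to a node with plain Cypher
MATCH (p) WHERE id(p) = nodeId
RETURN p.name AS page, score
ORDER BY score DESC LIMIT 10;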
18. Architecture
1. Load data in parallel from Neo4j
2. Store it in efficient data structures
3. Run the graph algorithm in parallel using the Graph API
4. Write data back to Neo4j in parallel
29. Trump is Dooming the World one Tweet(storm) at a Time
● Import Trump Twitter Archive
● Extract Hashtags and Mentions
● Some Analytics Queries
● Use NLP (Caution) for Entities and Sentiment
● Some more queries
https://medium.com/@david.allen_3172/using-nlp-in-neo4j-ac40bc92196f
30. Importing Data
WITH ['...archive.com/data/realdonaldtrump/2018.json', ...] as urls
UNWIND urls AS url
CALL apoc.load.json(url) YIELD value as t
CREATE (tweet:Tweet {
  id_str: t.id_str, text: t.text,
  created_at: t.created_at, retweets: t.retweet_count,
  favorites: t.favorite_count, retweet: t.is_retweet,
  reply: t.in_reply_to_user_id_str, source: t.source
}) RETURN count(t);
https://medium.com/@david.allen_3172/using-nlp-in-neo4j-ac40bc92196f
31. Importing Data - Hashtags
MATCH (t:Tweet) WHERE t.text CONTAINS '#'
WITH t, apoc.text.regexGroups(t.text, "#([\\w_]+)") as matches
UNWIND matches as match
WITH t, match[1] as hashtag
MERGE (h:Tag { name: toUpper(hashtag) })
ON CREATE SET h.text = hashtag
MERGE (h)<-[:TAGGED]-(t)
RETURN count(h);
https://medium.com/@david.allen_3172/using-nlp-in-neo4j-ac40bc92196f
32. Importing Data - Mentions
MATCH (t:Tweet) WHERE t.text CONTAINS '@'
WITH t, apoc.text.regexGroups(t.text, "@([\\w_]+)") as matches
UNWIND matches as match
WITH t, match[1] as mention
MERGE (u:User { name: toLower(mention) })
ON CREATE SET u.text = mention
MERGE (u)<-[:MENTIONED]-(t)
RETURN count(u);
https://medium.com/@david.allen_3172/using-nlp-in-neo4j-ac40bc92196f
33. Query Data – Tags / Mentions
MATCH (tag:Tag)<-[:TAGGED]-(tw)
RETURN tag.name, count(*) as freq
ORDER BY freq desc LIMIT 10;
MATCH (u:User)<-[:MENTIONED]-(tw)
RETURN u.name, count(*) as freq
ORDER BY freq desc LIMIT 10;
https://medium.com/@david.allen_3172/using-nlp-in-neo4j-ac40bc92196f
44. DateTime
• property storage: local date and time, date & time with timezones
• durations
• indexed
• range scans: $before < event.time < $after
• lots of possible datetime formats including weeks, quarters
• ordering
• Events, Time Tracking, History, Auditing, ...
https://neo4j.com/docs/developer-manual/current/cypher/syntax/temporal/
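As a minimal sketch of the indexed range scan mentioned above (the :Event label, the time property and the parameter names are assumptions for the example):
// A schema index on :Event(time) lets the planner turn the range predicate into an index seek
CREATE INDEX ON :Event(time);
MATCH (e:Event)
WHERE $before < e.time AND e.time < $after
RETURN e ORDER BY e.time;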
45. DateTime Types & Functions
Type            Supports date   Supports time   Supports timezone
Date            x               -               -
Time            -               x               x
LocalTime       -               x               -
DateTime        x               x               x
LocalDateTime   x               x               -
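A small, illustrative Cypher snippet showing each temporal type being constructed (values are arbitrary):
RETURN date('2018-05-07')                    AS d,    // date only
       time('15:37:20+02:00')                AS t,    // time with timezone
       localtime('15:37:20')                 AS lt,   // time only
       datetime('2018-05-07T15:37:20+02:00') AS dt,   // date, time and timezone
       localdatetime('2018-05-07T15:37:20')  AS ldt;  // date and time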
46. DateTime Types & Functions
date() – current date, using the statement clock
time.transaction() – current time, using the transaction clock
localtime.statement() – current local time, using the statement clock
datetime.realtime() – current date and time, using the realtime clock; this datetime instance will have the default timezone of the database
datetime.realtime('Europe/Berlin') – current date and time, using the realtime clock, in the Europe/Berlin timezone
"db.temporal.timezone" – Neo4j setting to configure the default timezone, taking a String, e.g. "Europe/Berlin"
50. DateTime Types & Functions - Duration
a = localdatetime("2018-01-01T00:00")
b = localdatetime("2018-02-02T01:00")
duration.between(a, b) -> (1M, 1D, 3600s, 0ns)
duration.inMonths(a, b) -> (1M, 0D, 0s, 0ns)
duration.inDays(a, b) -> (0M, 32D, 0s, 0ns)
duration.inSeconds(a, b) -> (0M, 0D, 2768400s, 0ns)
// Maths
instant + duration
instant – duration
duration * number
duration / number
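For instance, a small illustration of this arithmetic (values picked arbitrarily):
RETURN date('2018-01-01') + duration('P1M15D')          AS shifted,       // 2018-02-16
       datetime('2018-01-01T00:00') - duration('PT12H') AS halfDayBack,
       duration('P1D') * 3                              AS threeDays,
       duration('PT3H') / 2                             AS ninetyMinutes;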
51. DateTime Types & Functions - toString
toString prints all temporal types in a format that can be parsed back:
Date: 2018-05-07
Time: 15:37:20.05+02:00
LocalTime: 15:37:20.05
DateTime: 2018-05-07T15:37:20.05+02:00[Europe/Stockholm]
LocalDateTime: 2018-05-07T15:37:20.05
Duration: P12Y5M14DT16H13M10.01S
52. DateTime Types & Functions - Pitfalls
• Don’t compare Durations. Add them to the same instant and compare the results instead.
  date + dur < date + dur2
• Don’t subtract instants. Use duration.between.
  duration.between(date1, date2)
• Keep predicates simple to leverage indexing.
  MATCH (n), (m) WHERE datetime({date: n.date, time: m.time}) = dt   // avoid: the computed value cannot use an index
  MATCH (n), (m) WHERE n.date = date(dt) AND m.time = time(dt)       // prefer: simple per-property predicates
53. Support for Geospatial Search
medium.com/neo4j/whats-new-in-neo4j-spatial-features-586d69cda8d0
55. Geospatial Graph Queries – Index
Neo4j 3.2 introduced the GBPTree:
• Lock-free writes (high insert performance)
• Good read performance (comparable)
Neo4j 3.3 introduced the NativeSchemaNumberIndex:
• And the FusionSchemaIndex
• Allows multiple types to exist in one logical index
Neo4j 3.4 Spatial Index / Date Time Index
• Separate indexes per coordinate system
• Hilbert Space Filling Curves
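A minimal sketch of a spatial-index-backed query in Neo4j 3.4 (the :Place label, the location property and the coordinates are assumptions for the example):
// An index on :Place(location) backs point() / distance() predicates
CREATE INDEX ON :Place(location);
MATCH (p:Place)
WHERE distance(p.location, point({latitude: 52.52, longitude: 13.405})) < 1000
RETURN p.name AS place;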
66. Multi-Clustering Support for Global Internet Apps
Horizontally partition the graph by domain (country, product, customer, data center), e.g. separate sa, uk, us_east and hk clusters
• Multi-tenancy
• Geo Partitioning
• Write Scaling
• Driver Support
70. Rolling Upgrades
Upgrade to new versions with zero downtime (e.g. cluster members rolling from 3.4 to 4.0)
Store upgrades may require downtime but can be done subsequently
Auto Cache Reheating for Restarts, Restores, and Cluster Expansion
71. 3.4 Features By Edition
• Date / Time data types: Community ■, Enterprise ■
• 3D Geospatial data types: Community ■, Enterprise ■
Performance Improvements
• Native String Indexes – up to 5x faster writes: Community ■, Enterprise ■
• 2x faster backups: Enterprise ■
• Improved Cypher runtime: Community Fast, Enterprise Faster
• 100B+ object bulk importer: Community ■, Enterprise Resumable
Enterprise Scaling & Administration (Enterprise Edition only)
• Multi-Clustering (partitioning of clusters)
• Automatic cache warming
• Rolling upgrades
• Resumable copy/restore cluster member
• New diagnostic metrics and support tools
• Security: Property blacklisting by user/role
72. The Neo4j Graph Platform, Summer 2018
Platform areas: Development & Administration, Analytics Tooling, Graph Analytics, Graph Transactions, Data Integration, Discovery & Visualization, Drivers & APIs, AI
Neo4j Database 3.4
• 70% faster Cypher
• Native String Indexes (up to 5x faster writes)
• 100B+ bulk importer
Improved Admin Experience
• Rolling upgrades
• 2X faster backups
• Cache Warming on startup
• Improved diagnostics
Morpheus for Apache Spark
• Graph analytics in the data lake
• In-memory Spark graphs from Apache Hadoop, Hive, Gremlin and Spark
• Save graphs into Neo4j
• High-speed data exchange between Neo4j & data lake
• Progressive analysis using named graphs
Graph Data Science
• High speed graph algorithms
Neo4j Bloom
• New graph illustration and communication tool for non-technical users
• Explore and edit graph
• Search-based
• Create storyboards
• Foundation for graph data discovery
• Integrated with graph platform
Multi-Cluster routing built into Bolt drivers
• Date/Time data type
• 3-D Geospatial search
• Secure, Horizontal Multi-Clustering
• Property Blacklisting Security
76. Neo4j Bloom Features
• Prompted Search
• Property Browser & editor
• Category icons and color scheme
• Pan, Zoom & Select
77. Advancing the Platform
Native graph architecture extends scale, use cases and performance
• Neo4j Database 3.4 (shipping May 2018)
New products for new users
• Neo4j Bloom visualization & storyboard tool for business (shipping June 2018)