Graph Algorithms Practical Examples in Apache Spark and Neo4j 1st Edition Mark Needham All Chapters Instant Download
Compliments of
Graph
Algorithms
Practical Examples in Apache Spark & Neo4j
The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. Graph Algorithms, the cover image of a
European garden spider, and related trade dress are trademarks of O’Reilly Media, Inc.
While the publisher and the authors have used good faith efforts to ensure that the information and
instructions contained in this work are accurate, the publisher and the authors disclaim all responsibility
for errors or omissions, including without limitation responsibility for damages resulting from the use of
or reliance on this work. Use of the information and instructions contained in this work is at your own
risk. If any code samples or other technology this work contains or describes is subject to open source
licenses or the intellectual property rights of others, it is your responsibility to ensure that your use
thereof complies with such licenses and/or rights.
This work is part of a collaboration between O’Reilly and Neo4j. See our statement of editorial
independence.
978-1-492-05781-9
[LSI]
Table of Contents
Preface ix
Foreword xiii
1. Introduction 1
What Are Graphs? 2
What Are Graph Analytics and Algorithms? 3
Graph Processing, Databases, Queries, and Algorithms 6
OLTP and OLAP 7
Why Should We Care About Graph Algorithms? 8
Graph Analytics Use Cases 12
Conclusion 14
Summary 28
5. Centrality Algorithms 77
Example Graph Data: The Social Graph 79
Importing the Data into Apache Spark 80
Importing the Data into Neo4j 81
Degree Centrality 81
Reach 81
When Should I Use Degree Centrality? 82
Degree Centrality with Apache Spark 83
Closeness Centrality 84
When Should I Use Closeness Centrality? 85
Closeness Centrality with Apache Spark 86
Closeness Centrality with Neo4j 88
Closeness Centrality Variation: Wasserman and Faust 89
Closeness Centrality Variation: Harmonic Centrality 91
Betweenness Centrality 92
When Should I Use Betweenness Centrality? 94
Betweenness Centrality with Neo4j 95
Betweenness Centrality Variation: Randomized-Approximate Brandes 98
PageRank 99
Influence 99
The PageRank Formula 100
Iteration, Random Surfers, and Rank Sinks 102
When Should I Use PageRank? 103
PageRank with Apache Spark 103
PageRank with Neo4j 105
PageRank Variation: Personalized PageRank 107
Summary 108
Strongly Connected Components with Neo4j 122
Connected Components 124
When Should I Use Connected Components? 124
Connected Components with Apache Spark 125
Connected Components with Neo4j 126
Label Propagation 127
Semi-Supervised Learning and Seed Labels 129
When Should I Use Label Propagation? 129
Label Propagation with Apache Spark 130
Label Propagation with Neo4j 131
Louvain Modularity 133
When Should I Use Louvain? 137
Louvain with Neo4j 138
Validating Communities 143
Summary 143
The Coauthorship Graph 193
Creating Balanced Training and Testing Datasets 194
How We Predict Missing Links 199
Creating a Machine Learning Pipeline 200
Predicting Links: Basic Graph Features 201
Predicting Links: Triangles and the Clustering Coefficient 214
Predicting Links: Community Detection 218
Summary 224
Wrapping Things Up 224
Index 231
graph algorithms are used within workflows: one for general analysis and one for
machine learning.
At the beginning of each category of algorithms, there is a reference table to help you
quickly jump to the relevant algorithm. For each algorithm, you’ll find:
This element indicates a warning or caution.
Our unique network of experts and innovators share their knowledge and expertise
through books, articles, conferences, and our online learning platform. O’Reilly’s
online learning platform gives you on-demand access to live training courses, in-
depth learning paths, interactive coding environments, and a vast collection of text
and video from O’Reilly and 200+ other publishers. For more information, please
visit http://oreilly.com.
How to Contact Us
Please address comments and questions concerning this book to the publisher:
We have a web page for this book, where we list errata, examples, and any additional
information. You can access this page at http://bit.ly/graph-algorithms.
To comment or ask technical questions about this book, send email to
bookquestions@oreilly.com.
For more information about our books, courses, conferences, and news, see our
website at http://www.oreilly.com.
Find us on Facebook: http://facebook.com/oreilly
Follow us on Twitter: http://twitter.com/oreillymedia
Watch us on YouTube: http://www.youtube.com/oreillymedia
Acknowledgments
We’ve thoroughly enjoyed putting together the material for this book and thank all
those who assisted. We’d especially like to thank Michael Hunger for his guidance, Jim
Webber for his invaluable edits, and Tomaz Bratanic for his keen research. Finally, we
greatly appreciate Yelp permitting us to use its rich dataset for powerful examples.
Foreword
What do the following things all have in common: marketing attribution analysis,
anti-money laundering (AML) analysis, customer journey modeling, safety incident
causal factor analysis, literature-based discovery, fraud network detection, internet
search node analysis, map application creation, disease cluster analysis, and analyzing
the performance of a William Shakespeare play? As you might have guessed, what
these all have in common is the use of graphs, proving that Shakespeare was right
when he declared, “All the world’s a graph!”
Okay, the Bard of Avon did not actually write graph in that sentence; he wrote stage.
However, notice that the examples listed above all involve entities and the relationships
between them, including both direct and indirect (transitive) relationships.
Entities are the nodes in the graph—these can be people, events, objects, concepts, or
places. The relationships between the nodes are the edges in the graph. Therefore,
isn’t the very essence of a Shakespearean play the active portrayal of entities (the
nodes) and their relationships (the edges)? Consequently, maybe Shakespeare could
have written graph in his famous declaration.
What makes graph algorithms and graph databases so interesting and powerful isn’t
the simple relationship between two entities, with A being related to B. After all, the
standard relational model of databases instantiated these types of relationships in its
foundation decades ago, in the entity relationship diagram (ERD). What makes
graphs so remarkably important are directional relationships and transitive relationships.
In directional relationships, A may cause B, but not the opposite. In transitive
relationships, A can be directly related to B and B can be directly related to C, while A
is not directly related to C, so that consequently A is transitively related to C.
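The distinction between direct and transitive relationships can be sketched in a few lines of Python. This is a minimal illustration rather than anything taken from the book: the three-node edge list and the helper names `build_adjacency` and `reachable_from` are assumptions made for the example.

```python
from collections import deque

# Hypothetical directed relationships: A -> B and B -> C, but no direct A -> C.
edges = [("A", "B"), ("B", "C")]


def build_adjacency(edge_list):
    """Map each node to the set of nodes it points at directly."""
    adjacency = {}
    for source, target in edge_list:
        adjacency.setdefault(source, set()).add(target)
        adjacency.setdefault(target, set())
    return adjacency


def reachable_from(adjacency, start):
    """Return every node reachable from start by following edge direction (breadth-first)."""
    seen = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in adjacency[node]:
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return seen


adjacency = build_adjacency(edges)
direct_from_a = adjacency["A"]
# Nodes A reaches only through intermediaries, never through a direct edge.
transitive_from_a = reachable_from(adjacency, "A") - direct_from_a

print("Directly related to A:", direct_from_a)
print("Transitively related to A:", transitive_from_a)
```

Because the edges are directed, nothing is reachable from C in this sketch, which mirrors the point that A may cause B but not the opposite.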
With these transitive relationships, particularly when they are numerous and
diverse, with many possible relationship/network patterns and degrees of separation
between the entities, the graph model uncovers relationships between entities that
otherwise may seem disconnected or unrelated and that go undetected by a relational
database. Hence, the graph model can be applied productively and effectively in many
network analysis use cases.
Consider this marketing attribution use case: person A sees the marketing campaign;
person A talks about it on social media; person B is connected to person A and sees
the comment; and, subsequently, person B buys the product. From the marketing
campaign manager’s perspective, the standard relational model fails to identify the
attribution, since B did not see the campaign and A did not respond to the campaign.
The campaign looks like a failure, but its actual success (and positive ROI) is discovered
by the graph analytics algorithm through the transitive relationship between the
marketing campaign and the final customer purchase, through an intermediary
(entity in the middle).
Next, consider an anti-money laundering (AML) analysis case: persons A and C are
suspected of illicit trafficking. Any interaction between the two (e.g., a financial transaction
in a financial database) would be flagged by the authorities, and heavily scrutinized.
However, if A and C never transact business together, but instead conduct
financial dealings through safe, respected, and unflagged financial authority B, what
could pick up on the transaction? The graph analytics algorithm! The graph engine
would discover the transitive relationship between A and C through intermediary B.
In internet searches, major search engines use a hyperlinked network (graph-based)
algorithm to find the central authoritative node across the entire internet for any
given set of search words. The directionality of the edge is vital in this case, since the
authoritative node in the network is the one that many other nodes point at.
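As a rough sketch of why edge direction matters here, the snippet below counts incoming links for each page and picks the one that the most other pages point at. The page names and links are invented for illustration, and plain in-degree counting is a deliberately simple stand-in: production search engines rely on far more elaborate link analysis, such as the PageRank algorithm listed under Chapter 5.

```python
from collections import Counter

# Hypothetical hyperlinks as (source page, target page) pairs.
links = [
    ("page1", "hub"),
    ("page2", "hub"),
    ("page3", "hub"),
    ("hub", "page1"),
    ("page2", "page3"),
]

# Only the target of each link counts: being pointed at is what signals authority.
in_degree = Counter(target for _, target in links)
most_pointed_at, incoming = in_degree.most_common(1)[0]

print(most_pointed_at, incoming)
```

Reversing every edge would change the answer, which is why an undirected view of the same links would lose the signal.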
With literature-based discovery (LBD), a knowledge network (graph-based) application
enabling significant discoveries across the knowledge base of thousands (or even
millions) of research journal articles, “hidden knowledge” is discovered only
through the connection between published research results that may have many
degrees of separation (transitive relationships) between them. LBD is being applied to
cancer research studies, where the massive semantic medical knowledge base of
symptoms, diagnoses, treatments, drug interactions, genetic markers, short-term
results, and long-term consequences could be “hiding” previously unknown cures or
beneficial treatments for the most impenetrable cases. The knowledge could already
be in the network, but we need to connect the dots to find it.
Similar descriptions of the power of graphing can be given for the other use cases listed
earlier, all examples of network analysis through graph algorithms. Each case
deeply involves entities (people, objects, events, actions, concepts, and places) and
their relationships (touch points, both causal and simple associations).
When considering the power of graphing, we should keep in mind that perhaps the
most powerful node in a graph model for real-world use cases might be “context.”
Context may include time, location, related events, nearby entities, and more. Incorporating
context into the graph (as nodes and as edges) can thus yield impressive predictive
analytics and prescriptive analytics capabilities.
Mark Needham and Amy Hodler’s Graph Algorithms aims to broaden our knowledge
and capabilities around these important types of graph analyses, including algorithms,
concepts, and practical machine learning applications of the algorithms.
From basic concepts to fundamental algorithms to processing platforms and practical
use cases, the authors have compiled an instructive and illustrative guide to the wonderful
world of graphs.
CHAPTER 1
Introduction
Graphs are one of the unifying themes of computer science—an abstract representation that
describes the organization of transportation systems, human interactions, and telecommunication
networks. That so many different structures can be modeled using a single formalism
is a source of great power to the educated programmer.
—The Algorithm Design Manual, by Steven S. Skiena (Springer), Distinguished Teaching
Professor of Computer Science at Stony Brook University
Today’s most pressing data challenges center around relationships, not just tabulating
discrete data. Graph technologies and analytics provide powerful tools for connected
data that are used in research, social initiatives, and business solutions such as:
What Are Graphs?
Graphs have a history dating back to 1736, when Leonhard Euler solved the “Seven
Bridges of Königsberg” problem. The problem asked whether it was possible to visit
all four areas of a city connected by seven bridges, while only crossing each bridge
once. It wasn’t.
With the insight that only the connections themselves were relevant, Euler set the
groundwork for graph theory and its mathematics. Figure 1-1 depicts Euler’s progression
with one of his original sketches, from the paper “Solutio problematis ad geometriam
situs pertinentis”.
Figure 1-1. The origins of graph theory. The city of Königsberg included two large islands
connected to each other and the two mainland portions of the city by seven bridges. The
puzzle was to create a walk through the city, crossing each bridge once and only once.
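Euler's negative answer can be checked with a short degree count. The sketch below relies on the standard result that a connected graph has a walk crossing every edge exactly once only if zero or two of its nodes have an odd number of edges. The land-mass labels are names chosen for this example, and the bridge list is the usual textbook encoding of the puzzle rather than data from this book.

```python
from collections import Counter

# The seven bridges as pairs of land masses (labels are illustrative).
bridges = [
    ("north_bank", "kneiphof"),
    ("north_bank", "kneiphof"),
    ("south_bank", "kneiphof"),
    ("south_bank", "kneiphof"),
    ("north_bank", "lomse"),
    ("south_bank", "lomse"),
    ("kneiphof", "lomse"),
]

# Count how many bridge ends touch each land mass.
degree = Counter()
for first, second in bridges:
    degree[first] += 1
    degree[second] += 1

odd_nodes = sorted(node for node, count in degree.items() if count % 2 == 1)

# Assumes the graph is connected, which holds for this bridge list.
walk_exists = len(odd_nodes) in (0, 2)

print(dict(degree))
print("Walk crossing each bridge once exists:", walk_exists)
```

All four land masses have an odd number of bridges, so no such walk exists.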
While graphs originated in mathematics, they are also a pragmatic and high-fidelity
way of modeling and analyzing data. The objects that make up a graph are called
nodes or vertices and the connections between them are known as relationships, links, or
edges. We use the terms nodes and relationships in this book: you can think of nodes
as the nouns in sentences, and relationships as verbs giving context to the nodes. To
avoid any confusion, the graphs we talk about in this book have nothing to do with
graphing equations or charts as in Figure 1-2.
Figure 1-2. A graph is a representation of a network, often illustrated with circles to represent
entities, which we call nodes, and lines to represent relationships.
Looking at the person graph in Figure 1-2, we can easily construct several sentences
which describe it. For example, person A lives with person B who owns a car, and
person A drives a car that person B owns. This modeling approach is compelling
because it maps easily to the real world and is very “whiteboard friendly.” This helps
align data modeling and analysis.
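One way to see how directly this model maps to sentences is to store each relationship as a (start node, relationship type, end node) triple. The following is a minimal sketch under that assumption. The relationship type names and the `describe` helper are invented for illustration and are not an API of Apache Spark or Neo4j.

```python
# Each relationship is a (start node, relationship type, end node) triple.
relationships = [
    ("Person A", "LIVES_WITH", "Person B"),
    ("Person B", "OWNS", "Car"),
    ("Person A", "DRIVES", "Car"),
]


def describe(triples):
    """Turn each triple into a simple noun-verb-noun sentence."""
    return [
        f"{start} {rel_type.replace('_', ' ').lower()} {end}"
        for start, rel_type, end in triples
    ]


# The nodes are simply every distinct endpoint mentioned by a relationship.
nodes = sorted({name for start, _, end in relationships for name in (start, end)})
sentences = describe(relationships)

for sentence in sentences:
    print(sentence)
```

The nodes act as the nouns and the relationship types as the verbs, so reading the triples aloud reproduces the whiteboard description.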
But modeling graphs is only half the story. We might also want to process them to
reveal insight that isn’t immediately obvious. This is the domain of graph algorithms.
Network Science
Network science is an academic field strongly rooted in graph theory that is concerned
with mathematical models of the relationships between objects. Network scientists
rely on graph algorithms and database management systems because of the size, connectedness,
and complexity of their data.
There are many fantastic resources for complexity and network science. Here are a
few references for you to explore.
Graph algorithms have widespread potential, from preventing fraud and optimizing
call routing to predicting the spread of the flu. For instance, we might want to score
particular nodes that could correspond to overload conditions in a power system. Or
we might like to discover groupings in the graph which correspond to congestion in a
transport system.
In fact, in 2010 US air travel systems experienced two serious events involving multiple
congested airports that were later studied using graph analytics. Network scientists
P. Fleurquin, J. J. Ramasco, and V. M. Eguíluz used graph algorithms to confirm
the events as part of systemic cascading delays and use this information for corrective
advice, as described in their paper, “Systemic Delay Propagation in the US Airport
Network”.
Figure 1-3, created by Martin Grandjean for his article “Connected World: Untangling
the Air Traffic Network”, visualizes the network underpinning air transportation. This
illustration clearly shows the highly connected structure of air transportation
clusters. Many transportation systems exhibit a concentrated distribution of
links with clear hub-and-spoke patterns that influence delays.
Figure 1-3. Air transportation networks illustrate hub-and-spoke structures that evolve
over multiple scales. These structures contribute to how travel flows.
Graphs also help uncover how very small interactions and dynamics lead to global
mutations. They tie together the micro and macro scales by representing exactly
which things are interacting within global structures. These associations are used to
forecast behavior and determine missing links. Figure 1-4 is a food web of grassland
species interactions; graph analysis was used to evaluate the hierarchical organization
and species interactions and then predict missing relationships, as detailed in the
paper by A. Clauset, C. Moore, and M. E. J. Newman, “Hierarchical Structure and the
Prediction of Missing Links in Networks”.
drives smarter transactions, which creates new data and opportunities for further
analysis. More recently there’s been a trend to integrate these silos for more real-time
decision making.
According to Gartner:
[HTAP] could potentially redefine the way some business processes are executed, as
real-time advanced analytics (for example, planning, forecasting and what-if analysis)
becomes an integral part of the process itself, rather than a separate activity performed
after the fact. This would enable new forms of real-time business-driven decision-
making process. Ultimately, HTAP will become a key enabling architecture for intelligent
business operations.
As OLTP and OLAP become more integrated and begin to support functionality previously
offered in only one silo, it’s no longer necessary to use different data products
or systems for these workloads—we can simplify our architecture by using the same
platform for both. This means our analytical queries can take advantage of real-time
data and we can streamline the iterative process of analysis.
tivity so apparent than in big data. The amount of information that has been brought
together, commingled, and dynamically updated is impressive. This is where graph
algorithms can help make sense of our volumes of data, with more sophisticated analytics
that leverage relationships and enhance artificial intelligence contextual information.
As our data becomes more connected, it’s increasingly important to understand its
relationships and interdependencies. Scientists who study the growth of networks
have noted that connectivity increases over time, but not uniformly. Preferential
attachment is one theory on how the dynamics of growth impact structure. This idea,
illustrated in Figure 1-6, describes the tendency of a node to link to other nodes that
already have a lot of connections.
Figure 1-6. Preferential attachment is the phenomenon where the more connected a
node is, the more likely it is to receive new links. This leads to uneven concentrations and
hubs.
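Preferential attachment can be simulated in a few lines. The sketch below is a toy growth model, assuming each new node creates exactly one link and picks its target with probability proportional to the target's current number of links. The function name `grow_network`, the network size, and the seed are arbitrary choices for this example.

```python
import random


def grow_network(node_count, seed=42):
    """Grow a network one node at a time using preferential attachment."""
    rng = random.Random(seed)
    # Start with two nodes joined by a single link.
    degree = {0: 1, 1: 1}
    # Each node appears in this list once per link it has, so a uniform
    # choice from it favors nodes that are already well connected.
    endpoints = [0, 1]
    for new_node in range(2, node_count):
        target = rng.choice(endpoints)
        degree[new_node] = 1
        degree[target] += 1
        endpoints.extend([new_node, target])
    return degree


degree = grow_network(200)
mean_degree = sum(degree.values()) / len(degree)
max_degree = max(degree.values())

print(f"mean degree: {mean_degree:.2f}, highest degree: {max_degree}")
```

Running this with different seeds tends to produce a handful of hubs whose link counts sit well above the mean, while most nodes keep only one or two links.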
In his book, Sync: How Order Emerges from Chaos in the Universe, Nature, and Daily
Life (Hachette), Steven Strogatz provides examples and explains different ways that
real-life systems self-organize. Regardless of the underlying causes, many researchers
The network analysis shown in Figure 1-7 was created by Francesco D’Orazio of Pulsar
to help predict the virality of content and inform distribution strategies. D’Orazio
found a correlation between the concentration of a community’s distribution and the
speed of diffusion of a piece of content.
This is significantly different from what an average distribution model would predict,
where most nodes would have the same number of connections. For instance, if the
World Wide Web had an average distribution of connections, all pages would have
about the same number of links coming in and going out. Average distribution models
assert that most nodes are equally connected, but many types of graphs and many
real networks exhibit concentrations. The web, in common with graphs like travel
and social networks, has a power-law distribution with a few nodes being highly connected
and most nodes being modestly connected.
Power Law
A power law (also called a scaling law) describes the relationship between two quantities
where one quantity varies as a power of another. For instance, the volume of a cube is
related to the length of its sides by a power of 3. A well-known example is the Pareto
distribution or “80/20 rule,” originally used to describe the situation where 20% of a
population controlled 80% of the wealth. We see various power laws in the natural
world and networks.
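A small numerical check makes the idea of "varying as a power" concrete. The sketch below uses the fact that a cube's volume grows as the third power of its side length and recovers that exponent as the slope between points on a log-log scale. The side lengths are arbitrary example values.

```python
import math


def cube_volume(side):
    """Volume of a cube with the given side length."""
    return side ** 3


sides = [1, 2, 4, 8]
volumes = [cube_volume(side) for side in sides]

# For a power law y = k * x**a, the slope between any two points on a
# log-log plot equals the exponent a.
exponents = [
    (math.log(volumes[i + 1]) - math.log(volumes[i]))
    / (math.log(sides[i + 1]) - math.log(sides[i]))
    for i in range(len(sides) - 1)
]

print(volumes)
print(exponents)
```

Each doubling of the side length multiplies the volume by eight, and every estimated exponent comes out at approximately 3.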
Trying to “average out” a network generally won’t work well for investigating relationships
or forecasting, because real-world networks have uneven distributions of nodes
and relationships. We can readily see in Figure 1-8 how using an average of characteristics
for data that is uneven would lead to incorrect results.
Figure 1-8. Real-world networks have uneven distributions of nodes and relationships
represented in the extreme by a power-law distribution. An average distribution assumes
most nodes have the same number of relationships and results in a random network.
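To illustrate how an average can misrepresent uneven data, consider a tiny hub-and-spoke network with one hub linked to nine other nodes. The numbers below are invented for the example, and the 0.5 tolerance used to decide whether a node is "near" the average is an arbitrary assumption.

```python
from statistics import mean, median

# Relationship counts for a 10-node hub-and-spoke network:
# one hub with nine links, and nine spokes with one link each.
degrees = [9] + [1] * 9

average = mean(degrees)
typical = median(degrees)

# Count the nodes whose number of relationships is close to the average.
nodes_near_average = sum(1 for d in degrees if abs(d - average) < 0.5)

print(f"average: {average}, median: {typical}, nodes near average: {nodes_near_average}")
```

The average of 1.8 describes neither the hub nor any spoke, so a model built on it would characterize every node in this network incorrectly.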
Because highly connected data does not adhere to an average distribution, network
scientists use graph analytics to search for and interpret structures and relationship
distributions in real-world data.
There is no network in nature that we know of that would be described by the random
network model.