Unit I Graph Theory and concepts
Graphs - Graph Structures - Types of Graphs - Machine Learning on Graphs - Background and Traditional
Approaches: Graph Statistics and Kernel Methods - Neighborhood Overlap Detection - Graph.
Graph-Introduction
• Graphs are a versatile and widely used data structure for representing complex
systems.
• Graph Components:
• Nodes: Represent objects in the system.
• Edges: Represent interactions or relationships between pairs of objects.
• Applications of Graphs:
• Social Networks: Nodes represent individuals, and edges represent friendships.
• Biological Systems: Nodes can represent proteins, while edges represent biological
interactions, such as kinetic interactions between proteins.
Example
• The graph represents friendship relationships between members of a karate club studied by Wayne W.
Zachary from 1970 to 1972.
• An edge connects two individuals (nodes) if they socialized outside of the club.
• During the study, the club split into two factions, centered around nodes 0 and 33.
• Zachary correctly predicted the faction membership of individuals based on the graph's structure.
• This network is a well-known example in social network analysis and graph theory, demonstrating the
power of graph structures in predicting group behavior.
• Focuses on relationships between points rather than individual properties.
• Highly general: can represent diverse systems like social networks, drug-protein
interactions, molecular structures, and telecommunications networks.
• Provide a mathematical foundation to analyze, understand, and learn from
complex real-world systems.
• In the last 25 years, there has been a significant increase in the quantity and
quality of graph-structured data.
• Examples: Social networks
• Scientific initiatives (e.g., interactome, food webs)
• Molecular databases
• Billions of interconnected web-enabled devices
• The main challenge is unlocking the potential of this massive graph data.
• Machine learning offers powerful tools to address the scale and complexity of
modern graph datasets.
• While machine learning is not the only method, it is crucial for advancing the
analysis and understanding of graph data.
What is a graph
• Definition of a Graph:
• A graph G = (V, E) consists of
• A set of nodes V
• A set of edges E between these nodes
• An edge between nodes u and v is represented as (u,v) ∈ E
• Simple Graphs:
• At most one edge exists between any pair of nodes.
• No edges between a node and itself.
• All edges are undirected.
• Adjacency Matrix:
• A graph can be represented as an adjacency matrix A of size |V| × |V|.
• Matrix entries: A[u, v] = 1 if (u, v) ∈ E, and A[u, v] = 0 otherwise; for an undirected graph, A is symmetric.
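The adjacency-matrix representation can be sketched in pure Python; the edge list below is a made-up example, not taken from the text:

```python
# A minimal sketch of building an adjacency matrix for a simple,
# undirected graph. The edge list below is illustrative only.
def adjacency_matrix(num_nodes, edges):
    """Return a |V| x |V| matrix with A[u][v] = 1 iff (u, v) is an edge."""
    A = [[0] * num_nodes for _ in range(num_nodes)]
    for u, v in edges:
        A[u][v] = 1
        A[v][u] = 1  # undirected graph: the matrix is symmetric
    return A

edges = [(0, 1), (0, 2), (1, 2), (2, 3)]
A = adjacency_matrix(4, edges)
```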
• Task Setup: The goal is to predict missing edges (E ∖ E_train) in a graph based
on the observed nodes (V) and a partial set of edges (E_train).
• Task Complexity: Simple graphs (e.g., social networks) can use heuristics like shared
neighbors for predictions, while multi-relational graphs (e.g., biomedical knowledge
graphs) require advanced reasoning and inference methods due to their complexity.
• Graph-Specific Challenges: Relation prediction blurs traditional machine learning
boundaries (supervised vs. unsupervised) and often involves graph-specific inductive
biases. Variants include working with single fixed graphs or making predictions across
multiple disjoint graphs
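The shared-neighbors heuristic mentioned above can be sketched as a simple baseline; the edge list is a hypothetical example:

```python
# A sketch of the shared-neighbors baseline for relation prediction:
# score a candidate edge (u, v) by how many neighbors u and v share.
# The edge list is a made-up example.
def neighbors(edges, u):
    """Collect the neighbors of u from an undirected edge list."""
    nbrs = set()
    for a, b in edges:
        if a == u:
            nbrs.add(b)
        elif b == u:
            nbrs.add(a)
    return nbrs

def common_neighbor_score(edges, u, v):
    """Number of shared neighbors; a higher score suggests a likelier edge."""
    return len(neighbors(edges, u) & neighbors(edges, v))

edges = [(0, 1), (0, 2), (1, 2), (1, 3), (2, 3)]
score = common_neighbor_score(edges, 0, 3)  # nodes 1 and 2 are shared
```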
3. Clustering and community detection
Eigenvector Equation: λe = Ae, where λ is a constant.
• The centrality vector e is the eigenvector corresponding to the largest eigenvalue
of the adjacency matrix A, as per the Perron-Frobenius Theorem.
Random Walk Interpretation
• Eigenvector centrality reflects the likelihood of a node being visited during a
random walk of infinite length.
• Using power iteration, the centrality values can be computed iteratively:
e(t+1)=Ae(t)
• Starting with e(0) = (1, 1, …, 1)^T, after the first iteration e(1) contains the node degrees.
• After many iterations, the values converge to the eigenvector centrality.
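The power iteration just described can be sketched as follows; the small adjacency matrix is an illustrative toy graph, and the vector is normalized each step so the values stay bounded:

```python
# Power iteration for eigenvector centrality: repeatedly multiply by
# the adjacency matrix and normalize. Starting from all ones, the first
# (unnormalized) product equals the node degrees; repeated products
# converge to the leading eigenvector. The matrix below is a toy example.
def power_iteration(A, num_iters=100):
    n = len(A)
    e = [1.0] * n  # e(0) = (1, 1, ..., 1)^T
    for _ in range(num_iters):
        e = [sum(A[u][v] * e[v] for v in range(n)) for u in range(n)]
        norm = sum(x * x for x in e) ** 0.5
        e = [x / norm for x in e]
    return e

# Triangle 0-1-2 with a pendant node 3 attached to node 2.
A = [[0, 1, 1, 0],
     [1, 0, 1, 0],
     [1, 1, 0, 1],
     [0, 0, 1, 0]]
e = power_iteration(A)
```

Node 2 sits in the triangle and also holds the pendant node, so it ends up with the highest centrality.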
• Application to Florentine Marriage Network
• The Medici family has the highest eigenvector centrality (normalized value: 0.43),
further emphasizing their influence.
• This value is higher than the next most influential family (normalized value: 0.36).
• Other Centrality Measures
• Betweenness Centrality: Measures how often a node lies on the shortest path
between two other nodes.
• Closeness Centrality: Measures the average shortest path length between a
node and all other nodes.
• These additional centrality measures can be even more discerning in
characterizing node importance.
• Limitations of Degree and Centrality
• Degree and eigenvector centrality are useful for identifying prominent nodes,
such as the Medici family in the Florentine marriage network.
• However, they may not fully capture distinctions between other nodes with
similar metrics (e.g., Peruzzi and Guadagni).
• Clustering Coefficient
• Definition: Measures the proportion of closed triangles in a node's local
neighborhood, reflecting how tightly clustered the node's neighbors are:
c_u = |{(v1, v2) ∈ E : v1, v2 ∈ N(u)}| / (d_u choose 2)
• Numerator: Number of edges that actually exist between u's neighbors.
• Denominator: Total possible edges between u's neighbors (d_u is the degree
of u).
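Under this definition, the coefficient can be computed directly from each node's neighbor set. A minimal sketch; the adjacency dict is a made-up example:

```python
# Local clustering coefficient: edges that exist among u's neighbors,
# divided by the number of neighbor pairs, d_u * (d_u - 1) / 2.
# adj maps each node to its set of neighbors; the graph is illustrative.
def clustering_coefficient(adj, u):
    nbrs = adj[u]
    d = len(nbrs)
    if d < 2:
        return 0.0  # fewer than two neighbors: no possible triangles
    links = sum(1 for v in nbrs for w in nbrs if v < w and w in adj[v])
    return links / (d * (d - 1) / 2)

# Triangle 0-1-2 with a pendant node 3 attached to node 2.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
```

Node 0's two neighbors are connected (coefficient 1.0), while node 2's three neighbors share only one edge among the three possible (coefficient 1/3).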
• Application to Florentine Marriage Network
• Peruzzi Family:
• Clustering coefficient: 0.66, indicating a tightly-knit neighborhood.
• Guadagni Family:
• Clustering coefficient: 0, reflecting a "star-like" structure where neighbors are not
interconnected.
• Significance in Real-World Networks
• Real-world networks, such as those in social and biological sciences, often exhibit
higher clustering coefficients compared to randomly generated graphs.
• This tendency highlights the prevalence of tightly-knit communities in real-world
systems.
• Variations of the Clustering Coefficient
• Variants exist to accommodate different graph types, such as directed graphs.
• These variations are extensively discussed in Newman [2018].
• The clustering coefficient is a powerful metric that complements degree and
centrality by capturing structural distinctions in a node's local neighborhood.
Clustering Coefficient and Closed Triangles
• Alternative View: The clustering coefficient measures the ratio of actual closed triangles to the
total possible triangles within a node's ego graph.
• Ego Graph: A subgraph consisting of a node, its neighbors, and all edges among those neighbors.
• Motifs and Graphlets
• Definition: Motifs (or graphlets) are small, recurring patterns or structures within a graph.
• Examples include triangles, cycles of specific lengths, and other small subgraphs.
• Generalization:
• Beyond triangles, motifs can capture more complex structures in a node's ego graph.
• This allows for richer characterization of a node's role or structural properties within the graph.
• Transforming Node-Level to Graph-Level Analysis
• By analyzing a node's ego graph and counting motifs, node-level statistics and features can be
reinterpreted as graph-level problems.
• This shift enables broader applications, leveraging the structural information at a higher level.
• Applications
• Motif-based analysis provides insights into structural patterns, such as identifying community structures,
functional units in biological networks, or social roles in social networks.
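As a concrete instance of motif counting, triangles in each node's ego graph can be counted directly. A sketch, using an illustrative toy graph:

```python
from itertools import combinations

# Count triangle motifs inside a node's ego graph (the node, its
# neighbors, and all edges among them). adj maps node -> neighbor set;
# the example graph is made up.
def ego_triangles(adj, u):
    nodes = sorted(adj[u] | {u})
    return sum(1 for a, b, c in combinations(nodes, 3)
               if b in adj[a] and c in adj[a] and c in adj[b])

# Triangle 0-1-2 with a pendant node 3 attached to node 2.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
```

The same idea extends to other motifs (cycles, stars) by testing different subgraph patterns over the ego graph's node subsets.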
2.2 Neighborhood Overlap Detection
1. Limitations of Node/Graph-Level Statistics:
Node and graph-level statistics are useful for classification tasks.
However, they fail to quantify the relationships between nodes, which
are essential for tasks like relation prediction (predicting if an edge exists
between two nodes).
2. Neighborhood Overlap Measures:
These measures quantify how "related" two nodes are by analyzing the
overlap between their neighborhoods. The simplest such measure counts
common neighbors:
S[u, v] = |N(u) ∩ N(v)|
• where N(u) and N(v) are the neighbors of nodes u and v, respectively.
3. Similarity Matrix:
A similarity matrix summarizes the pairwise neighborhood
overlap between all nodes in the graph.
4. Relation Prediction Using Overlap Measures:
• Even without machine learning, neighborhood overlap measures serve as
powerful baselines for predicting relationships.
• A common assumption is that the likelihood of an edge (u,v) is proportional
to the similarity measure S[u,v]
• b. Sørensen Index
• Normalizes the overlap by the sum of the degrees of nodes u and v. This
prevents the measure from being overly influenced by nodes with large degrees.
c. Salton Index or cosine similarity or cosine coefficient
• d. Jaccard Index
• Adamic-Adar (AA) Index
• Similar to the Resource Allocation (RA) index but uses the logarithm of degrees for weighting.
• Captures the diminishing importance of high-degree neighbors more smoothly.
• Intuition Behind Weighting
• Common neighbors with low degrees are considered more significant because they
represent unique or less probable connections.
• High-degree nodes (e.g., hubs) are more likely to connect to many others, which
dilutes their informativeness in predicting specific relationships.
• Practical Insights
• These measures provide foundational techniques for relation prediction and are
often used as baselines for evaluating graph-based models.
• The choice of measure depends on the context:
• Simple overlap counts for straightforward cases.
• Normalization measures for fairness across varying node degrees.
• Weighted measures for emphasizing informative connections.
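The local overlap measures above can be sketched directly on neighbor sets. The toy graph is illustrative; note that the Adamic-Adar term is undefined when a common neighbor has degree 1, since log 1 = 0:

```python
import math

# Local neighborhood-overlap measures on neighbor sets N(u), N(v).
# adj maps node -> set of neighbors; the example graph is made up.
def sorensen(adj, u, v):
    # Normalizes by the sum of degrees.
    return 2 * len(adj[u] & adj[v]) / (len(adj[u]) + len(adj[v]))

def salton(adj, u, v):
    # Cosine similarity: normalizes by the geometric mean of degrees.
    return len(adj[u] & adj[v]) / math.sqrt(len(adj[u]) * len(adj[v]))

def jaccard(adj, u, v):
    # Normalizes by the size of the union of the neighborhoods.
    return len(adj[u] & adj[v]) / len(adj[u] | adj[v])

def adamic_adar(adj, u, v):
    # Weights each common neighbor w by 1 / log(degree of w), so
    # high-degree hubs contribute less.
    return sum(1 / math.log(len(adj[w])) for w in adj[u] & adj[v])

adj = {0: {1, 2}, 1: {0, 2, 3}, 2: {0, 1, 3}, 3: {1, 2}}
```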
2.2.2 Global Overlap Measures
• 1. Katz Index
• The Katz index computes the similarity between two nodes based on the count of paths of all lengths between them, with
shorter paths receiving higher weight.
• Insights:
• Gives high similarity to nodes connected by many short paths.
• Biases heavily toward high-degree nodes, which motivates alternatives like the LHN index.
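A truncated version of the Katz sum, S = Σ_{i≥1} β^i A^i, can be sketched with plain matrix powers. The path graph below and the choice β = 0.1 are illustrative; β must be small enough for the series to converge:

```python
# Truncated Katz index: sum beta^i * A^i over path lengths i = 1..max_len.
# Entry S[u][v] aggregates paths between u and v, with shorter paths
# weighted more heavily. The path graph 0-1-2 below is a toy example.
def matmul(X, Y):
    n = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def katz_index(A, beta=0.1, max_len=10):
    n = len(A)
    S = [[0.0] * n for _ in range(n)]
    P = [row[:] for row in A]  # current power A^i, starting at A^1
    w = beta                   # current weight beta^i
    for _ in range(max_len):
        for i in range(n):
            for j in range(n):
                S[i][j] += w * P[i][j]
        P = matmul(P, A)
        w *= beta
    return S

A = [[0, 1, 0],
     [1, 0, 1],
     [0, 1, 0]]
S = katz_index(A)
```

In the path graph, nodes 0 and 1 are joined by a length-1 path while 0 and 2 need at least length 2, so S[0][1] exceeds S[0][2].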
• 2. Leicht-Holme-Newman (LHN) Similarity
• The LHN similarity normalizes the Katz index by accounting for the
expected number of paths under a random graph model. This reduces
the bias toward high-degree nodes.
• 3. Random Walk-Based Measures
• Random walk methods compute similarity by simulating a random walk on the graph. These
measures consider the likelihood of reaching one node from another through random transitions.
• Personalized PageRank:
• A variant of PageRank that incorporates a restart probability: at each step, the
walk returns to a designated source node with a fixed probability, which keeps the
stationary distribution localized around that node.
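A sketch of random walk with restart (personalized PageRank) via power iteration; the restart value of 0.15 and the toy graph are illustrative assumptions:

```python
# Personalized PageRank via power iteration: at each step the walker
# moves to a uniformly random neighbor with probability (1 - restart),
# or jumps back to the source node with probability `restart`.
# adj maps node -> set of neighbors; the graph is a made-up example.
def personalized_pagerank(adj, source, restart=0.15, num_iters=200):
    p = {v: 1.0 if v == source else 0.0 for v in adj}
    for _ in range(num_iters):
        nxt = {v: 0.0 for v in adj}
        for v in adj:
            for w in adj[v]:
                nxt[w] += (1 - restart) * p[v] / len(adj[v])
        nxt[source] += restart  # restart mass returns to the source
        p = nxt
    return p

# Triangle 0-1-2 with a pendant node 3 attached to node 2.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
p = personalized_pagerank(adj, source=0)
```

Because the walk keeps restarting at node 0, the resulting distribution concentrates near the source rather than reflecting only node degrees.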