Unit I Graph Theory and concepts

The document provides an overview of graph theory, detailing the structure and types of graphs, as well as their applications in various fields such as social networks and biological systems. It discusses the role of machine learning in analyzing graph data, highlighting tasks like node classification, relation prediction, and community detection, while also addressing the challenges posed by the interconnected nature of graph nodes. Additionally, it covers traditional approaches to graph analysis, including graph statistics and kernel methods, emphasizing the importance of understanding relationships between nodes for effective predictions.


Unit I: Graph Theory and Concepts

Graphs - Graph Structures - Types of Graphs - Machine Learning on Graphs - Background and Traditional
Approaches: Graph Statistics and Kernel Methods - Neighborhood Overlap Detection - Graph Laplacians.
Graph-Introduction
• Graphs are a versatile and widely used data structure for representing complex
systems.
• Graph Components:
• Nodes: Represent objects in the system.
• Edges: Represent interactions or relationships between pairs of objects.
• Applications of Graphs:
• Social Networks: Nodes represent individuals, and edges represent friendships.
• Biological Systems: Nodes can represent proteins, while edges represent biological
interactions like kinetic interactions.
Example
• The graph represents friendship relationships between members of a karate club studied by Wayne W.
Zachary from 1970 to 1972.
• An edge connects two individuals (nodes) if they socialized outside of the club.
• During the study, the club split into two factions, centered around nodes 0 and 33.
• Zachary correctly predicted the faction membership of individuals based on the graph's structure.
• This network is a well-known example in social network analysis and graph theory, demonstrating the
power of graph structures in predicting group behavior.
• Focuses on relationships between points rather than individual properties.
• Highly general: can represent diverse systems like social networks, drug-protein
interactions, molecular structures, and telecommunications networks.
• Provide a mathematical foundation to analyze, understand, and learn from
complex real-world systems.
• In the last 25 years, there has been a significant increase in the quantity and
quality of graph-structured data.
• Examples: Social networks
• Scientific initiatives (e.g., interactome, food webs)
• Molecular databases
• Billions of interconnected web-enabled devices
• The main challenge is unlocking the potential of this massive graph data
• Machine learning offers powerful tools to address the scale and complexity of
modern graph datasets.
• While machine learning is not the only method, it is crucial for advancing the
analysis and understanding of graph data.
What is a graph
• Definition of a Graph:
• A graph G = (V, E) consists of
• A set of nodes V
• A set of edges E between these nodes
• An edge between nodes u and v is represented as (u,v) ∈ E
• Simple Graphs:
• At most one edge exists between any pair of nodes.
• No edges between a node and itself.
• All edges are undirected.
• Adjacency Matrix:
• A graph can be represented as an adjacency matrix A of size |V| × |V|.
• Matrix entries: A[u, v] = 1 if (u, v) ∈ E, and A[u, v] = 0 otherwise.
• For undirected graphs, A is symmetric.
• For directed graphs, A is not necessarily symmetric.
• Weighted Graphs:
• Adjacency matrix entries can take arbitrary real values rather than just {0, 1}.
• Example: In protein-protein interaction graphs, edge weights indicate the strength of association.
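As a sketch, the adjacency-matrix representation described above can be built directly from an edge list (the node indices and edges below are illustrative, not taken from the text):

```python
# A minimal sketch of representing a small undirected graph as an
# adjacency matrix (nodes and edges are illustrative).
edges = [(0, 1), (0, 2), (1, 2), (2, 3)]
n = 4
A = [[0] * n for _ in range(n)]
for u, v in edges:
    A[u][v] = 1
    A[v][u] = 1  # undirected graph: A is symmetric

# For a weighted graph, store a real-valued weight instead of 1,
# e.g. A[u][v] = A[v][u] = w for an edge (u, v) with weight w.
```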
Multi-relational Graphs
• Multi-relational graphs extend standard graphs to include different
types of edges or relations. For example, in drug-drug interaction
graphs, different edges might represent different types of side effects.
• An edge now includes an additional relation type τ. The edge is represented as
(u, τ, v), where:
• u and v are nodes (vertices),
• τ represents the relation type.
• Each edge type τ has its own adjacency matrix A_τ.
• Adjacency Tensor: The entire graph can be represented using an adjacency
tensor A ∈ R^{|V| × |R| × |V|}, where:
• |V| is the number of nodes,
• |R| is the number of relations (types of edges).
• Subsets of Multi-relational Graphs:
Two important types are:
• Heterogeneous Graphs: Different types of nodes and edges.
• Multiplex Graphs: A single set of nodes connected by multiple types (layers) of edges.
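The adjacency-tensor idea can be sketched with nested lists (the relation indices 0 and 1 are hypothetical labels, purely for illustration):

```python
# Hypothetical multi-relational graph stored as an adjacency tensor of
# shape (|V|, |R|, |V|); relation indices 0 and 1 are made-up labels.
num_nodes, num_rels = 3, 2
A = [[[0] * num_nodes for _ in range(num_rels)] for _ in range(num_nodes)]

def add_edge(u, rel, v):
    # assuming undirected relations for this sketch
    A[u][rel][v] = 1
    A[v][rel][u] = 1

add_edge(0, 0, 1)  # edge (u=0, tau=0, v=1)
add_edge(1, 1, 2)  # edge (u=1, tau=1, v=2)

# Slicing out one relation type tau recovers its adjacency matrix A_tau.
A_tau0 = [[A[u][0][v] for v in range(num_nodes)] for u in range(num_nodes)]
```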
Machine Learning on Graphs
• Problem-Driven Discipline: Machine learning focuses on building models that
learn from data to solve specific tasks, categorized as supervised (predicting target
outputs) or unsupervised (identifying patterns like clusters).
• Graph-Specific Challenges: While graphs also use supervised and unsupervised
learning approaches, these categories may not always provide the most
meaningful distinctions for graph-related problems.
• Versatility of Graph Problems: Machine learning tasks on graphs often extend
beyond traditional boundaries, blending supervised and unsupervised
methodologies.
• Supervised Tasks on Graphs: These are common and include predicting node
labels, link existence, or graph-level outputs.
• Blurring Traditional Categories: Graph data tasks frequently combine multiple
approaches or introduce new dimensions, making them distinct from traditional
machine learning frameworks.
1) Node Classification
• Definition and Importance: Node classification aims to predict labels for all nodes in a
graph based on a small, labeled training set. Examples include identifying bots in social
networks or classifying protein functions in biological networks.
• Non-i.i.d Nature of Graphs: Unlike standard supervised classification, graph nodes are not
independent and identically distributed (i.i.d.). The interconnected nature of nodes
introduces dependencies and correlations that need to be modeled.
• Leveraging Connections: Successful node classification methods utilize the relationships
between nodes, often exploiting homophily (nodes tend to share attributes with
neighbors) or structural equivalence (nodes with similar neighborhood structures share
labels).
• Generalization Across Graphs: Tasks may involve classifying nodes in a single graph with
limited labels or generalizing across multiple, potentially disconnected graphs, like across
species' protein interactomes.
• Beyond Homophily: In addition to homophily, heterophily (nodes connecting to nodes with
different labels) and other relational concepts are used to design models that capture
diverse graph structures and relationships.
2. Relation Prediction
• Relation prediction, also known as link prediction or graph completion, aims to infer
missing edges or relationships in a graph, such as predicting unknown protein-protein
interactions or suggesting friendships in social networks.
• Real-World Applications: Common use cases include recommending content on social
platforms, predicting drug side effects, and inferring new facts in relational databases,
showcasing its broad utility.
• Task Setup: The goal is to predict the missing edges E \ E_train in a graph, based
on the observed nodes V and a partial set of edges E_train.
• Task Complexity: Simple graphs (e.g., social networks) can use heuristics like shared
neighbors for predictions, while multi-relational graphs (e.g., biomedical knowledge
graphs) require advanced reasoning and inference methods due to their complexity.
• Graph-Specific Challenges: Relation prediction blurs traditional machine learning
boundaries (supervised vs. unsupervised) and often involves graph-specific inductive
biases. Variants include working with single fixed graphs or making predictions across
multiple disjoint graphs.
3. Clustering and community detection

• Definition and Analogy: Community detection is the graph equivalent of unsupervised
clustering, where the goal is to uncover latent community structures in a graph based on its
nodes and edges.
• Intuition: Real-world networks often exhibit community structures, where nodes within the
same group are more likely to form edges with each other than with nodes in other groups.
• Example Scenario: A collaboration graph (e.g., from Google Scholar) would likely segregate
into clusters based on factors like research areas or institutions, rather than forming a
uniform "hairball" of connections.
• Applications: Community detection is widely used in areas such as identifying functional
modules in genetic interaction networks and detecting fraudulent user groups in financial
transaction networks.
• Challenge: The task involves inferring community structures solely from the input graph
G = (V, E), without prior information about the node groupings or labels.
4. Graph classification, regression,
and clustering
• Definition and Scope: These tasks involve making predictions (classification or regression)
or identifying patterns (clustering) over entire graphs, rather than individual nodes or
edges.
• Applications: Examples include predicting molecular properties like toxicity or solubility in
chemistry, or detecting malicious programs by analyzing graphs of syntax and data flow.
• Dataset Characteristics: These tasks use datasets with multiple independent graphs. The
objective is to predict labels or values specific to each graph, treating each graph as an
independent datapoint.
• Similarity to Standard ML: Graph classification and regression are analogous to traditional
supervised learning, while graph clustering parallels unsupervised clustering. However,
relational structure within graphs must be incorporated into feature representations.
• Challenge: A key difficulty is defining features that effectively capture the complex
relational and structural information inherent in graph data.
Background and Traditional
Approaches
• Basic Graph Statistics: Used for node and graph
classification tasks.
• Kernel Methods: Their application in graph-related
learning tasks.
• Node Neighborhood Overlap:
• Techniques to measure overlap between node neighborhoods.
• Forms the basis for strong heuristics in relation prediction.
• Spectral Clustering Using Graph Laplacians:
• A well-established algorithm for clustering and community
detection on graphs.
2.1 Graph Statistics and Kernel Methods
• Traditional classification with graph data involves extracting features using heuristic
functions or domain knowledge.
• These features are then used as input to standard machine learning classifiers (e.g.,
logistic regression).
• Node-Level Statistics:
• Important for understanding individual node properties within the graph.
• Often serve as the basis for building features used in machine learning models.
• Graph-Level Features and Kernel Methods
• Generalization to Graph-Level Statistics:
• Node-level statistics can be aggregated or extended to represent entire graphs.
• Graph Kernels:
• Kernel methods allow for comparing graphs by mapping their properties into a high-
dimensional space.
• Designed using graph statistics and properties.
2.1.1 Node-level statistics and
features
• The Florentine marriage network of the 15th century is a well-known
social network used to study power dynamics, particularly the rise of
the Medici family.
• Marriages in this era were strategic, consolidating political power.
• Objective: Identify features or statistics that distinguish the Medici
family node from others in the graph.
• These features can serve as input to a node classification model (e.g.,
logistic regression).
• While small graphs like the Florentine network are insufficient for
training machine learning models, they provide illustrative examples
of useful features
Node Degree as a Feature
• Definition:
• The degree d_u of a node u is the number of edges incident to it.
• For a node u in a graph G = (V, E), the degree is calculated as

d_u = Σ_{v ∈ V} A[u, v]

• where A is the graph's adjacency matrix.
Node Degree
• Variations in Degree:
• Directed graphs: Consider outgoing or incoming edges separately.
• Weighted graphs: Use edge weights instead of simple counts.
• Application to the Florentine Graph:
• Observation: The Medici family has the highest degree in the graph, distinguishing it as a central node.
• Its degree exceeds that of the next closest families (Strozzi and Guadagni) by a 3:2 ratio.
• Limitation:
• While degree is informative, it might not fully capture the distinguishing characteristics
of the Medici family.
• Additional features may be needed for better discrimination.
• General Relevance
• Degree is a fundamental and often highly informative feature for node
classification tasks in graphs.
Node Centrality
• Node Degree vs. Node Centrality
• Node Degree: Measures how many neighbors a node has, but may not fully
capture a node's importance.
• Node Centrality: Provides more nuanced measures of importance by
considering various aspects of a node’s role in the graph.
Eigenvector Centrality
• Definition:
• Accounts for the importance of a node's neighbors.
• A node's centrality is proportional to the average centrality of its neighbors:

e_u = (1/λ) Σ_{v ∈ V} A[u, v] e_v

where λ is a constant.
• Eigenvector Equation: λe = Ae
• The centrality vector e is the eigenvector corresponding to the largest eigenvalue
of the adjacency matrix A, as per the Perron-Frobenius Theorem.
Random Walk Interpretation
• Eigenvector centrality reflects the likelihood of a node being visited during a
random walk of infinite length.
• Using power iteration, the centrality values can be computed iteratively:
e(t+1)=Ae(t)
• Starting with e(0) = (1, 1, …, 1)^T, after the first iteration e(1) contains the node degrees.
• After many iterations, the values converge to the eigenvector centrality.
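The power iteration described above can be sketched as follows (the 4-node graph is illustrative; after one step the vector equals the node degrees, and after many normalized steps it converges to the leading eigenvector):

```python
import numpy as np

# Power iteration for eigenvector centrality: e(t+1) = A e(t), normalized
# each step so values stay bounded. The 4-node graph is illustrative.
A = np.array([
    [0, 1, 1, 0],
    [1, 0, 1, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 0],
], dtype=float)

e = np.ones(A.shape[0])        # e(0) = (1, 1, ..., 1)^T
first_step = A @ e             # e(1): equal to the node degrees
e = first_step
for _ in range(100):
    e = A @ e
    e = e / np.linalg.norm(e)  # renormalize to unit length

# e now approximates the eigenvector of A with the largest eigenvalue.
lam = e @ (A @ e)              # Rayleigh quotient estimate of lambda
```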
• Application to Florentine Marriage Network
• The Medici family has the highest eigenvector centrality (normalized value: 0.43),
further emphasizing their influence.
• This value is higher than the next most influential family (normalized value: 0.36).
• Other Centrality Measures
• Betweenness Centrality: Measures how often a node lies on the shortest path
between two other nodes.
• Closeness Centrality: Measures the average shortest path length between a
node and all other nodes.
• These additional centrality measures can be even more discerning in
characterizing node importance.
• Limitations of Degree and Centrality
• Degree and eigenvector centrality are useful for identifying prominent nodes,
such as the Medici family in the Florentine marriage network.
• However, they may not fully capture distinctions between other nodes with
similar metrics (e.g., Peruzzi and Guadagni).
• Clustering Coefficient
• Definition: Measures the proportion of closed triangles in a node's local
neighborhood, reflecting how tightly clustered the node's neighbors are:

c_u = |{(v1, v2) ∈ E : v1, v2 ∈ N(u)}| / C(d_u, 2)

• Numerator: Counts the edges between neighbors of u.
• Denominator: C(d_u, 2) is the total possible number of edges between u's neighbors,
where d_u is the degree of u.
• Application to Florentine Marriage Network
• Peruzzi Family:
• Clustering coefficient: 0.66, indicating a tightly-knit neighborhood.
• Guadagni Family:
• Clustering coefficient: 0, reflecting a "star-like" structure where neighbors are not
interconnected.
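A minimal sketch of this computation on an illustrative edge set (one tightly-knit node and one looser hub):

```python
from itertools import combinations

# Clustering coefficient sketch: fraction of pairs of u's neighbors that
# are themselves connected. The edge set below is illustrative.
edges = {(0, 1), (0, 2), (1, 2), (2, 3), (2, 4), (3, 4)}

def neighbors(u):
    return {b for a, b in edges if a == u} | {a for a, b in edges if b == u}

def clustering(u):
    nbrs = sorted(neighbors(u))
    d = len(nbrs)
    if d < 2:
        return 0.0
    closed = sum(1 for v, w in combinations(nbrs, 2)
                 if (v, w) in edges or (w, v) in edges)
    return closed / (d * (d - 1) / 2)   # divide by C(d_u, 2)
```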
• Significance in Real-World Networks
• Real-world networks, such as those in social and biological sciences, often exhibit
higher clustering coefficients compared to randomly generated graphs.
• This tendency highlights the prevalence of tightly-knit communities in real-world
systems.
• Variations of the Clustering Coefficient
• Variants exist to accommodate different graph types, such as directed graphs.
• These variations are extensively discussed in Newman [2018].
• The clustering coefficient is a powerful metric that complements degree and
centrality by capturing structural distinctions in a node's local neighborhood.
Clustering Coefficient and Closed Triangles
• Alternative View: The clustering coefficient measures the ratio of actual closed triangles to the
total possible triangles within a node's ego graph.
• Ego Graph: A subgraph consisting of a node, its neighbors, and all edges among those neighbors.
• Motifs and Graphlets
• Definition: Motifs (or graphlets) are small, recurring patterns or structures within a graph.
• Examples include triangles, cycles of specific lengths, and other small subgraphs.
• Generalization:
• Beyond triangles, motifs can capture more complex structures in a node's ego graph.
• This allows for richer characterization of a node's role or structural properties within the graph.
• Transforming Node-Level to Graph-Level Analysis
• By analyzing a node's ego graph and counting motifs, node-level statistics and features can be
reinterpreted as graph-level problems.
• This shift enables broader applications, leveraging the structural information at a higher level.
• Applications
• Motif-based analysis provides insights into structural patterns, such as identifying community structures,
functional units in biological networks, or social roles in social networks.
2.2 Neighborhood Overlap Detection
1. Limitations of Node/Graph-Level Statistics:
Node and graph-level statistics are useful for classification tasks.
However, they fail to quantify the relationships between nodes, which
are essential for tasks like relation prediction (predicting if an edge exists
between two nodes).
2. Neighborhood Overlap Measures:
These measures quantify how "related" two nodes are by analyzing the
overlap in their neighborhoods. The simplest measure counts the number of
shared neighbors between two nodes u and v:

S[u, v] = |N(u) ∩ N(v)|

• where N(u) and N(v) are the neighbor sets of nodes u and v, respectively.
3. Similarity Matrix:
A similarity matrix summarizes the pairwise neighborhood
overlap between all nodes in the graph.
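For a binary adjacency matrix, the full common-neighbor similarity matrix can be sketched as a single matrix product (the graph below is illustrative):

```python
import numpy as np

# With a binary adjacency matrix, the common-neighbor similarity matrix
# is S = A @ A: S[u, v] counts length-2 paths between u and v, i.e.
# their shared neighbors. The graph is illustrative.
A = np.array([
    [0, 1, 1, 0],
    [1, 0, 1, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 0],
])

S = A @ A
# Note: the diagonal entry S[u, u] equals the degree of u.
```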
4. Relation Prediction Using Overlap Measures:
• Even without machine learning, neighborhood overlap measures serve as
powerful baselines for predicting relationships.
• A common assumption is that the likelihood of an edge (u, v) is proportional
to the similarity measure S[u, v].
• By setting a threshold on S[u, v], one can decide whether to predict an edge
between two nodes.
5. Training and Testing in Relation Prediction:
• In practice, only a subset of edges (the training edges E_train) is known.
• The goal is to compute similarity measures based on E_train and accurately
predict the existence of unseen test edges.
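The threshold-based baseline described above can be sketched on a hypothetical training edge set (nodes, edges, and threshold are illustrative):

```python
# Common-neighbor baseline for relation prediction: score node pairs by
# shared neighbors computed on the training edges only, then threshold.
# The training edge set is illustrative.
train_edges = {(0, 1), (0, 2), (1, 2), (2, 3), (3, 4)}

def neighbors(u):
    return ({b for a, b in train_edges if a == u}
            | {a for a, b in train_edges if b == u})

def score(u, v):
    return len(neighbors(u) & neighbors(v))

# Predict an edge whenever the score meets a chosen threshold.
threshold = 1
predict_13 = score(1, 3) >= threshold   # nodes 1 and 3 share neighbor 2
predict_04 = score(0, 4) >= threshold   # nodes 0 and 4 share no neighbor
```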
Full Graph:
• Represents the entire network with all edges (both training and test edges).
• The solid blue edges are the training edges.
• The dashed red edges are the test edges that need to be predicted.
Training Graph:
• A subsampled version of the full graph, where the test edges (red dashed lines) have been removed.
• This graph is used for training models or computing overlap statistics.
2.2.1 Local overlap measures
1. Basic Overlap Count
• The simplest measure of neighborhood overlap is the count of common
neighbors:
S[u, v] = |N(u) ∩ N(v)|
• However, this measure can be biased toward nodes with high degrees.
2. Normalized Overlap Measures
• To address the bias caused by node degree, several normalization techniques
are introduced:
• a. Sorensen Index

S_Sorensen[u, v] = 2 |N(u) ∩ N(v)| / (d_u + d_v)

• Normalizes the overlap by the sum of the degrees of nodes u and v. This
prevents the measure from being overly influenced by nodes with large degrees.
• b. Salton Index (cosine similarity or cosine coefficient)

S_Salton[u, v] = |N(u) ∩ N(v)| / √(d_u · d_v)

• Normalizes using the product of the degrees.
• Useful in balancing the influence of degrees in large, heterogeneous
graphs.

• c. Jaccard Index

S_Jaccard[u, v] = |N(u) ∩ N(v)| / |N(u) ∪ N(v)|

• Considers both the shared and total neighbors of nodes u and v.
• A widely used metric for neighborhood similarity.
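The three normalized measures can be sketched over explicit neighbor sets (the sets below are illustrative, with degrees d_u = |N(u)|):

```python
import math

# Sketches of the normalized overlap measures over explicit neighbor
# sets N(u), N(v); degrees are d_u = |N(u)|. Sets are illustrative.
def sorensen(Nu, Nv):
    return 2 * len(Nu & Nv) / (len(Nu) + len(Nv))

def salton(Nu, Nv):
    return len(Nu & Nv) / math.sqrt(len(Nu) * len(Nv))

def jaccard(Nu, Nv):
    return len(Nu & Nv) / len(Nu | Nv)

Nu, Nv = {1, 2, 3}, {2, 3, 4}   # two shared neighbors: {2, 3}
```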
3. Weighted Overlap Measures
• These measures extend beyond simple counting by considering the importance
of common neighbors:
a. Resource Allocation (RA) Index

S_RA[u, v] = Σ_{w ∈ N(u) ∩ N(v)} 1 / d_w

• Assigns more weight to common neighbors with low degrees.
• Low-degree neighbors are more informative since their connections are fewer
and more specific.
b. Adamic-Adar (AA) Index

S_AA[u, v] = Σ_{w ∈ N(u) ∩ N(v)} 1 / log(d_w)

• Similar to the RA index but uses the logarithm of degrees for weighting.
• Captures the diminishing importance of high-degree neighbors more smoothly.
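Both weighted measures can be sketched directly from their sums (the common-neighbor set and degrees below are hypothetical):

```python
import math

# RA and AA indices over a set of common neighbors, weighting each
# shared neighbor w by 1/d_w or 1/log(d_w). Degrees are hypothetical.
def resource_allocation(common, degree):
    return sum(1 / degree[w] for w in common)

def adamic_adar(common, degree):
    return sum(1 / math.log(degree[w]) for w in common)

degree = {5: 2, 6: 10}   # two shared neighbors with degrees 2 and 10
common = {5, 6}
ra = resource_allocation(common, degree)   # 1/2 + 1/10
aa = adamic_adar(common, degree)           # 1/ln(2) + 1/ln(10)
```

Note how the low-degree neighbor (degree 2) dominates both scores, matching the intuition that rarer connections are more informative.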
• Intuition Behind Weighting
• Common neighbors with low degrees are considered more significant because they
represent unique or less probable connections.
• High-degree nodes (e.g., hubs) are more likely to connect to many others, which
dilutes their informativeness in predicting specific relationships.
• Practical Insights
• These measures provide foundational techniques for relation prediction and are
often used as baselines for evaluating graph-based models.
• The choice of measure depends on the context:
• Simple overlap counts for straightforward cases.
• Normalization measures for fairness across varying node degrees.
• Weighted measures for emphasizing informative connections.
2.2.2 Global Overlap Measures
• 1. Katz Index
• The Katz index computes the similarity between two nodes based on the count of paths of all lengths between them, with
shorter paths receiving higher weight:

S_Katz[u, v] = Σ_{i=1}^{∞} β^i A^i[u, v], where 0 < β < 1 controls the discount applied to longer paths.

• Insights:
• Gives high similarity to nodes connected by many short paths.
• Biases heavily toward high-degree nodes, which motivates alternatives like the LHN index.
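The infinite sum has a closed form, S = (I − βA)^{-1} − I, valid when β is smaller than the reciprocal of A's largest eigenvalue; a sketch on an illustrative graph:

```python
import numpy as np

# Katz index via its closed form S = (I - beta*A)^{-1} - I, which equals
# the sum of beta^i * A^i over path lengths i >= 1, provided beta is
# smaller than 1/lambda_max. Graph and beta are illustrative.
A = np.array([
    [0, 1, 1, 0],
    [1, 0, 1, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 0],
], dtype=float)

beta = 0.1   # well below 1/lambda_max for this graph
I = np.eye(A.shape[0])
S = np.linalg.inv(I - beta * A) - I
```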
• 2. Leicht-Holme-Newman (LHN) Similarity
• The LHN similarity normalizes the Katz index by accounting for the
expected number of paths under a random graph model. This reduces
the bias toward high-degree nodes.
• 3. Random Walk-Based Measures
• Random walk methods compute similarity by simulating a random walk on the graph. These
measures consider the likelihood of reaching one node from another through random transitions.
• Personalized PageRank:
• A variant of PageRank that incorporates a restart probability: at each step, the
random walk returns to the source node u with some probability c, so the resulting
stationary distribution concentrates around u and can be read as a node-to-node
similarity score.