
Lecture 6: Seminars in Bioinformatics: Graph Mining
Prof. Dr. Taysir Hassan A. Soliman
Information Systems Department
Faculty of Computers and Information
Assiut University
November 27, 2022

31/12/2022 Prof. Taysir Hassan Soliman ; Seminars in Bioinformatics 1


Applications of Graph Mining
• Graphs are becoming increasingly important for modeling complicated structures, such as:
1. Chemical compounds (e.g., whether a particular substructure exists in a chemical compound)
2. Protein structures (e.g., whether a specific protein structure exists)
3. Biological networks
4. Social networks (detection of friendships, communities, and influence between users)
5. The Web
6. Workflows
7. XML documents
• Graph mining can analyze the properties of a real-world graph
• It can predict how the structure and properties of a real graph might affect some application
• It can help develop models that generate realistic graphs matching the patterns found in real graphs
Applications of Graph Mining (Cont…)
• There have been studies on the use of frequent structures:
✓As features to classify chemical compounds
✓To study protein structural families
✓On the detection of considerably large frequent subpathways in
metabolic networks
✓On the use of frequent graph patterns for graph indexing and
similarity search in graph databases.



Applications (Cont…)
• Find frequent subgraphs within a single graph
• Find frequent subgraphs across several graphs
• Frequent subgraphs are useful for
• characterizing graph sets,
• discriminating different groups of graphs,
• classifying and clustering graphs,
• building graph indices, and facilitating similarity search in graph
databases.

Basic Terminologies:
• Directed vs Undirected graphs
• Weighted vs Unweighted graphs
• Rooted vs. unrooted graphs: a rooted graph has a root (a main node) to start at
• NP-complete problem: any of a class of computational problems for which no efficient (polynomial-time) solution algorithm has been found. Many significant computer-science problems belong to this class, e.g., the traveling salesman problem, satisfiability problems, and graph-covering problems.



[Figure: examples of an undirected graph, a directed graph, a rooted graph, and a weighted graph]



Frequent Graph Mining
• Among the various kinds of graph patterns, frequent substructures
are the very basic patterns that can be discovered in a collection of
graphs.



Important Notations & Definitions:
• A graph is denoted by g.
• The vertex set of a graph g is denoted by V(g).
• The edge set is denoted by E(g).
• A label function, L, maps a vertex or an edge to a label.
• A graph g is a subgraph of another graph g' if there exists a subgraph isomorphism from g to g'.
• Given a labeled graph dataset, D = {G1, G2, ..., Gn}, we define support(g) (or frequency(g)) as the percentage (or number) of graphs in D where g is a subgraph.
• A frequent graph is a graph whose support is no less than a minimum support threshold, min_sup.
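The definitions above can be illustrated with a small sketch. It is not taken from the lecture: the graph representation (a dict of vertex labels plus a dict of edge labels), the helper names `make_graph`, `is_subgraph`, and `support`, and the toy dataset are all assumptions made for illustration. The subgraph test is a brute-force search over injective vertex mappings, which is only practical for very small graphs.

```python
from itertools import permutations

# Assumed representation: a labeled undirected graph is a pair
# (vertex_labels, edge_labels), where vertex_labels maps vertex -> label
# and edge_labels maps frozenset({u, v}) -> label.

def make_graph(vertex_labels, edges):
    """Build a graph from a vertex-label dict and (u, v, label) triples."""
    return vertex_labels, {frozenset((u, v)): lab for u, v, lab in edges}

def is_subgraph(pattern, graph):
    """True if an injective, label-preserving mapping embeds pattern in graph."""
    p_vlab, p_elab = pattern
    g_vlab, g_elab = graph
    p_vertices = list(p_vlab)
    # Try every injective assignment of pattern vertices to graph vertices.
    for image in permutations(g_vlab, len(p_vertices)):
        mapping = dict(zip(p_vertices, image))
        if any(p_vlab[v] != g_vlab[mapping[v]] for v in p_vertices):
            continue  # a vertex label does not match
        if all(g_elab.get(frozenset(mapping[v] for v in e)) == lab
               for e, lab in p_elab.items()):
            return True  # every pattern edge exists with the same label
    return False

def support(pattern, dataset):
    """Number of graphs in the dataset that contain the pattern."""
    return sum(is_subgraph(pattern, g) for g in dataset)

# Hypothetical toy dataset D = {G1, G2, G3}
g1 = make_graph({1: "C", 2: "C", 3: "O"}, [(1, 2, "-"), (2, 3, "=")])
g2 = make_graph({1: "C", 2: "O"}, [(1, 2, "=")])
g3 = make_graph({1: "C", 2: "N"}, [(1, 2, "-")])
dataset = [g1, g2, g3]

pattern = make_graph({"a": "C", "b": "O"}, [("a", "b", "=")])
min_sup = 2
count = support(pattern, dataset)
print(count, count >= min_sup)
```

Here the pattern occurs in G1 and G2 but not in G3, so with min_sup = 2 it counts as a frequent graph.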



How can we discover frequent substructures?
The discovery of frequent substructures usually consists of two steps:
• In the first step, we generate frequent substructure candidates.
• In the second step, the frequency of each candidate is checked.
• Most studies on frequent substructure discovery focus on the optimization of the first step, because the second step involves a subgraph isomorphism test whose computational complexity is excessively high (i.e., NP-complete).



Example
• Does this subgraph exist in this graph dataset?
[Figures: example graph dataset and candidate subgraph]


Approaches: Apriori-based Approach
• There are two basic approaches to this problem: an Apriori-based approach and a
pattern-growth approach.
• Apriori-based frequent substructure mining algorithms share similar
characteristics with Apriori-based frequent itemset mining algorithms
• The search for frequent graphs starts with graphs of small “size,” and proceeds in
a bottom-up manner by generating candidates having an extra vertex, edge, or
path.
• The definition of graph size depends on the algorithm used.



AprioriGraph Algorithm

• Sk is the frequent substructure set of size k.
1. AprioriGraph adopts a level-wise mining methodology.
2. At each iteration, the size of newly discovered frequent substructures is increased by one.
3. These new substructures are first generated by joining two similar but slightly different frequent subgraphs that were discovered in the previous call to AprioriGraph.
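The level-wise idea can be sketched as follows. This is a deliberately simplified illustration, not the AprioriGraph pseudocode from the slide. It assumes that vertex labels are unique within each graph, so a graph is just a set of edges and the subgraph test reduces to set inclusion, which sidesteps the subgraph isomorphism problem. It is written as a loop rather than a recursive call, it takes size to mean the number of edges, and the names `edge`, `is_connected`, `support`, and `apriori_graph` are hypothetical.

```python
from itertools import combinations

def edge(u, v):
    """An undirected edge between two uniquely labeled vertices."""
    return frozenset((u, v))

def is_connected(edges):
    """True if a non-empty edge set forms a single connected component."""
    edges = list(edges)
    reached = set(edges[0])
    remaining = edges[1:]
    changed = True
    while changed:
        changed = False
        for e in list(remaining):
            if e & reached:  # edge touches the component found so far
                reached |= e
                remaining.remove(e)
                changed = True
    return not remaining

def support(pattern, dataset):
    """Number of graphs (edge sets) that contain every edge of the pattern."""
    return sum(pattern <= g for g in dataset)

def apriori_graph(dataset, min_sup):
    """Return {k: set of frequent connected patterns with k edges}."""
    all_edges = set().union(*dataset)
    level = {frozenset([e]) for e in all_edges
             if support(frozenset([e]), dataset) >= min_sup}
    result, k = {}, 1
    while level:
        result[k] = level
        # Join two size-k patterns that differ in exactly one edge.
        candidates = {a | b for a, b in combinations(level, 2)
                      if len(a | b) == k + 1 and is_connected(a | b)}
        # Keep only the candidates that are actually frequent.
        level = {c for c in candidates if support(c, dataset) >= min_sup}
        k += 1
    return result

# Hypothetical dataset of three small graphs
dataset = [
    frozenset([edge("A", "B"), edge("B", "C"), edge("C", "D")]),
    frozenset([edge("A", "B"), edge("B", "C"), edge("A", "C")]),
    frozenset([edge("A", "B"), edge("B", "C")]),
]
result = apriori_graph(dataset, min_sup=2)
for size, patterns in result.items():
    print(size, [sorted(sorted(e) for e in p) for p in patterns])
```

With min_sup = 2, the edges A-B and B-C are frequent at size 1, and joining them yields the single frequent size-2 pattern A-B-C. In the general labeled-graph setting, both the join and the support count are considerably more involved.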



Other Apriori-based Graph Mining Algorithms
• Apriori-based algorithms for frequent substructure mining include AGM, FSG, and a path-join method.
• AGM shares similar characteristics with Apriori-based itemset mining.
• FSG and the path-join method explore edges and connections in an Apriori-based fashion.

• Each of these methods explores various candidate generation strategies.



AGM Algorithm
• Uses a vertex-based candidate generation method that increases the substructure size by one vertex at each
iteration of AprioriGraph.
• Two size-k frequent graphs are joined only if they have the same size-(k - 1) subgraph.
• Here, graph size is the number of vertices in the graph.
• The newly formed candidate includes the size-(k - 1) subgraph in common and the additional two vertices from the two size-k patterns.



FSG (Frequent Subgraph Mining) Algorithm
• FSG adopts an edge-based candidate generation strategy that increases the substructure size by one edge in
each call of AprioriGraph.
• Two size-k patterns are merged if and only if they share the same subgraph having k - 1 edges, which is
called the core.
• Here, graph size is taken to be the number of edges in the graph.

• The newly formed candidate includes the core and the additional two edges from the size-k patterns. Each candidate has one more edge than these two patterns.
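A minimal sketch of the edge-based join follows. As a simplifying assumption that is not part of FSG itself, every vertex carries a unique label, so a pattern is simply a set of edges and the core is the set intersection. The function name `fsg_join` is hypothetical.

```python
def fsg_join(p, q):
    """Join two size-k edge sets that share a (k - 1)-edge core.

    Returns the size-(k + 1) candidate, or None if the patterns
    do not share such a core.
    """
    k = len(p)
    core = p & q
    if len(q) != k or len(core) != k - 1:
        return None
    # Candidate = core + the one extra edge from each pattern.
    return core | (p - core) | (q - core)

ab, bc, cd, de = (frozenset(e) for e in ("AB", "BC", "CD", "DE"))

p = frozenset([ab, bc])   # path A-B-C, size 2
q = frozenset([bc, cd])   # path B-C-D, size 2
r = frozenset([cd, de])   # path C-D-E, size 2

candidate = fsg_join(p, q)        # core is {B-C}
print(sorted("".join(sorted(e)) for e in candidate))
print(fsg_join(p, r))             # no shared core with 1 edge
```

Under this assumption each pair of patterns yields at most one candidate. With repeated labels, the core may be embedded in more than one way, so a single pair of patterns can produce several candidates.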



Edge-disjoint Path Method

• Graphs are classified by the number of disjoint paths they have.
• Two paths are edge-disjoint if they do not share any common edge.
• A substructure pattern with k + 1 disjoint paths is generated by joining substructures with k disjoint paths.
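The edge-disjointness condition can be checked directly. In this small sketch a path is assumed to be given as a sequence of vertices in an undirected graph; the helper names `path_edges` and `edge_disjoint` are illustrative only.

```python
def path_edges(path):
    """Undirected edges traversed by a path given as a vertex sequence."""
    return {frozenset(pair) for pair in zip(path, path[1:])}

def edge_disjoint(p, q):
    """True if the two paths do not share any edge."""
    return path_edges(p).isdisjoint(path_edges(q))

p1 = ["a", "b", "c"]
p2 = ["c", "d", "a"]   # shares vertices a and c with p1, but no edge
p3 = ["d", "c", "b"]   # traverses edge b-c, which p1 also uses

print(edge_disjoint(p1, p2))
print(edge_disjoint(p1, p3))
```

Note that edge-disjoint paths may still share vertices, as p1 and p2 do here.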

Disadvantages of Apriori-based approaches:
• Considerable overhead when joining two size-k frequent substructures to generate size-(k + 1) graph candidates.
• They must use a breadth-first search (BFS) strategy for candidate generation: to determine whether a size-(k + 1) graph is frequent, the algorithm must check all of its corresponding size-k subgraphs to obtain an upper bound on its frequency.



Pattern-growth approach
• The pattern-growth approach can use breadth-first search as well as depth-first search (DFS), the latter of which consumes less memory.
• The gSpan algorithm is designed to reduce the generation of duplicate graphs.
• It need not search previously discovered frequent graphs for duplicate detection.
• It does not extend any duplicate graph, yet still guarantees the discovery of the complete set of frequent graphs.
• gSpan adopts depth-first search.
• A starting vertex is randomly chosen, and the vertices in a graph are marked so that we can tell which vertices have been visited.

• The visited vertex set is expanded repeatedly until a full depth-first
search (DFS) tree is built.
• One graph may have various DFS trees depending on how the depth-
first search is performed (i.e., the vertex visiting order).
• Given a DFS tree T , we call the starting vertex in T , v0, the root. The
last visited vertex, vn, is called the right-most vertex. The straight path
from v0 to vn is called the right-most path.
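The DFS tree, right-most vertex, and right-most path can be sketched as below. The adjacency-list representation and the function names `dfs_tree` and `rightmost_path` are assumptions made for this illustration. The root is passed in explicitly rather than chosen at random, and the visiting order simply follows the order of each neighbor list.

```python
def dfs_tree(adjacency, root):
    """Return (visit order, parent map) of a DFS that follows neighbor order."""
    order, parent = [], {root: None}

    def visit(v):
        order.append(v)              # mark v as visited
        for w in adjacency[v]:
            if w not in parent:      # w has not been visited yet
                parent[w] = v        # (v, w) becomes a tree edge
                visit(w)

    visit(root)
    return order, parent

def rightmost_path(order, parent):
    """Path in the DFS tree from the root v0 to the last visited vertex vn."""
    path, v = [], order[-1]
    while v is not None:
        path.append(v)
        v = parent[v]
    return path[::-1]

# The same graph with two different neighbor orders for vertex "a"
adj1 = {"a": ["b", "c"], "b": ["a", "c", "d"], "c": ["a", "b"], "d": ["b"]}
adj2 = {"a": ["c", "b"], "b": ["a", "c", "d"], "c": ["a", "b"], "d": ["b"]}

order1, parent1 = dfs_tree(adj1, "a")
order2, parent2 = dfs_tree(adj2, "a")
print(order1, rightmost_path(order1, parent1))
print(order2, rightmost_path(order2, parent2))
```

Both runs start at the same root, but the different visiting orders give two different DFS trees: the first has right-most path a, b, d, and the second has right-most path a, c, b, d.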



• gSpan: Given a graph G and a DFS tree T in G, a new edge e can be added between the right-most vertex and another vertex on the right-most path (backward extension),
• or it can introduce a new vertex and connect it to a vertex on the right-most path (forward extension).
• Both kinds of extension are called right-most extensions.
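The two kinds of right-most extension can be enumerated with a short sketch. It ignores vertex and edge labels and does not implement gSpan's duplicate detection. The function name `rightmost_extensions`, its parameters, and the example pattern are hypothetical.

```python
def rightmost_extensions(edges, rightmost_path, new_vertex):
    """List right-most extensions of a pattern as (kind, (u, v)) pairs.

    edges:          existing undirected edges of the pattern, as (u, v) pairs
    rightmost_path: vertices from the root v0 to the right-most vertex vn
    new_vertex:     name to use for the vertex added by a forward extension
    """
    existing = {frozenset(e) for e in edges}
    rightmost = rightmost_path[-1]
    # Backward: connect the right-most vertex to another vertex on the
    # right-most path, unless that edge is already present.
    backward = [("backward", (rightmost, v)) for v in rightmost_path[:-1]
                if frozenset((rightmost, v)) not in existing]
    # Forward: attach a new vertex to any vertex on the right-most path.
    forward = [("forward", (v, new_vertex)) for v in rightmost_path]
    return backward + forward

# Pattern with tree edges a-b, b-c, b-d and right-most path a, b, d
pattern_edges = [("a", "b"), ("b", "c"), ("b", "d")]
extensions = rightmost_extensions(pattern_edges, ["a", "b", "d"], "e")
for kind, e in extensions:
    print(kind, e)
```

For this pattern there is one backward extension (d to a, since d-b already exists) and three forward extensions (a new vertex e attached to a, b, or d). Vertex c is not on the right-most path, so it is never extended.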
