Data Mining and BI: Social Network Analytics: Random Graphs
Network Analytics
Random Graphs
Degree Distribution
● What is the probability that a node has 0, 1, 2, 3, … edges?
● Probabilities sum to 1
● The degree distribution can be approximated by a Poisson distribution
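A minimal sketch of the Poisson approximation, assuming the standard Erdös-Renyi setup in which each of a node's N-1 possible edges is present independently with probability p, so the exact degree distribution is binomial. The values of `N` and `p` and the function names are illustrative choices, not taken from the slides.

```python
import math

def binomial_pmf(k, n, p):
    """Probability that a node has exactly k edges when each of its
    n possible edges is present independently with probability p."""
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

def poisson_pmf(k, z):
    """Poisson approximation with mean degree z."""
    return math.exp(-z) * z**k / math.factorial(k)

N = 1000          # number of nodes (illustrative)
p = 0.004         # edge probability (illustrative)
z = p * (N - 1)   # average degree

# Compare the exact and approximate probabilities for small degrees.
for k in range(9):
    print(k, round(binomial_pmf(k, N - 1, p), 4), round(poisson_pmf(k, z), 4))

# The probabilities over all possible degrees 0..N-1 sum to 1.
total = sum(binomial_pmf(k, N - 1, p) for k in range(N))
print("sum of probabilities:", round(total, 6))
```

The two columns should agree closely when N is large and p is small.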
Quiz
● The maximum degree of a node in a simple (no multiple edges between the
same two nodes) N node graph is
○ N
○ N-1
○ N-2
Fact
● In an Erdos-Renyi random graph the maximal degree does not vary much
from the average
○ The degrees of the nodes tend to be similar
Fact
● Random networks do not have large hubs
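The two facts above can be checked with a small simulation. This is a sketch under the assumption that every pair of nodes is linked independently with probability p; the function name `er_degrees` and the values of `n` and `z` are illustrative.

```python
import random

def er_degrees(n, p, seed=0):
    """Degrees of an Erdos-Renyi random graph on n nodes, where each
    pair of nodes is connected independently with probability p."""
    rng = random.Random(seed)
    degree = [0] * n
    for i in range(n):
        for j in range(i + 1, n):
            if rng.random() < p:
                degree[i] += 1
                degree[j] += 1
    return degree

n, z = 2000, 6.0                      # illustrative size and target average degree
degrees = er_degrees(n, z / (n - 1))  # p chosen so that the mean degree is about z
avg = sum(degrees) / n
print("average degree:", round(avg, 2))
print("maximum degree:", max(degrees))
```

The maximum stays within a small multiple of the average, far below the N-1 that a true hub could reach.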
Giant Component
● As N increases, a giant component emerges
○ I.e. a connected subgraph that comprises a large fraction of the whole graph
● What is the average degree z at which the giant component starts to emerge?
○ 0
○ 1
○ 3/2
○ 3
Percolation threshold
● Percolation threshold: how many edges need to be added before the giant component appears?
● As the average degree increases to z = 1, a giant component suddenly appears
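The threshold can be observed numerically. The sketch below makes a simplifying assumption: instead of flipping a coin for every pair, it adds about z·N/2 edges between uniformly random endpoints, which gives an average degree close to z for large N. Components are tracked with a union-find structure; all names are illustrative.

```python
import random

def largest_component_fraction(n, z, seed=0):
    """Fraction of nodes in the largest component of a random graph
    with n nodes and about z * n / 2 uniformly random edges."""
    rng = random.Random(seed)
    parent = list(range(n))
    size = [1] * n

    def find(x):
        # Follow parent pointers to the root, halving the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for _ in range(int(z * n / 2)):
        # Rare self-loops and repeated edges are tolerated in this sketch.
        a, b = rng.randrange(n), rng.randrange(n)
        ra, rb = find(a), find(b)
        if ra != rb:
            if size[ra] < size[rb]:
                ra, rb = rb, ra
            parent[rb] = ra          # merge the smaller component into the larger
            size[ra] += size[rb]
    return max(size[find(v)] for v in range(n)) / n

for z in (0.5, 1.0, 1.5, 2.0, 3.0):
    print(z, round(largest_component_fraction(20000, z), 3))
```

Below z = 1 the largest component is a negligible fraction of the graph; above it the fraction grows quickly.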
Giant component: Another angle
● How many other friends besides you does each of your friends have?
○ By property of degree distribution the average degree of your friends, you excluded, is z
○ so at z = 1, each of your friends is expected to have another friend, who in turn has another
friend, etc.
○ the giant component emerges
Why just one giant component?
● What if you had two giant components? How long could they remain separate as the network
densifies?
Average Shortest Path
● How many hops on average between each pair of nodes?
● again, each of your friends has z = avg. degree friends besides you
● ignoring loops, the number of people you have at distance l is $z^l$
Friends at distance l
$N_l = z^l$
$l_{av} \sim \log N / \log z$
What does it mean in practice
● Erdös-Renyi networks can grow to be very large but nodes will be just a few
hops apart
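To put numbers on this: if the number of people at distance l grows like z to the power l, the whole network of N nodes is reached once that count is about N, which gives roughly log N / log z hops. The sketch below evaluates this rough estimate; the function name and the choice z = 10 are illustrative assumptions.

```python
import math

def estimated_avg_path_length(n, z):
    """Rough estimate l_av ~ log(N) / log(z); only meaningful for z > 1."""
    if z <= 1:
        raise ValueError("estimate needs average degree z > 1")
    return math.log(n) / math.log(z)

# Growing the network a million-fold adds only a handful of hops.
for n in (10**3, 10**6, 10**9):
    print(n, round(estimated_avg_path_length(n, 10), 1))
```

With an average degree of 10, networks of a thousand, a million and a billion nodes come out at about 3, 6 and 9 hops respectively.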
Logarithmic axes
● powers of a number will be uniformly spaced ($2^0, 2^1, 2^2, 2^3, 2^4, \dots$)
Erdös-Renyi avg. shortest path in log-log
Realism
● Consider alternative mechanisms of constructing a network that are also fairly
“random”.
● How do they stack up against Erdös-Renyi?
Other models
Introduction model
● Prob-link is the p (probability of any two nodes sharing an edge) that we are
used to
● But, with probability prob-intro the other node is selected among one of our
friends’ friends and not completely at random
Static Geographical model
● Each node connects to num-neighbors of its closest neighbors
Random encounter
● People move around randomly and connect to people they bump into
Growth model
● Instead of starting out with a fixed number of nodes, nodes are added over
time
Conclusion
● in some instances the ER model is plausible
● if dynamics are different, ER model may be a poor fit
Growth and preferential attachment models
Example online Q&A site
Uneven participation
● Many people who have replied only a few times vs. a few people who have replied many times
Real-world degree distributions
● Sexual networks
● Great variation in contact numbers
● Many people with a small number of partners vs. a few people with a high number of partners
Power-law distribution
● High skew (asymmetry)
● Straight line on a log-log plot (right) vs. linear plot (left)
Poisson distribution
● Little skew (asymmetry)
● Curved on a log-log plot (right) vs. linear plot (left)
Power law distribution
● Straight line on a log-log plot
$\ln p(k) = c - \alpha \ln k$
● Exponentiate both sides to get that p(k), the probability of observing a node of degree k, is given by:
$p(k) = C k^{-\alpha}$, where $C = e^{c}$
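The straight-line relation suggests a simple way to read off the exponent: fit a line to ln p(k) against ln k and take the negative of its slope. The sketch below does this on exact, noise-free values generated from an assumed C = 0.5 and α = 2.5; the function name is illustrative. For noisy empirical data a plain least-squares fit like this can be unreliable, so treat it only as an illustration of the relation.

```python
import math

def fit_power_law_exponent(ks, pks):
    """Least-squares line through (ln k, ln p(k)); returns (alpha, C)."""
    xs = [math.log(k) for k in ks]
    ys = [math.log(p) for p in pks]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    # slope = -alpha and intercept = c = ln C
    return -slope, math.exp(intercept)

ks = list(range(1, 101))
pks = [0.5 * k ** -2.5 for k in ks]   # synthetic values with C = 0.5, alpha = 2.5
alpha, C = fit_power_law_exponent(ks, pks)
print("alpha:", round(alpha, 3), "C:", round(C, 3))
```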
m=2
Random network growth
● one node is born at each time tick
● at time t there are t nodes
● change in degree $k_i$ of node i (born at time i, with 0 < i < t):
$\frac{dk_i(t)}{dt} = \frac{m}{t}$
● There are m new edges being added per unit time (with 1 new node)
● The m edges are being distributed among t nodes
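Integrating the growth rate m/t from a node's birth to the present gives its expected degree. The derivation below is a sketch under two assumptions not spelled out on the slide: time is treated as continuous, and node i is born with its own m edges, so that its degree at birth is m.

```latex
\frac{dk_i(t)}{dt} = \frac{m}{t}, \qquad k_i(i) = m
\;\Longrightarrow\;
k_i(t) = m + \int_i^t \frac{m}{s}\,ds = m + m \ln\frac{t}{i}
```

Under these assumptions the degree grows only logarithmically with age, and a smaller birth time i always gives a larger expected degree.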
Age and degree
● On average $k_i(t) > k_j(t)$ for i < j
● Older nodes on average have higher degrees
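The age effect can be seen in a short simulation of growth without any preference: each new node attaches its m edges to existing nodes chosen uniformly at random. This is a sketch; the seed graph (a clique of m + 1 nodes) and the function name are assumptions made for illustration.

```python
import random

def grow_random_network(t_max, m=2, seed=0):
    """Grow a network one node per time tick; each new node attaches
    m edges to existing nodes chosen uniformly at random.
    Returns the degree list, indexed by birth order (0 = oldest)."""
    rng = random.Random(seed)
    # Assumed seed graph: a clique of m + 1 nodes, each with degree m.
    degree = [m] * (m + 1)
    for new in range(m + 1, t_max):
        for target in rng.sample(range(new), m):  # m distinct existing nodes
            degree[target] += 1
        degree.append(m)                          # the new node starts with m edges
    return degree

degree = grow_random_network(5000, m=2)
tenth = len(degree) // 10
oldest = sum(degree[:tenth]) / tenth      # mean degree of the oldest 10% of nodes
youngest = sum(degree[-tenth:]) / tenth   # mean degree of the youngest 10% of nodes
print("oldest 10%:", round(oldest, 2), "youngest 10%:", round(youngest, 2))
```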
Ingredient #2: preferential attachment
● Preferential attachment
○ new nodes prefer to attach to well-connected nodes over less-well connected nodes
● Process also known as:
○ Cumulative advantage
○ Rich-get-richer
○ Matthew effect
Price's preferential attachment model for citation networks
● [Price 65]
○ each new paper is generated with m citations (mean)
○ new papers cite previous papers with probability proportional to their indegree (citations)
○ what about papers without any citations?
■ each paper is considered to have a “default” citation
■ probability of citing a paper with degree k, proportional to k+1
● Power law with exponent α = 2+1/m
Cumulative advantage: how?
● Copying mechanism
● Visibility
Barabasi-Albert model
● First used to describe skewed degree distribution of the World Wide Web
● Each node connects to other nodes with probability proportional to their
degree
○ the process starts with some initial subgraph
○ each new node comes in with m edges
○ probability of connecting to node i: $\Pi(i) = k_i / \sum_j k_j$
● Results in power-law with exponent α = 3
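A minimal sketch of the process described above. It relies on a common trick: keep a list in which every node appears once per edge endpoint, so that a uniform draw from the list picks a node with probability proportional to its degree. The seed graph (a clique of m + 1 nodes), the rule that a new node's m targets are distinct, and the function name are assumptions made for illustration.

```python
import random

def barabasi_albert_degrees(n, m=2, seed=0):
    """Degrees of a network grown by preferential attachment:
    each new node attaches m edges to distinct existing nodes,
    chosen with probability proportional to their current degree."""
    rng = random.Random(seed)
    degree = [m] * (m + 1) + [0] * (n - m - 1)
    # Assumed seed graph: a clique of m + 1 nodes.
    endpoints = []
    for i in range(m + 1):
        for j in range(i + 1, m + 1):
            endpoints += [i, j]
    for new in range(m + 1, n):
        targets = set()
        while len(targets) < m:
            # Uniform draw over edge endpoints = degree-proportional draw over nodes.
            targets.add(rng.choice(endpoints))
        for t in targets:
            degree[t] += 1
            endpoints += [t, new]
        degree[new] = m
    return degree

degrees = barabasi_albert_degrees(10000, m=2)
print("average degree:", round(sum(degrees) / len(degrees), 2))
print("five largest degrees:", sorted(degrees, reverse=True)[:5])
```

Unlike the random-growth case, a few early nodes end up with degrees far above the average, which is the hub structure that a power-law tail implies.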
Random Vs Preferential
Properties of the BA graph
● The distribution is scale free with exponent α = 3
$P(k) = 2m^2 / k^3$
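As a quick consistency check, the prefactor 2m² is what makes this distribution normalized. The calculation below is a sketch that assumes a continuous degree variable and a minimum degree of m (every node arrives with m edges).

```latex
\int_m^{\infty} \frac{2m^2}{k^3}\,dk
  = 2m^2 \left[ -\frac{1}{2k^2} \right]_m^{\infty}
  = 2m^2 \cdot \frac{1}{2m^2}
  = 1
```

The value of m changes only the prefactor; the exponent stays at 3.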