Social Network Analysis Engineering Proj
et de la Recherche Scientifique
جامعـة قرطاج
Université de Carthage
Biware Consulting
Academic/University Year
2019-2020
This work was carried out as the second-year engineering internship project at Biware Consulting.
The project concerns social network analysis. First, we reviewed the theoretical aspects of graphs.
Then, we applied social network analysis methods to different social networks. We analyzed a
Facebook network and identified its influencers and bridges. In addition, we analyzed two
over-the-counter bitcoin trading platforms and estimated the trustworthiness of the traders on each
platform.
Acknowledgments
In preparing this report, I received meaningful assistance from many quarters, which I would like to
put on record here with deep gratitude and great pleasure.
First and foremost, I would like to express my sincere gratitude to the co-founder and CEO, Mr. Amine
Boussarsar, for his support and encouragement throughout the internship.
I would also like to thank my supervisor, Ms. Roua Hammami, for her advice and remarks throughout
the internship.
I would also like to thank all my teachers at Tunisia Polytechnic School for their continuous help and
valuable training during my years of study.
Finally, special thanks go to the jury members who honored me by examining and evaluating this
modest contribution.
Summary
b. Robustness of the Algorithm
c. Correlation Analysis with features in the Network
d. Measuring Edge Scores Using Trustworthiness Estimations
7. Conclusion
General Conclusion
References
List of Figures
Glossary Acronyms
General Introduction
I carried out my engineering internship in the R&D department (Biware Solutions) of
Biware Consulting.
Biware Consulting is a Tunisian company created in 2011. It offers services related to
decision making. Its R&D department, Biware Solutions, develops products and solutions based on
data science, artificial intelligence and other emerging technologies.
As my engineering project went through three major steps, I divided the report into three
chapters.
The first chapter introduces some graph theory notions. It describes the important notions,
features, coefficients and algorithms I used in my analysis of different social networks.
The second chapter is a basic application of the notions I discovered during the first weeks
of my internship. I applied what I found in different papers on social network analysis and what I
learned from the courses I enrolled in. Although this first step is not highly complex, it helped me
master graph notions and social network analysis methods, which helped me go further in the later
applications.
Finally, the third chapter covers the analysis of bitcoin trading networks. I applied what I
learned during the bibliography period to the problem of measuring trust in trading networks. In
this chapter I describe an algorithm that was developed from scratch and implemented to estimate the
trustworthiness of each trader, the analysis of this algorithm, the study of its efficiency and
convergence, and the use of the estimated trustworthiness to predict trust scores with machine
learning algorithms.
Chapter 1
1. Introduction
In this first chapter, we introduce some notions that are essential to social network analysis. We
have taken most of the definitions and notions from the course ‘Applied Social Network Analysis’ by
Michigan University [9]. This chapter contains the main highlights of that course.
Weighted network: a network where edges are assigned a (typically numerical) weight. Edges can
have many labels or attributes other than weights. Nodes can also have attributes.
Signed network: a network where edges are assigned a positive or negative sign.
Multigraph: a network where multiple edges can connect the same nodes (parallel edges).
Bipartite graph: a graph whose nodes can be split into two sets L and R such that every edge connects
a node in L with a node in R.
4. Clustering coefficients
Eccentricity of a node n is the largest distance between n and all other nodes.
Radius: the minimum eccentricity in the graph.
Node connectivity: Minimum number of nodes needed to disconnect a graph or pair of nodes.
Edge connectivity: Minimum number of edges needed to disconnect a graph or pair of nodes.
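The eccentricity and radius definitions above can be illustrated with a breadth-first search. The sketch below is only an illustration under stated assumptions: the graph is connected, undirected and unweighted, and it is stored as a plain adjacency dictionary. The function names and the toy path graph are hypothetical and are not taken from the report's own code.

```python
from collections import deque

def bfs_distances(adj, source):
    """Hop distance from source to every reachable node (unweighted BFS)."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist

def eccentricity(adj, n):
    """Largest shortest-path distance between n and any other node."""
    return max(bfs_distances(adj, n).values())

def radius(adj):
    """Minimum eccentricity over all nodes of the graph."""
    return min(eccentricity(adj, n) for n in adj)

# Hypothetical path graph a - b - c - d (assumed connected).
path = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
ecc_a = eccentricity(path, "a")
ecc_b = eccentricity(path, "b")
graph_radius = radius(path)
```

On this path graph the end node has the largest eccentricity, while the two middle nodes attain the radius.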
7. Node Importance
a. Degree Centrality
We assume that important nodes have many connections.
$$C_{deg}(v) = \frac{d_v}{|N| - 1}$$
b. Closeness centrality
We assume that important nodes are close to other nodes.
$$C_{close}(v) = \frac{|N| - 1}{\sum_{u \in N \setminus \{v\}} d(v, u)}$$
where $N$ is the set of nodes in the network and $d(v, u)$ is the length of the shortest path from $v$ to $u$.
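As a rough illustration of the degree and closeness formulas above, the following sketch computes both on a small star graph. It assumes a connected, undirected, unweighted graph with at least two nodes, stored as an adjacency dictionary; the function names and the toy graph are hypothetical.

```python
from collections import deque

def degree_centrality(adj):
    """d_v / (|N| - 1) for every node v."""
    n = len(adj)
    return {v: len(adj[v]) / (n - 1) for v in adj}

def closeness_centrality(adj):
    """(|N| - 1) divided by the sum of shortest-path distances from v."""
    n = len(adj)
    scores = {}
    for v in adj:
        # Breadth-first search gives d(v, u) for every reachable u.
        dist = {v: 0}
        queue = deque([v])
        while queue:
            u = queue.popleft()
            for w in adj[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    queue.append(w)
        scores[v] = (n - 1) / sum(dist.values())
    return scores

# Hypothetical star graph: node 0 is connected to four leaves.
star = {0: [1, 2, 3, 4], 1: [0], 2: [0], 3: [0], 4: [0]}
deg = degree_centrality(star)
close = closeness_centrality(star)
```

In the star, the center reaches the maximum value of 1 for both measures, while each leaf has degree centrality 1/4 and closeness centrality 4/7.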
c. Betweenness centrality
We assume that important nodes connect other nodes.
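$$C_{btw}(v) = \sum_{s, t \in N} \frac{\sigma_{s,t}(v)}{\sigma_{s,t}}$$
where $\sigma_{s,t}$ is the number of shortest paths between nodes $s$ and $t$, and $\sigma_{s,t}(v)$ is the
number of those shortest paths that pass through node $v$.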
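One possible way to evaluate the betweenness sum is Brandes' accumulation scheme, sketched below. This is an assumption about how the quantity could be computed, not a description of the code used in the project. The sketch counts each unordered pair {s, t} once, excludes the endpoints, applies no normalization, and assumes an undirected, unweighted graph stored as an adjacency dictionary.

```python
from collections import deque

def betweenness_centrality(adj):
    """Unnormalized betweenness, each unordered pair {s, t} counted once."""
    score = {v: 0.0 for v in adj}
    for s in adj:
        # BFS from s: count shortest paths (sigma) and record predecessors.
        sigma = {v: 0 for v in adj}
        sigma[s] = 1
        dist = {s: 0}
        preds = {v: [] for v in adj}
        order = []
        queue = deque([s])
        while queue:
            u = queue.popleft()
            order.append(u)
            for w in adj[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    queue.append(w)
                if dist[w] == dist[u] + 1:
                    sigma[w] += sigma[u]
                    preds[w].append(u)
        # Walk back from the farthest nodes, accumulating dependencies.
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):
            for u in preds[w]:
                delta[u] += sigma[u] / sigma[w] * (1 + delta[w])
            if w != s:
                score[w] += delta[w]
    # Every pair was visited from both endpoints, so halve the totals.
    return {v: b / 2 for v, b in score.items()}

# Hypothetical toy graphs.
star = {0: [1, 2, 3, 4], 1: [0], 2: [0], 3: [0], 4: [0]}
square = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}
btw_star = betweenness_centrality(star)
btw_square = betweenness_centrality(square)
```

In the star, all six leaf pairs route through the center. In the square, each pair of opposite corners has two shortest paths, so every node carries half a path.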
8. Popular algorithms
a. PageRank
PageRank was developed by the founders of Google to measure the importance of webpages from the
structure of the hyperlink network. It assigns an importance score to each node.
Important nodes are those with many in-links from important pages. PageRank can be used for any type
of network, but it is mainly useful for directed networks.
The Algorithm:
1. Assign all nodes a PageRank of 1/𝑛
2. Perform the Basic PageRank Update Rule k times.
Basic PageRank Update Rule: Each node gives an equal share of its current PageRank to all the nodes
it links to. The new PageRank of each node is the sum of all the PageRank it received from other
nodes.
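The Basic PageRank Update Rule described above can be sketched as follows. This is a minimal illustration that assumes every node has at least one out-link (otherwise its share is undefined) and that every link target is also a key of the dictionary. The function name and the three-page example are hypothetical.

```python
def basic_pagerank(out_links, k):
    """Apply the basic update rule k times, starting from 1/n per node."""
    nodes = list(out_links)
    rank = {v: 1 / len(nodes) for v in nodes}
    for _ in range(k):
        new_rank = {v: 0.0 for v in nodes}
        for v in nodes:
            # Each node splits its current PageRank equally among its out-links.
            share = rank[v] / len(out_links[v])
            for w in out_links[v]:
                new_rank[w] += share
        rank = new_rank
    return rank

# Hypothetical directed network: A -> B, A -> C, B -> C, C -> A.
web = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
after_one = basic_pagerank(web, 1)
```

After one update, C holds 1/2 (half of A's share plus all of B's), A holds 1/3 and B holds 1/6, and the scores still sum to 1.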
b. HITS Algorithm
The HITS algorithm runs k iterations to assign an authority score and a hub score to each node.
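The report does not spell out the HITS update rules, so the sketch below follows one common formulation and should be read as an assumption: every score starts at 1, a node's new authority score is the sum of the hub scores of the nodes pointing to it, its new hub score is the sum of the authority scores of the nodes it points to (both taken from the previous iteration), and each set of scores is then normalized to sum to 1. All names are hypothetical.

```python
def hits(out_links, k):
    """Run k HITS iterations; returns (authority, hub) score dictionaries."""
    nodes = list(out_links)
    in_links = {v: [] for v in nodes}
    for u in nodes:
        for w in out_links[u]:
            in_links[w].append(u)
    auth = {v: 1.0 for v in nodes}
    hub = {v: 1.0 for v in nodes}
    for _ in range(k):
        # Both updates read the scores of the previous iteration (assumed variant).
        new_auth = {v: sum(hub[u] for u in in_links[v]) for v in nodes}
        new_hub = {v: sum(auth[w] for w in out_links[v]) for v in nodes}
        # Normalize; fall back to 1.0 if a total is zero (graph without edges).
        auth_total = sum(new_auth.values()) or 1.0
        hub_total = sum(new_hub.values()) or 1.0
        auth = {v: a / auth_total for v, a in new_auth.items()}
        hub = {v: h / hub_total for v, h in new_hub.items()}
    return auth, hub

# Hypothetical network: A and B both point to C.
links = {"A": ["C"], "B": ["C"], "C": []}
auth, hub = hits(links, 3)
```

Here C ends up as the only authority, and A and B share the hub score equally.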
Chapter 2
1. Introduction
Identifying influencers is highly useful and has many applications. Information spreads faster and
more efficiently when we target the influencers in a network. For example, a company that wants to
run an advertising campaign, rather than sending messages to the whole network (which is very
costly), is better off detecting the top influencers and bridges in the network, targeting only
them, and letting them spread the information.
Identifying influencers can also have serious consequences, such as controlling groups by
manipulating the minds of the targeted influencers. For example, within a company itself, especially
a big one, algorithms can be run on employees’ connection data to detect the biggest influencers and
control their influence on other employees.
In this part of the project we used a Facebook social network. The network contains only edges
(friendships) between nodes (persons). The only detail we know about a person ‘x’ is his or her list
of friends. We have no details about the intensity of connections or the frequency of contact
between persons, so detecting the influencers in the network was based only on friendship
connections. We used many classical social network analysis metrics, and developed some of our own,
to quantify the centrality, popularity and influence of each node. We developed Python functions to
calculate these metrics and to extract the top nodes for each metric.
     Node 1   Node 2
0    236      186
1    236      84
2    236      62
3    236      142
4    236      252
Figure 1: 5 lines of the initial data
The network has 84243 connections (edges) and 3959 persons (nodes).
In the analysis, we transform this initial table into a graph G. Each person is represented by a
node, where U is the set of nodes, and each connection between two persons is represented by an
edge, where E is the set of edges. So, G = G(U, E).
As the connections in this network are friendships, the graph is undirected. As we have no data
other than connections, the network is unweighted and unsigned. The problem is therefore modeled as
an Undirected Unweighted Unsigned Social Network.
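The transformation from the edge table to the graph G can be sketched with a plain adjacency structure. The snippet uses only the five rows shown in Figure 1; the function name and the choice of a dictionary of sets are assumptions, not the project's actual code.

```python
from collections import defaultdict

def build_graph(edges):
    """Undirected, unweighted, unsigned graph as a dictionary of neighbor sets."""
    adj = defaultdict(set)
    for u, v in edges:
        # A friendship is symmetric, so store it in both directions.
        adj[u].add(v)
        adj[v].add(u)
    return adj

# The five (Node 1, Node 2) rows shown in Figure 1.
sample_edges = [(236, 186), (236, 84), (236, 62), (236, 142), (236, 252)]
G = build_graph(sample_edges)
node_count = len(G)
edge_count = sum(len(neighbors) for neighbors in G.values()) // 2
```

Using sets means that a friendship listed twice in the table is stored only once.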
a. Degree Centrality
Figure 2: Red dots in the network have the highest degree centrality scores
(Their degree centralities belong to the highest 10% of the scores)
The chosen nodes are the top 10% of nodes by number of connections. This is a good first estimate of
their influence in the network, but it is not sufficient on its own to draw a conclusion.
b. Closeness Centrality
Figure 3: The red dots in the network have the highest closeness centrality scores
(Their closeness centralities belong to the highest 10% of the scores)
c. Betweenness Centrality
Figure 4: Red dots in the network have the highest betweenness centrality scores
4. Influencers and Bridges
a. Bridges
Figure 5: Red dots are the bridges in this Facebook social network
b. Influencers
We defined the influencers as those who have high degree centrality, closeness centrality and
betweenness centrality. In other words, influence is the intersection of the three notions. Being an
influencer means being highly connected to the network (having a high number of friends), being
close to most of the network so as to have an impact on most of it, and lying on many optimal
connections (shortest paths) so as to have an impact on those paths.
$$I = L_1 \cap L_2 \cap L_3$$
Where:
$I$: List of influencers
$L_1$: List of nodes having the highest n% degree centrality
$L_2$: List of nodes having the highest n% closeness centrality
$L_3$: List of nodes having the highest n% betweenness centrality
‘n%’ is defined by the number of influencers we want to extract from the network.
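The intersection of the three top-n% lists can be sketched as below. The centrality scores are made-up toy values, the function names are hypothetical, and ties at the cut-off are broken arbitrarily by the sort order, which is an assumption the report does not address.

```python
import math

def top_percent(scores, n_percent):
    """Set of nodes whose score is in the highest n% (at least one node)."""
    count = max(1, math.ceil(len(scores) * n_percent / 100))
    ranked = sorted(scores, key=scores.get, reverse=True)
    return set(ranked[:count])

def influencers(degree, closeness, betweenness, n_percent):
    """Nodes that are in the top n% for all three centralities."""
    return (top_percent(degree, n_percent)
            & top_percent(closeness, n_percent)
            & top_percent(betweenness, n_percent))

# Made-up centrality scores for five nodes (illustration only).
degree = {"a": 0.9, "b": 0.8, "c": 0.3, "d": 0.2, "e": 0.1}
closeness = {"a": 0.7, "b": 0.4, "c": 0.8, "d": 0.3, "e": 0.2}
betweenness = {"a": 0.6, "b": 0.1, "c": 0.5, "d": 0.0, "e": 0.0}
found = influencers(degree, closeness, betweenness, 40)
```

With n = 40 each list keeps two of the five nodes, and only node "a" appears in all three.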
Figure 6: Red dots are the influencers and green dots are the bridges
c. Conclusion
It is possible to quantify many social notions, such as popularity, influence, closeness and
connectivity, in social networks. It is also possible to extract the popular people in a network by
considering only their connections.
Chapter 3
1. Introduction
Trust is essential to any transaction: without it, we cannot judge how risky the transaction is.
That is why it is important to quantify the trustworthiness of the institution or person we are
dealing with. This paper shows an example in which social network analysis is used to estimate the
trustworthiness of individuals in a network. The proposed algorithm approximates people’s
trustworthiness by considering several features of the network. Beyond measuring trust on trading
platforms, the algorithm can be useful in other contexts, for example to estimate people’s skills
(problem-solving, social, technical, managerial, …) in social networks.
We worked on data from two trading platforms: bitcoin-OTC [10] and BTC-Alpha [11]. Previous
literature has used these two datasets [1,6]; however, my work approaches the topic differently.
Both platforms are places to trade currencies and goods. Trading takes place directly between
counterparties, without the intervention of the platform. It is therefore everyone’s responsibility
to act prudently and to choose whom to trust and whom to avoid. Traders on these risky platforms
must guard against fraudulent users, so trust is a crucial element of every trade. As discussed by
Lewis and Weigert [2], trust has three dimensions: cognitive, emotional and behavioral. In fact,
humans need only 100 ms of exposure to a face to extract the information needed to make a
trustworthiness judgement [3]. However, in this kind of online trading we cannot consider the
emotional dimension of trust, as we are not in real contact with the other traders. The two
remaining dimensions are the cognitive and the behavioral. According to Lewis and Weigert, ‘When
faced by the totally unknown, we can gamble but we cannot trust.’ [2].
Fortunately, these platforms offer some data on each user: each user can give a trust score to
another user. Each trader has thus received trust scores from some of the traders with whom he or
she has made a transaction, and has scored some of the other traders. Trust scores range from -10 to
+10; the rating guidelines are described in the bitcoin-OTC platform guide [4]. The platform then
calculates the mean of the scores each trader has received and ranks traders by this mean. In my
view, this scoring and rating system is a good way to obtain at least an estimate of the
trustworthiness of other traders.
In this paper, we discuss the possible frauds on over-the-counter platforms and the risks of fraud
in this rating system. We propose an algorithm that can simplify traders’ lives by estimating a
trustworthiness score for each trader and ranking traders on the platform rigorously and
continuously. The proposed score differs from the plain mean of the received trust scores, and we
assume that the output of the algorithm is highly representative of the traders’ trustworthiness.
We used the algorithm’s output to predict the scores traders gave to other traders, and we measured
the correlations between the estimated trustworthiness scores and other SNA features.
Figure 7: BTC-Alpha network
Figure 8: Bitcoin-OTC network
Both networks are weighted, signed and directed. The Bitcoin-OTC network has 5881 nodes and
35592 edges, while the BTC-Alpha network has 3783 nodes and 24186 edges.
5. Trustworthiness Algorithm
Definitions of the Hyperparameters:
α: the percentage taken from the received score from non-popular nodes
The Algorithm:
Step I: Initialization
1) Initialize the trust score of every node to 0.
2) For the nodes whose in-degree (the number of edges coming into a vertex in a directed
graph) is greater than or equal to β, calculate the mean of all the received values. (We suppose
that below an in-degree of β, the mean of the received trust values is biased.)
For nodes in $U_1$:
$$\text{nodeTrustworthiness} = \frac{\sum_{U} \text{receivedtrust}}{\text{number of received trust scores}}$$
where $U_1$ is the set of nodes whose in-degree is greater than or equal to β.
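A minimal sketch of Step I, assuming the ratings are available as (rater, ratee, score) triples. The function name, the data layout and the toy ratings are hypothetical; the threshold uses "greater than or equal to β", following the definition of U1 given above.

```python
from collections import defaultdict

def step_one(ratings, beta):
    """Initialize trust to 0, then set it to the mean received score for U1."""
    received = defaultdict(list)
    nodes = set()
    for rater, ratee, score in ratings:
        received[ratee].append(score)
        nodes.update((rater, ratee))
    trust = {v: 0.0 for v in nodes}
    # U1: nodes whose in-degree (number of received ratings) is at least beta.
    u1 = {v for v in nodes if len(received[v]) >= beta}
    for v in u1:
        trust[v] = sum(received[v]) / len(received[v])
    return trust, u1

# Toy ratings (rater, ratee, score); values are made up.
toy_ratings = [("a", "c", 8), ("b", "c", 4), ("c", "d", -2)]
trust, u1 = step_one(toy_ratings, beta=2)
```

Only node "c" receives at least two ratings, so it is the only node whose score moves away from 0; node "d" keeps 0 even though it received a negative rating.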
Comments:
The choice of β should depend on many features, including the data itself, because the
higher the value of β, the more 0 values we get as output of the first step.
Minimum % of 0 values we get as output of step 1:
β    bitcoin-OTC    BTC-Alpha
2    59.37          57.02
3    68.98          66.98
4    74.68          72.82
5    78.67          76.97
6    81.89          80.25
Figure 9: Percentage of traders whose mean received score is not calculated, as a function of β,
for both platforms.
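The percentages in Figure 9 can be read as the share of nodes whose in-degree is below β, since those nodes keep a score of 0 after step 1. This reading is an assumption, as are the function name and the made-up in-degree list below. It is a minimum because nodes in U1 whose ratings happen to average 0 would add to the count.

```python
def percent_zero_after_step_one(in_degrees, beta):
    """Share (in %) of nodes whose in-degree is below beta."""
    below = sum(1 for d in in_degrees if d < beta)
    return 100 * below / len(in_degrees)

# Made-up in-degrees for ten nodes (illustration only).
toy_in_degrees = [0, 1, 1, 2, 3, 5, 8, 0, 2, 4]
pct_beta_2 = percent_zero_after_step_one(toy_in_degrees, 2)
pct_beta_3 = percent_zero_after_step_one(toy_in_degrees, 3)
```

As in the table, raising β can only increase this percentage.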
The hyperparameter β is highly important because it addresses the problem discussed earlier in
Section (III) of this paper. People who create new profiles and send scores to themselves are more
affected by a higher threshold β: they are obliged to create a larger number of profiles.
i = 1;
While i <= N:
    $U_2$ = set of nodes having strictly positive scores;
    For node in $U_1$:
        $\text{nodeTrustworthiness} = \frac{\sum_{U_2} \text{receivedtrust}}{\text{number of received trust scores from } U_2}$;
    i = i + 1;
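The loop above can be sketched in Python as follows. Two details are not specified in the text and are assumptions here: a node of U1 that receives no rating from U2 is given a score of 0, and all nodes are updated synchronously from the scores of the previous iteration. The ratings, initial scores and names are made up for illustration.

```python
def step_two(ratings, trust, u1, n_iterations):
    """Recompute trust of U1 nodes using only ratings sent by positive nodes."""
    trust = dict(trust)
    for _ in range(n_iterations):
        # U2: nodes with a strictly positive score at the start of the iteration.
        u2 = {v for v, t in trust.items() if t > 0}
        updated = dict(trust)
        for node in u1:
            scores = [s for rater, ratee, s in ratings
                      if ratee == node and rater in u2]
            # Assumption: no rating from U2 means a score of 0.
            updated[node] = sum(scores) / len(scores) if scores else 0.0
        trust = updated
    return trust

# Toy ratings (rater, ratee, score) and the matching Step I output for beta = 2.
toy_ratings = [("a", "c", 6), ("b", "c", -4), ("c", "a", 5),
               ("d", "a", 3), ("c", "b", -8), ("d", "b", -6)]
initial = {"a": 4.0, "b": -7.0, "c": 1.0, "d": 0.0}
final = step_two(toy_ratings, initial, {"a", "b", "c"}, n_iterations=3)
```

In this toy case the votes of "b" (negative score) and "d" (score 0, too few received ratings) are ignored, and the scores stop changing after the first iteration.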
Comments:
As tested on both networks (bitcoin-OTC and BTC-Alpha), this step converges in fewer than
5 iterations; see Section (VI. 1).
The algorithm could instead run until convergence, since it converges on both platforms, as shown
in Section (VI. 1). However, on other kinds of data convergence cannot be guaranteed, which is why
we preferred to use the hyperparameter N.
This step is highly important, and it solves some of the problems discussed in Section (III). By
letting only people with positive scores vote, we eliminate the votes of untrustworthy traders,
which matters because untrustworthy people tend to give false trustworthiness scores to fellow
traders. We believe this iterative idea is important for rigorously classifying people as
trustworthy, untrustworthy or neutral.
2) Make a list L of the top-ranked nodes (top γ % popularity scores of the nodes).
For node in $U_1$:
$$\text{nodeTrustworthiness} = \frac{\sum_{L \cap U_2} \text{receivedtrust} + \sum_{U_2 \setminus L} \alpha \times \text{receivedtrust}}{\text{number of received trust scores from } L \cap U_2 + \alpha \times (\text{number of received trust scores from } U_2 \setminus L)}$$
where $U_1$ is the set of nodes whose in-degree is greater than or equal to β,
and $U_2$ is the set of nodes that have positive trustworthiness scores calculated in step II.
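The weighted mean above can be sketched as follows. The list L is passed in as a ready-made set because the popularity score used to build it is not defined in this section; that parameter, the function name, the choice of 0 when no rating from U2 exists, and the toy values are all assumptions.

```python
def step_three(ratings, trust, u1, popular, alpha):
    """Weighted mean of ratings from U2: weight 1 for raters in L, alpha otherwise."""
    u2 = {v for v, t in trust.items() if t > 0}
    final = dict(trust)
    for node in u1:
        weighted_sum = 0.0
        weight_total = 0.0
        for rater, ratee, score in ratings:
            if ratee != node or rater not in u2:
                continue
            weight = 1.0 if rater in popular else alpha
            weighted_sum += weight * score
            weight_total += weight
        # Assumption: a node with no rating from U2 gets a score of 0.
        final[node] = weighted_sum / weight_total if weight_total else 0.0
    return final

# Toy data: "p" is in L (popular), "q" is not; both have positive trust.
toy_ratings = [("p", "x", 8), ("q", "x", 2)]
toy_trust = {"p": 3.0, "q": 2.0, "x": 1.0}
result = step_three(toy_ratings, toy_trust, u1={"x"}, popular={"p"}, alpha=0.5)
```

With α = 0.5 the score of "x" is (8 + 0.5 × 2) / (1 + 0.5) = 6, closer to the popular trader's rating than the plain mean of 5.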
Comments:
This last step takes the user’s social connections into account. Receiving a trust score from a
popular trader on the platform should carry a different weight than receiving one from a new
trader. This step uses the output of the first two steps and keeps their spirit: it considers only
the scores received from positive people, applies the same threshold β, and adds the γ
consideration.
The choice of α is not obvious. We intuitively take α = 0.70 to give more weight to the popular
nodes, without eliminating the votes of non-popular nodes.
Like the choice of α, the choice of γ is not obvious. However, from Figure 9 in the comments of
step 1, we know that only approximately 30% of the nodes have more than 3 received edges, which is
why we took γ = 0.30.
PageRank (PR) is an algorithm used by Google Search to rank web pages in its search engine
results. It is a way of measuring the importance of website pages, but it can also be used to
measure the importance of nodes in other networks. [7,9]
Hyperlink-Induced Topic Search (HITS; also known as hubs and authorities) is another algorithm that
rates web pages, developed by Jon Kleinberg. It gives two rankings of the nodes (a hubs ranking
and an authorities ranking). [8,9]
Step 2 is iterative and depends on the number of iterations N. In this case, we found that it needs
fewer than 5 iterations to converge for both networks (bitcoin-OTC and BTC-Alpha).
Convergence is reached when the error (denoted ε) becomes zero, which means that nothing will
change even if we keep iterating.
Mathematical definition of the error, where U is the set of nodes:
$$\varepsilon = \sum_{U} |\text{previoustrustworthiness} - \text{newtrustworthiness}|$$
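The error can be computed directly from two consecutive score dictionaries. The sketch below assumes both dictionaries cover the same set of nodes U; the function name and the toy values are hypothetical.

```python
def convergence_error(previous, new):
    """Sum over all nodes of |previous trustworthiness - new trustworthiness|."""
    return sum(abs(previous[v] - new[v]) for v in previous)

# Made-up scores before and after one iteration of step 2.
before = {"a": 4.0, "b": -7.0, "c": 1.0}
after = {"a": 5.0, "b": -8.0, "c": 6.0}
error_first = convergence_error(before, after)
error_stable = convergence_error(after, after)
```

An error of zero between two consecutive iterations means the iteration has reached a fixed point, which is the stopping condition described above.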
Figure 11: Convergence of the second step for BTC-Alpha platform
Figure 12: Convergence of the second step for Bitcoin-OTC platform
$$\zeta = \sum_{U} |\text{previoustrustworthiness} - \text{newtrustworthiness}|$$
Figure 13: Variation of error for BTC-Alpha platform
Figure 14: Variation of error for Bitcoin-OTC platform
Both graphs show the variation of ζ as a function of the percentage of edges used to calculate the
trustworthiness scores.
Feature                             Correlation with Trustworthiness
Trustworthiness                     1
Mean of all received trust scores   0.53
Hub score                           0.18
Out degree                          0.15
Mean of sent trust scores           0.15
In degree                           0.14
PageRank                            0.13
Auth score                          0.12
Figure 15: The correlation between some features and the Trustworthiness score estimated by the
algorithm.
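The report does not state which correlation coefficient was used. Assuming the Pearson coefficient, each entry of the table could be computed from two aligned lists of per-node values as in the sketch below. The function name and the toy lists are hypothetical, and the sketch assumes neither list is constant.

```python
import math

def pearson_correlation(xs, ys):
    """Pearson correlation coefficient between two equally long lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / math.sqrt(spread_x * spread_y)

# Made-up per-node values (illustration only).
feature = [1, 2, 3, 4]
perfectly_aligned = [2, 4, 6, 8]
perfectly_opposed = [8, 6, 4, 2]
r_aligned = pearson_correlation(feature, perfectly_aligned)
r_opposed = pearson_correlation(feature, perfectly_opposed)
```

A value near 1, as in the first row of the table, corresponds to a feature that moves exactly with the trustworthiness score; values near 0 indicate little linear relation.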
The mean of the received trust scores is the feature most strongly correlated with the
trustworthiness values. This is logical: the more trustworthy a trader is, the higher the trust
scores he or she gets from other traders, and the higher the mean.
In addition, the more popular a trader is on the platform, the higher his or her trustworthiness
score, because being popular on the platform means having many connections, i.e. trading with many
traders and probably being trustworthy.
d. Measuring Edge Scores Using Trustworthiness Estimations
We used the node trustworthiness scores calculated by the trustworthiness algorithm to predict the
signs of the edges. The more trustworthy a person is, the better the scores he or she gets from
other traders. We obtained interesting results for both networks. We used a logistic regression
model [13] with only two features: the trustworthiness of the trustee and the trustworthiness of
the trustor.
For example, for the bitcoin-OTC platform, we obtained an accuracy of 92%.
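The two-feature dataset described above could be assembled as in the sketch below, before being passed to any logistic regression implementation. The model fitting itself is omitted. The function name, the (trustor, trustee, score) layout, the 1/0 label encoding and the decision to skip zero scores are all assumptions made for illustration.

```python
def edge_sign_dataset(ratings, trust):
    """One row per rated edge: (trustor trust, trustee trust) and the edge sign."""
    features, labels = [], []
    for trustor, trustee, score in ratings:
        if score == 0:
            # Assumption: a zero score has no sign, so it is skipped.
            continue
        features.append((trust[trustor], trust[trustee]))
        labels.append(1 if score > 0 else 0)
    return features, labels

# Made-up ratings and estimated trustworthiness values.
toy_ratings = [("a", "b", 5), ("b", "a", -3), ("a", "c", 0)]
toy_trust = {"a": 2.0, "b": -1.0, "c": 0.0}
X, y = edge_sign_dataset(toy_ratings, toy_trust)
```

Each row pairs the estimated trustworthiness of the rater and of the rated trader with the sign of the score that was actually given.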
7. Conclusion
This algorithm can improve the user experience on trading platforms by labeling each trader with a
score that summarizes his or her trustworthiness. It helps traders tell trustworthy traders from
untrustworthy ones. It also gives people the opportunity to trade even though they have received
some negative trust scores from spammers. In addition, the algorithm is flexible: it depends on the
proposed hyperparameters α, β, γ, and N, whose choice can depend on the data and the context, but
also on the vision of the user of the algorithm. Finally, the algorithm can be used in different
domains, not only for the measurement of trustworthiness.
General Conclusion
This internship offered me the chance to dive into graph theory and expand my
knowledge of different social network analysis methods.
I followed different online courses on graph analysis, including ‘Applied Social
Networks Analysis using Python’ offered by Michigan University, which helped me gain a deep
understanding of social network analysis methods and their potential applications.
At the same time, I read different articles and papers on graph theory,
social network analysis, machine learning, the sociology of trust, over-the-counter trading and the
bitcoin currency.
I also implemented different solutions and analyses on different social
graphs, which helped me master graph manipulation and develop my programming skills in
Python.
Furthermore, I had the chance to write a paper about estimating traders’ trustworthiness, and I
am looking forward to publishing it.
Attending meetings twice a week, working in an open space, communicating with engineers and
researchers, and discussing applications of my project and other related projects helped me develop
my understanding of different topics and, most importantly, my professional skills.
Finally, I enjoyed networking with engineers, reading papers, discovering new notions,
analyzing social graphs, solving real problems, bringing innovative solutions and summarizing
all this in a scientific paper.
References
[1] S. Kumar, F. Spezzano, V.S. Subrahmanian, C. Faloutsos. Edge Weight Prediction in Weighted
Signed Networks. IEEE International Conference on Data Mining (ICDM), 2016.
[2] J. David Lewis (Portland, Oregon), Andrew Weigert (University of Notre Dame). Trust as a Social
Reality.
[3] Alexander Todorov, Manish Pakrashi, Nikolaas N. Oosterhof (Princeton University). Evaluating
Faces on Trustworthiness After Minimal Time Exposure.
[4] https://wiki.bitcoin-otc.com/wiki/OTC_Rating_System
[5] https://snap.stanford.edu/data/soc-sign-bitcoin-otc.html
[6] S. Kumar, B. Hooi, D. Makhija, M. Kumar, V.S. Subrahmanian, C. Faloutsos. REV2:
Fraudulent User Prediction in Rating Platforms. 11th ACM International Conference on Web
Search and Data Mining (WSDM), 2018.
[7] Ian Rogers (IPR Computing). The Google PageRank Algorithm and How It Works.
[8] https://en.wikipedia.org/wiki/HITS_algorithm
[9] Applied Social Networks Analysis, Michigan University Online Course, Coursera.
[10] https://www.bitcoin-otc.com/
[11] https://btc-alpha.com/en/exchange/BTC_USD
[12] Jean-Claude Fernandez, Laurent Mounier, Cyril Pachon. A Model-Based Approach for Robustness
Testing.
[13] https://en.wikipedia.org/wiki/Logistic_regression
[14] Counterparty Risk. Investopedia, Risk Management. Reviewed by James Chen and Chris B. Murphy.
[15] https://wiki.bitcoin-otc.com/wiki/Using_bitcoin-otc#Risk_of_fraud
[16] M. Z. Shafiq, M. U. Ilyas, A. X. Liu, H. Radha. Identifying Leaders and Followers. IEEE Journal
on Selected Areas in Communications, Volume 31, Issue 9, 2013.
[17] Noel M. Tichy, Michael L. Tushman, Charles Fombrun. Social Network Analysis for Organizations.
The Academy of Management Review, Volume 4, Issue 4, 1979. doi:10.2307/257851.
[18] Visualising My Facebook Network Clusters. Towards Data Science.
[19] Homophily. The New York Times.
[20] Homophily in online dating, 2005.
[21] https://en.wikipedia.org/wiki/Social_network_analysis