Implementation and Performance of an SDN Cluster Controller Based on the OpenDayLight Framework
Author:
Esteban Hernandez
Supervisor:
Prof. Maier Guido Alberto
April 2016
Contents
1. Introduction
   1.1. Outline
2. Software Defined Networking
   2.1. SDN theory
   2.2. SDN definition
   2.3. OpenFlow
   2.4. Distributed Controllers
3. OpenDayLight (ODL)
   3.1. Architecture
   3.2. Services in ODL
4. Distributed systems
   4.1. Cluster computing systems
   4.2. Load balancing
   4.3. High availability
5. Raft, consensus algorithm
   5.1. Leader election
   5.2. Log replication
   5.3. Safety
      5.3.1. Election Restriction
      5.3.2. Committing Entries from Previous Terms
   5.4. Summary (Raft)
      5.4.1. Journal Replication
      5.4.2. Snapshot Replication
      5.4.3. Durability/Recovery
6. Akka
   6.1. Membership
   6.2. Failure detector
   6.3. Seed nodes
   6.4. Membership lifecycle
   6.5. Joining to seed nodes
   6.6. Leaving
   6.7. Node roles
   6.8. Persistence module
   6.9. Snapshots
7. Gossip protocol
8. Clustering in OpenDayLight
   8.1. How is this built on?
   8.2. Data synchronization
   8.3. Communication
   8.4. Data Distribution
   8.5. High Availability (HA)
   8.6. Data Store Flows
   8.7. Startup
9. Set Up and Testing of the Cluster
   9.1. Testing
10. Log Analysis
11. Messages between Controllers
   11.1. Capture
      11.1.1. Akka tool part
      11.1.2. Raft algorithm part
12. Bandwidth Usage Analysis
   12.1. Experimentation Methodology
   12.2. Mininet
   12.3. Data capture and processing
   12.4. Modeling
13. Conclusions
   13.1. Future work
1. Introduction
Over the last decade the world has witnessed Internet traffic growing at a very fast pace, driven by factors such as more Internet users, the proliferation of devices and connections, video services, mobility momentum and many others [1]. This has resulted in the growth of large data centers built to process all of this information. As a consequence, computational and storage demands have increased tremendously, making networks correspondingly more complex.
The fundamental idea behind SDN is a network architecture in which the Control Plane is separated from the Data Plane [2]; this opens up possibilities that did not exist before, because SDN makes networks programmable, allowing network administrators to customize the network to satisfy their needs. In the SDN paradigm there is a central software program called the controller, which handles the behavior of the entire network: it becomes the brain of the network, while network devices become simple forwarding devices.
This new way of managing networks requires a means of communication between the simple forwarding devices and the SDN controllers; the OpenFlow protocol was created for this purpose.
In the basic SDN architecture network devices rely on a single controller, which gives the system the weakness of a single point of failure. This issue can be solved by deploying a cluster of SDN controllers, which improves the scalability, high availability and persistence of the information.
The principal goal of this thesis is to deploy a cluster of OpenDayLight controllers and to analyze different aspects of it: its internal behavior, the messages exchanged between controllers and its bandwidth usage. For this purpose a cluster with three instances of the controller was created, and a Mininet network with a varying topology size was subsequently connected to it. Another objective was to provide a cluster with high availability (HA); to this end two nodes in the cluster act as redundancy in case of failure. In OpenDayLight terms they act as followers, while the other node acts as the leader of the cluster. The leader is in charge of communicating with all the network devices; in case of failure the system selects a new leader, and the network devices are handed over to it.
1.1. Outline
This report is organized in the following way. Chapter 2 gives some background on software defined networking. Chapter 3 contains a brief introduction to the OpenDayLight controller. Chapter 4 describes how a distributed system works. Chapter 5 explains how Raft (the consensus algorithm) works and why it is so important for the deployment of the cluster. Chapter 6 describes Akka, the toolkit used for node discovery. Chapter 7 explains the gossip protocol. Chapter 8 describes the cluster architecture in ODL. Chapter 9 explains the configuration of the cluster and how to check that it is running properly. Chapter 10 analyzes the logs of the system. Chapter 11 shows the different messages used by ODL. Chapter 12 contains the model for the bandwidth usage. Chapter 13 concludes the thesis and discusses future work.
2. Software Defined Networking
The idea behind SDN networks is to pull the intelligence out of the network hardware, something that has already been done in other fields of technology. Networks today still use essentially the same building blocks as in 1999: routers, switches and routing protocols. They are faster, with bigger backplanes, more throughput, QoS and a few other additions, but the intelligence is basically the same.
Networking equipment has remained largely unchanged over the years. In SDN, network equipment becomes dumber, which allows the creation of a management system that makes the whole system more intelligent by taking control of the network architecture. In the past it was mostly a matter of the size of the pipe, in other words speed (e.g., how long it takes to transport something from point A to point B). With the arrival of new applications such as real-time communications (YouTube, Skype, VoIP), what really matters is latency and jitter. Real-time communications are handled with QoS, which prioritizes packets, for example marking a VoIP packet as more important than an FTP packet so that it is sent first through the network.
In the past it was common to say that FTP traffic had lower priority than SIP traffic. Now that many more devices are connected to the network, under some conditions FTP traffic could actually be more important than SIP traffic. The problem with the QoS systems available today is that this information cannot be reconfigured dynamically; it must be programmed statically. This is where SDN plays a very important role: traffic can be modeled and shaped dynamically depending on what is needed, essentially because the control plane and the data plane are separated.
With this separation of the control plane and the data plane, all the networking devices become dumber: they are reduced to forwarding devices, while the control plane is implemented in a centralized controller.
Data plane: where all the switches and routers (forwarding devices) reside, allowing a packet to travel from point A to point B.
Control plane: where a set of management servers resides; these servers communicate with all the forwarding devices and dictate how data should move in the data plane. This behavior can be changed dynamically over time, allowing the entire network to be controlled from a single point. It is achieved by separating the different components of the network infrastructure so that each can be dealt with separately.
Now that the control plane and the data plane are separated, a new form of communication between them is needed. In SDN this role is filled by OpenFlow, a protocol for controlling all the network devices.
The principal characteristics of an SDN network are: 1) the control plane and the data plane are decoupled; 2) forwarding decisions are flow based [3], where a flow means a sequence of packets from one point to another; 3) with the control plane and data plane separated, the control logic is moved to an external entity, which in SDN is called the SDN controller; 4) SDN networks are highly programmable through applications that run on top of the control plane.
All of the above makes the SDN architecture agile, centrally managed, directly programmable and scalable [4].
1) Forwarding Devices (FD): the switches and routers in charge of implementing all the flow rules given by the controller. They are connected to the controller through the southbound interface, using the OpenFlow protocol.
2) Data plane (DP): where all the forwarding devices reside.
3) Control plane (CP): the brain of the network; it is connected to the data plane through the southbound interface and to the applications through the northbound interface.
4) Southbound interface (SI): the way in which the controller communicates with the forwarding devices; the protocol used is OpenFlow.
5) Northbound interface (NI): the interface through which the SDN architecture allows the controller to be programmed and its behavior modified.
2.3. OpenFlow
OpenFlow is a standard communication interface managed by the Open Networking Foundation (ONF) that is used between the control plane and the data plane of an SDN network [2]; it basically allows the configuration of forwarding devices such as switches and routers.
This protocol eliminates the problem of having static network architectures [5]. With OpenFlow it is possible to create a single network control policy that is spread through the entire network, allowing a central controller to remotely manage the forwarding information in all the forwarding devices of the data plane. This is a powerful approach because it makes the network more automated, eliminating the need to configure all the devices and interfaces manually one by one. Another advantage is that it does not differ from one vendor to another, which simplifies the process.
Figure 2-2 - Distributed Controllers [6]
Having multiple controllers running at the same time and working together also improves the network's scalability and persistency, shares the workload and enables operation in high availability mode.
3. OpenDayLight (ODL)
OpenDayLight is a collaborative open source project hosted by The Linux Foundation; it was founded in April 2013 and its first release came in February 2014. It was created with the aim of reducing the well-known "vendor lock-in" and therefore supports more protocols than just OpenFlow.
OpenDayLight is a modular open SDN platform for networks of any size and scale, enabling network services across a spectrum of hardware in multivendor environments. Its microservices architecture allows users to control applications, protocols and plugins, as well as to provide connections between external consumers and providers [7].
The controller is very adaptable: it enables multiple services and protocols to be combined to solve different problems.
The fact that ODL is open source has been key to its rapid growth, making it possible for programmers around the world to contribute software to this management system [8].
3.1. Architecture
The OpenDayLight controller has 3 different layers: a top layer, a middle layer and a bottom layer, as shown in Figure 3-1.
Top layer: here the northbound interface provides controller services and common REST APIs, which help with managing the configuration of the network infrastructure.
Middle layer: here the controller communicates with the underlying network infrastructure with the help of the southbound plug-ins. This layer is in charge of providing basic networking services, including the topology manager and the switch manager.
Bottom layer: here all the protocols for managing and controlling the underlying networking infrastructure reside. It contains the plug-ins that implement the various networking protocols which communicate directly with the hardware; this is where the OpenFlow protocol resides.
3.2. Services in ODL
Statistics manager: in charge of collecting all the statistics information, which is done by sending statistics requests to the nodes and storing the responses. The statistics manager also exposes information about nodes, flows, tables and group statistics through the northbound APIs.
Switch manager: provides information about switches and ports; it can also expose this information through the northbound APIs.
Inventory manager: guarantees that the inventory database is always as up to date as possible. It queries and updates information about switches and ports managed by OpenDayLight.
4. Distributed systems
What is a distributed system? For most everyday tasks a single computer is actually enough, but in very large scale projects, for example 3D graphics or video rendering, or at an even larger scale, for example when a researcher is trying to solve a complicated scientific problem, the processing power of a single computer may not be sufficient.
A single computer may simply be too slow to solve a large problem, and that is where distributed computing comes in. The idea is pretty simple: a large, complex task is chopped up into little pieces and the workload is distributed over a large number of computers, so that each computer only needs to work on a small job. All the computers work in unison and, as a result, a solution is obtained in far less time than with a single machine.
The same idea applies to the main subject of this thesis, where very large scale services receive lots of requests from people trying to access them. Serving that number of people at the same time with only one machine is basically impossible; no single computer can serve everyone at once.
Figure 4-2 – Distributed System
The most suitable tasks for distributed systems are parallelizable tasks: they may require a large number of complicated operations, but many of these operations can take place independently of each other. Each sub-task is distributed and, since no sub-task relies on the result of another, all of them can be executed at the same time without regard for the others.
The way this is done is simple: there is a host computer as well as an array of computers that help with the distributed computation. The host computer is where the task is set up and the main program runs; it defines all the little jobs and distributes them to the rest of the computers. Each computer then processes its little task and sends back the result. The host computer collects the results of the individual tasks and puts them back together to generate the final result.
In this project the main focus is the cluster computing system architecture.
4.1. Cluster computing systems
In this kind of architecture a collection of computers (also called nodes) is linked together through a network [10]; this enables the computers to coordinate their activities and to share the resources of the system. Users typically perceive this architecture as a single system.
These systems are very efficient thanks to their scalability and fault tolerance. They can easily accommodate more users or respond faster to requests simply by adding more nodes to handle the extra load. They also avoid the problem of having a single point of failure, which is achieved by adding a good recovery and redundancy system.
Cluster mechanisms allow two or more processes to work together as a single entity. In OpenDayLight it is possible to have multiple instances of the controller working together as one entity.
Advantages:
Data persistence: when a controller crashes, the data on it is not lost.
There are many possible uses and configurations, depending on the type of cluster that is needed; they range from web services to scientific computations.
5. Raft, consensus algorithm
Raft is a consensus algorithm for managing replicated logs [12]. Before explaining the algorithm it is necessary to define what consensus is.
Each node has its own copy of the state machine, but the system as a whole gives the illusion that there is only one coherent state machine, as shown in Figure 5-1, even if some of the nodes are down. This replicated state machine can be used to solve different problems in large scale, single-leader systems.
A typical consensus cluster can recover from a server failure autonomously; there are two failure cases:
- Only a minority of servers fails. In this case the cluster can continue operating in the same way without any problem.
- A majority of the servers fails. In this case the cluster will no longer be available until a majority of the servers is running again, but even while unavailable the cluster will retain the consistency of its information.
Replication is performed by using a replicated log: each node has its own log, but the logs must be identical to those of the other nodes, with the entries in the same order.
Figure 5-1 - Replicated state machine architecture [12]
All the commands from clients are replicated to the other nodes; once those commands have been replicated and processed by the nodes, the leader can send an answer to the client. For this reason the nodes appear to be a single state machine.
- Raft is not time dependent, which ensures the consistency of the logs in case of clock failure.
Consensus is implemented by first electing a leader for the system; after the election the leader takes full responsibility for managing the replicated log. The procedure is the following: the leader receives log entries from the clients and then replicates them to the other nodes. When the majority has received and confirmed the log entries from the leader, it tells all the nodes to apply those log entries to their state machines, i.e. to "commit the transaction". If the leader fails for any reason, a new election must take place.
The consensus module is used to ensure proper log replication. As mentioned before, the system will not make any progress as long as a majority of the servers are down.
The Raft algorithm is split into three parts: leader election, log replication and safety. Each part of the algorithm is described below, in that order.
In leader election, the idea is to select one of the servers to act as cluster leader; if that server goes down, a new election has to take place.
In log replication, the leader's task is to take commands from clients, append those commands to its log and then replicate its log to the other nodes, making their logs match its own. This is done in order to overwrite inconsistencies.
In safety, the idea is to add restrictions to the leader election process so that only a server with a sufficiently up-to-date log can become leader.
In a cluster architecture the typical number of nodes is 5; this gives the system the ability to handle up to 2 failures. Raft defines 3 different states for the nodes of the cluster: leader, candidate and follower, as shown in Figure 5-2. When a node starts it always begins in the follower state, which is a passive state: it does not issue any requests, it only responds to requests from leaders and candidates. The leader handles all the requests from clients; if a client contacts a follower, it is redirected to the leader. The candidate state is the state used to elect a new leader.
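The state behavior just described can be summarized in a small sketch. The following Python fragment is only an illustration of the follower-to-candidate transition with a randomized election timeout; the class and method names are invented for this example and are not taken from the OpenDayLight code.

import random
import time

FOLLOWER, CANDIDATE, LEADER = "Follower", "Candidate", "Leader"

class RaftNode:
    def __init__(self):
        self.state = FOLLOWER
        self.current_term = 0
        # Raft typically randomizes the election timeout (e.g. 150-300 ms)
        # so that usually only one follower times out per term.
        self.election_timeout = random.uniform(0.150, 0.300)
        self.last_heartbeat = time.monotonic()

    def on_heartbeat(self, term):
        # An AppendEntries message (even an empty heartbeat) resets the timer.
        if term >= self.current_term:
            self.current_term = term
            self.state = FOLLOWER
            self.last_heartbeat = time.monotonic()

    def tick(self):
        # If no heartbeat arrives within the timeout, start a new election.
        if self.state != LEADER and \
           time.monotonic() - self.last_heartbeat > self.election_timeout:
            self.current_term += 1     # move to a new term
            self.state = CANDIDATE     # vote for itself and request votes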
Another characteristic of Raft is that it divides time into terms of arbitrary duration, numbered consecutively as shown in Figure 5-3.
A term always starts with an election, in which one or more nodes in the candidate state try to become leader. When one of the candidates wins the election and becomes leader, the term lasts until that leader fails. There are also situations in which a term ends without any leader being elected, which can be caused by a split vote. When this happens the term ends without a leader and a new term starts in order to hold a new election. This ensures that at most one node can become leader within a given term.
The term value plays an important role in Raft: nodes use it to detect obsolete information. The term is exchanged in every communication between nodes; when a node receives a larger term, it updates its own term value to the one sent by the other node and immediately reverts to the follower state. The term is also important during an election, because when a node receives a vote request carrying a smaller term, it rejects it.
There are two different messages in Raft: "Request Vote" and "Append Entries", both of which are RPC messages. "Request Vote" is used by candidate nodes to obtain votes from the other nodes, and "Append Entries" is used by the leader to replicate log entries. When an "Append Entries" message carries no entries, it is called a "heartbeat" message.
Figure 5-4 – Election Timeout [13]
As mentioned before, when a node in the follower state wants to start a new election after timing out, it increments the term and changes its state to candidate. When the election process starts, the candidate node always votes for itself and sends a "Request Vote" message to try to obtain votes from the other members, as shown in Figure 5-5.
Figure 5-5 – Vote Requests [13]
A candidate that has sent its vote requests can end up in one of three situations:
1. It gets the majority of the votes and becomes leader of the cluster.
2. Another candidate wins the election and becomes leader, in which case the node returns to the follower state.
3. No candidate obtains a majority of the votes (a split vote), so the term ends without a leader and a new election starts.
When a candidate wins an election it is because it got the majority of the votes; votes are granted on a first-come, first-served basis, as described in [12]. This ensures that only one node can become leader in a given term. When it becomes leader it starts to send "heartbeats" to the other nodes in order to announce that there is a leader and prevent new elections.
When multiple nodes change their state to candidate at the same time, the election can end up in a split vote situation. When this happens the candidates start a new election in the next term.
In the third case mentioned above, split votes could in principle repeat indefinitely, which is why Raft uses an extra measure to prevent this: randomized election timeouts. They are typically chosen between 150-300 ms, so that usually only one server times out in a given term; that server can then win the election and send "heartbeats" before any other node times out.
"Append Entries" to the other nodes in order to perform the replication. When that
entry is safety replicated to the majority, the leader can finally apply it to its state
machine. This process is also called “commitment process", which means that a
given command has been replicated and applied to the state machine and now is
safe. The entry is durable and will never be overwrite.
The way Raft organizes logs is the following, each log entry has a command along
with a term number which says when the entry was received by the leader. The term
number is really important because with it the system detects inconsistencies in logs.
Each log entry also has an integer index in order to identify its position in the log as
is shown in Figure 5-7.
Other properties of Raft are that log entries never change their position in the log, which preserves consistency, and that a consistency check is performed on every replication. This is done through the "Append Entries" messages: each message includes the term and log index of the entry preceding the new entries. The follower uses this information as follows. If the follower does not find an entry with that term and log index in its own log, it rejects the new entries. If it does find the entry with that term and log index, it means that the leader's log and the follower's log match up to that point, and it returns a success response to the leader.
When the system operates normally the consistency check always succeeds, which means that the logs of the leader and the followers stay consistent. However, leader or follower crashes can leave the logs inconsistent.
Log inconsistency means that a follower's log may differ from the leader's log in different ways, by having extra entries or missing entries. Those inconsistencies are resolved by overwriting the conflicting entries with entries from the leader; this is a safe method, but it has to be done under some restrictions that are explained in the safety part.
The procedure is the following: the leader figures out the last entry at which its log and the follower's log match, the follower deletes the entries after that point, and the leader then sends all of its own entries after that point.
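On the leader side this search is usually implemented with a per-follower index that is decremented after every rejection. A sketch under the same assumptions as the previous fragments (not the OpenDayLight implementation):

def replicate_to(follower, leader_log, term, next_index):
    # next_index: first log position (1-based) the leader tries to send.
    while next_index > 0:
        prev_term = leader_log[next_index - 2][0] if next_index > 1 else 0
        msg = AppendEntries(term=term, leader_id="member-1",
                            prev_log_index=next_index - 1,
                            prev_log_term=prev_term,
                            entries=leader_log[next_index - 1:])
        if follower.append_entries(msg):   # consistency check passed
            return len(leader_log)         # follower's log now matches the leader's
        next_index -= 1                    # back up one entry and retry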
5.3. Safety
After describing how Raft performs leader election and how it replicates logs, it is also necessary to describe the mechanisms that ensure the state machines are the same on all the nodes. Consider, for example, a leader that commits log entries while one of the nodes is unavailable; if that node is later elected leader, it could overwrite the entries committed by the previous leader with new entries.
Such a case would result in a loss of consistency, so some restrictions are needed to ensure that the leader of any given term contains all the entries committed in previous terms.
5.3.1. Election Restriction
The way Raft prevents candidates that are missing committed entries from becoming leader is through the voting process. When a candidate contacts the other nodes with a Request Vote, the message includes information about the candidate's log. Each node uses this information to determine whose log is more up to date, the candidate's or its own. If the voter's log is more up to date than the candidate's, it denies the vote; if its log is less up to date, it grants the vote.
This procedure ensures that the elected leader has all the log entries committed in previous terms.
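The "more up to date" comparison can be expressed compactly. A sketch of the voter's check, following the rule in the Raft paper [12] (the term and one-vote-per-term checks are omitted here):

def grant_vote(my_last_term, my_last_index, req):
    # req is a RequestVote message from the sketch above.
    # The candidate's log is at least as up to date as the voter's if its last
    # entry has a higher term, or the same term and an index at least as large.
    return (req.last_log_term, req.last_log_index) >= (my_last_term, my_last_index)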
5.3.2. Committing Entries from Previous Terms
An entry is said to be committed once it has been replicated to the majority. If a leader crashes before it can commit a given entry, that entry is not safe from being overwritten by future leaders. Future leaders will try to finish replicating the entry, but unfortunately a new leader cannot know whether an entry from a previous term is committed.
To solve this problem, Raft never commits log entries from previous terms by counting replicas: only entries from the current leader's term are committed this way, and when one of them is committed, all prior entries are committed indirectly.
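A sketch of the commit rule this implies for the leader (illustrative; the leader counts itself plus the followers whose replicated index has reached N, and only for entries of the current term):

def advance_commit_index(log, commit_index, match_index, current_term, cluster_size):
    # match_index: highest log index known to be replicated on each follower.
    for n in range(len(log), commit_index, -1):        # try the highest index first
        if log[n - 1][0] != current_term:
            continue                                   # never count older terms
        replicas = 1 + sum(1 for m in match_index.values() if m >= n)
        if replicas * 2 > cluster_size:                # majority reached
            return n                                   # entries up to n are committed
    return commit_index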
Once a node becomes the leader it has the authority to replicate data to the other nodes, which happens by sending "Append Entries" messages. Those messages also act as "heartbeats" when they carry no payload; that is how replication works.
Consensus works in the following way. Suppose, for example, that there is a leader and two followers: the leader must be able to replicate to at least one follower and see its confirmation that the data has been stored. Only then can the leader commit the entry and put it into the data tree; if the leader does not get any response it cannot put it into the data tree, and that information stays in the journal.
5.4.2. Snapshot Replication
Typically, when a cluster brings up a node it is not very efficient to send the log entries one by one to complete the replication, because it would take too much time. So essentially, when a node restarts, instead of sending "Append Entries" messages the leader just sends a snapshot, as shown in Figure 5-9. The whole data tree is sent, and in addition the data tree is broken up into smaller chunks to perform the replication, because this data tree can be really large. The size of those chunks is typically fixed at 2 MB.
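A simple way to picture the chunking (a sketch; the 2 MB figure comes from the text above, the rest is illustrative):

CHUNK_SIZE = 2 * 1024 * 1024   # 2 MB, the typical fixed chunk size

def snapshot_chunks(snapshot_bytes):
    # Split the serialized data tree into fixed-size chunks that are sent
    # one by one to the restarting follower instead of individual log entries.
    for offset in range(0, len(snapshot_bytes), CHUNK_SIZE):
        yield snapshot_bytes[offset:offset + CHUNK_SIZE]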
5.4.3. Durability/Recovery
Durability is useful in the recovery process. In Figure 5-10 there are two components: the first is the data tree, which is stored in memory, and the second is the journal, which is persisted. The reason for this is that after a restart a node has to recover from persistence. For example, if a bunch of flows has been added to the configuration data, then after a controller restart it is desirable to see all those flows again, because otherwise the controller would not be able to reconfigure all the switches in the same way. The journal is essentially the list of all the modifications ever made, stored one by one, while the snapshot exists because it is important to recover faster. For example, when there are thousands of flows in the journal, it is not good to wait a long time for each flow to be read from disk and added to the data tree one at a time. It is better to store the whole data tree in a snapshot, a disk file that is read all at once and from which the data tree is reconstructed.
6. Akka
It is important to explain some definitions before going deeper into Akka [16].
- Leader: a single node in the cluster which acts as a master. When a node is the leader, it has full access to the switch.
6.1. Membership
A cluster is composed of a set of logical nodes; each node is identified by its hostname, port and a UID, an identifier number given by the system to differentiate the members and provide better control of the join and death processes. The membership process is initiated by sending a "join message" to one of the seed nodes in the system; communication between members is performed using the Gossip Protocol. The current state of the cluster is gossiped randomly to the members of the cluster, with some priority for the members that have not yet seen the latest version of the state.
6.2. Failure detector
The failure detector is in charge of detecting unreachable nodes within the cluster. It works by keeping a history of failure statistics, calculated from the heartbeats received from other nodes; based on this, a threshold defines how suspicious the detector has to be before declaring a node unreachable. This mechanism is called the phi accrual failure detector, and the threshold can be configured by the user. It is important for the system: a high threshold leads to fewer mistakes but needs more time to detect real crashes, whereas a low threshold may generate more mistakes but gives faster detection. The default value of the threshold is 8, which is appropriate for most situations.
The other function of the failure detector is to detect when an unreachable node becomes reachable again; this is again done through a gossip round.
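This threshold corresponds to the standard Akka cluster failure-detector setting; as a reference, it can be tuned in the Akka configuration roughly as follows (whether the ODL akka.conf exposes it explicitly depends on the distribution):

akka {
  cluster {
    failure-detector {
      # phi accrual failure detector threshold: a higher value means fewer
      # false positives but slower detection of real crashes. Default is 8.
      threshold = 8.0
    }
  }
}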
6.3. Seed nodes
When a new node wants to join the cluster, the first thing it has to do is contact the seed nodes; it then sends a join command to the seed node that answers first. It is possible to configure many seed nodes in the cluster, but the cluster can work quite well with only one seed node.
Seed nodes do not have any influence on the cluster performance; they only act as a contact point for new nodes.
When a node leaves the cluster gracefully, the leader changes the status of that node to the leaving state; then, when the system achieves convergence, it moves the node to the exiting state and finally marks it as removed.
6.4. Membership lifecycle
Figure 6-1 – Membership Lifecycle [16]
The member states are the following: joining, up, leaving, exiting, down and removed. There are also some actions available to the leader and to users.
6.5. Joining to seed nodes
Seed nodes can be started in any order, with the only condition that the first node in the seed node list has to be the first one started in the cluster; otherwise it will not be able to initialize the other seed nodes. The reason for this is to avoid the creation of separate islands in the cluster. There is no restriction on how many seed nodes have to be started, but at least 2 seed nodes are needed to start the cluster. Once more than 2 seed nodes are running, it is possible to shut down the first node in the seed node list.
When the cluster is formed, a new node can try to join the cluster through any member node, even if that node is not a seed node.
The first node in the seed node list will join itself if it cannot contact any other seed node.
6.6. Leaving
In Akka there are only two ways to remove members from the system. The first is to stop the node and wait for the other nodes to detect it as unreachable, after which the leader marks it as removed. The second is to inform the system that a given node has to leave.
6.7. Node roles
In distributed systems the nodes can have different roles, which means that the workload can be distributed in any way among the members of the system. For example, one node can be in charge of the Data Store, another of the architecture and another of the inventory. It is also possible to give the same role to all the members and so have redundancy; this is done by replicating all the duties among the members in order to gain availability in the system. The roles are defined in the configuration files.
6.8. Persistence module
When a node needs to recover, it has to replay all the stored changes in order to rebuild its internal state. It can rebuild the internal state from zero or start from a snapshot, which makes the process faster and reduces the recovery time.
If new messages arrive at the node during recovery, they do not interfere with the replay process: they are stored in a cache and processed once the recovery phase ends.
6.9. Snapshots
With the use of snapshots the system can dramatically reduce recovery time. The system saves snapshots of the internal state in order to use them later during recovery. A snapshot is offered when a node is about to start or restart, with the aim of initializing its internal state. If there are several snapshots in the system, the youngest version is taken.
7. Gossip protocol
Akka uses a variant of gossip called push-pull gossip [16], with the aim of reducing the amount of information sent in the cluster. This means that initially only version numbers are sent, not the actual information. In the answer to a gossip message a node sends a value indicating whether it has a newer or an older version; this is done with the help of a vector clock for versioning, making it possible to pull the information only when it is needed.
Each node has the bucket, or version information, of all the other nodes, as shown in Figure 7-1. For example, in a cluster with 3 members, member one has the information of members two and three. In this way, when a member gossips with another node, it can tell whether it has an older or a newer version with respect to the other nodes. The gossip mechanism works in the following way: every second the members send status messages to each other, saying which version of the bucket they know. Based on that information, a node decides to send back its own status if its version is lower, or otherwise to send an update. Every time something changes in the bucket, the version of the bucket changes as well.
Figure 7-1 – Gossip Protocol [14]
Messages are normally exchanged every 1 second and the choice of where to send the next gossip message is random, but some priority is still given to those nodes that have not seen the latest version.
When the cluster is in a convergence state, the system only sends small gossip messages containing just the gossip version. But when there is a change in the cluster and there is no convergence, the system goes back to sending the complete gossip state.
The gossip protocol also makes use of an algorithm for its data structure, called the vector clock, which provides a partial ordering of events and detection of causality violations in distributed systems. Gossip uses the vector clock to notice differences in the cluster state during a gossip exchange. A vector clock is a set of (node, counter) pairs, so every time the cluster state changes the vector clock has to be updated as well.
When a node receives a gossip status message there are three possible outcomes, handled as follows (see the sketch after this list):
- If the sender has a newer version, the recipient sends back a message requesting the new version.
- If the sender has an outdated version, the recipient sends back its gossip state.
- If there are conflicting gossip versions, the versions are merged and sent back.
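A compact sketch of this status comparison using a vector clock (purely illustrative; the member names and the dictionary structure are invented for the example):

def compare(vclock_a, vclock_b):
    # Vector clocks are dicts mapping node -> counter.
    nodes = set(vclock_a) | set(vclock_b)
    a_newer = any(vclock_a.get(n, 0) > vclock_b.get(n, 0) for n in nodes)
    b_newer = any(vclock_b.get(n, 0) > vclock_a.get(n, 0) for n in nodes)
    if a_newer and b_newer:
        return "conflict"      # merge both states and gossip the result back
    if a_newer:
        return "sender_newer"  # recipient requests the new version
    if b_newer:
        return "sender_older"  # recipient sends back its own gossip state
    return "same"              # already converged, nothing to exchange

# Example: member-1 has seen two local changes, the peer has only seen one.
print(compare({"member-1": 2}, {"member-1": 1}))   # -> "sender_newer"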
8. Clustering in OpenDayLight
- Data Store
The idea behind the Data Store implementation is that it allows for high availability and scalability, with all the members talking to each other and distributing the data as shown in Figure 8-1.
- RPC
If a routed RPC is registered on one node, that RPC can be invoked either from RestConf or from any node in the cluster, regardless of where it is invoked from.
Figure 8-2 – RPC [14]
8.1. How is this built on?
Figure 8-3 – Akka Actors [14]
The Akka persistence module gives the ability to store data so that, in case of a restart, the node gets the data back and reconstructs the state of the tree. It has 2 different elements:
- The journal, which is essentially a file with all the modifications ever made to the data tree.
- The snapshot, which is a disk file containing the whole data tree, used to make recovery faster.
The Akka remoting module is the module through which an actor system on one node communicates with an actor system on another node; that is its main purpose.
The Akka clustering module is used for the discovery of nodes. For example, if there are 2 nodes and it is necessary to know where the other node is, what its IP address is or where it is hosted, Akka clustering will give that information. It also gives information about the status of members, for example whether a member is alive, dead, reachable or unreachable.
8.2. Data synchronization
In the data store there are trees, and the objective is to keep all the trees synchronized, as shown in Figure 8-4. There are different data trees such as inventory, topology and Toaster. These trees are allocated within one big tree, which is synchronized for high availability. For this purpose the Raft consensus algorithm is used to make sure that all these trees look the same on each node.
For RPCs, the RPC Registry is synchronized for every registration on a node. For example, when an RPC is registered for an OpenFlow switch on node 1, it is important to know exactly how to invoke that RPC for that switch; that information goes into the registry and it is replicated as well, as shown in Figure 8-5.
Figure 8-5 - Synchronized RPC Registry [14]
8.3. Communication
The way in which the "distributed data store" communicates with the data tree, is
putting an actor around the data tree. So when is desired to have communication
with the data tree, it is just necessary to send a message to the actor and wait for
that message to be processed, leading to a modification of the data tree.
52
Figure 8-6 – Sharding [14]
The solution to this problem is to distribute the data across the cluster in a replicated way, as shown in Figure 8-7. For example, member 1 is the leader of a given shard while members 2 and 3 act as followers; these followers have exactly the same data, so that if member 1 goes down one of the other two nodes takes over the leader role in order to guarantee high availability. The way this works is again governed by the Raft algorithm.
Figure 8-7 – Distributed Data Store [14]
8.7. Startup
When the cluster starts up, an instance of the distributed data store is created and the system has to wait until it becomes ready. The problem is that with a distributed data store it is difficult to know when it is ready for use. With the "In Memory" data store this is quite clear: the store is created and is immediately available for work, transactions can be created and so on. In the "Distributed" data store that is not possible, because if only one instance of the controller has started and the other instances have not, there will be no consensus, for example about who the leader is.
So after the system has created the distributed data store there is a waiting time until it gets ready, normally up to 90 seconds spent trying to find the leader. This is enough time to start another node and create its shards, so that a leader can be elected. If that happens within the 90 seconds the system moves forward; otherwise it blocks for the full 90 seconds. When the distributed data store is created it creates two classes: one is the ActorContext, which allows communication with actors (this is necessary because the distributed data store itself is not an actor), and the other is the ShardManager, which is the parent of all the shards. Shards are created based on a configuration file called module-shards.conf, and for all of those shards a leader has to be found within the 90 seconds mentioned above. As a result, if a transaction is created before a leader is found, that transaction will fail.
When shards are created, the first thing they do is read and recover the information from disk, either from the journal or from the snapshot; a shard reconstructs the data tree, sets its behavior to follower and then announces that it is ready for communication, so that the election process can continue and a leader can be chosen. Once the leader is found, the countdown and the wait end and the system can move forward. This procedure needs to happen for both the ConfigDataStore and the OperationalDataStore.
9. Set Up and Testing of the Cluster
The idea is to have multiple instances of the controller working together as if they were a single entity, in order to achieve better scaling, persistence and high availability in the system. In this case it will be a 3 node cluster, as shown in Figure 9-1, based on the Helium-SR4 distribution.
Before going deeper into the cluster deployment, it is necessary to make some considerations.
The last consideration is to know which data the system needs; this data is allocated among different shards. By default OpenDayLight comes with some shards already created: Inventory, Topology, Toaster and a default one for any other kind of information. In this thesis only these shards are used, because for the purposes of the thesis it is not necessary to create a new one.
The first thing to do is to create a Virtual Machine (VM) to host the controller; in this case Ubuntu 14.04 [17] and VirtualBox 5.0.6 [18] were used.
Now that there is a machine to host the controller, the next step is to get the OpenDayLight controller. For this there are two options: the first is to download a version that is already modified with all the features needed, which can be found in some OpenDayLight repositories; the second is to download the original version and add manually all the features needed for the cluster deployment, which can be downloaded from the OpenDayLight webpage [8].
In this case an original version downloaded directly from the OpenDayLight webpage was used. How to install and configure the VM is not explained here [19]. The features that are needed are odl-restconf, odl-l2switch-switch, odl-mdsal-clustering and odl-openflowplugin-flow-services.
In the cluster architecture at least 3 VMs are needed, one for each controller. It is then possible to run the controller on each node by entering the distribution folder and looking for the <Karaf-distribution-location>/bin directory.
In order to run the controller it is necessary to execute the karaf file, i.e. ./karaf. This will run the controller and a command window will appear, as shown in Figure 9-2.
In this window it is possible to perform actions such as installing features, checking which features are already installed and observing some of the internal processes of the controller, for example when a switch requests a connection to the controller.
Now that the controller is running it is possible to install the features. This is done by typing feature:install followed by the name of the desired feature in the command window, as in the example below.
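For example, installing the clustering feature from the Karaf console and checking that it is present looks roughly like this (the console prompt may differ slightly between distributions):

opendaylight-user@root> feature:install odl-mdsal-clustering
opendaylight-user@root> feature:list -i | grep mdsal-clustering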
In this case the most important feature is odl-mdsal-clustering: it installs the Akka toolkit and additional components in the controller, and it is the one that makes the cluster functionality possible. This feature also creates some initial configuration and the files that make the manual configuration possible. Those files are stored in the folder "configuration/initial" and are named akka.conf and module-shards.conf.
Once the features mentioned above are installed, it is possible to start with the manual configuration of the nodes.
First, it is necessary to go to the akka.conf file and make some modifications: set the IP address and port on which the node will be listening, configure all the seed nodes and define the role of the node.
The IP address and port that appear in the initial configuration are the following.
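As a reference, the relevant remoting block in akka.conf looks roughly like this in the Helium distribution (a sketch; the exact surrounding structure may differ):

remote {
  netty.tcp {
    hostname = "127.0.0.1"   # replace with the node's own IP address
    port = 2550              # port used for the clustering traffic
  }
}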
It is then necessary to modify only the hostname, setting it to the actual IP address of the node; the port stays the same. This configuration tells the system that the node will be listening on the IP address 192.168.56.101 and port 2550; this is important because all the requests from joining nodes will be received on this address and port.
For the seed node list, the initial configuration contains the following.
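A sketch of the seed-node entry, before and after the change (illustrative values; the address must match the first node of the cluster, and the actor system name opendaylight-cluster-data is the one that appears in the logs of Chapter 10):

cluster {
  # initial value, pointing to localhost
  seed-nodes = ["akka.tcp://[email protected]:2550"]

  # after the modification, pointing to the first node of the cluster
  # seed-nodes = ["akka.tcp://[email protected]:2550"]
}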
In this part it is also necessary to modify the address of the seed node, because initially it is configured to contact itself at the localhost address. It is possible to configure more than one seed node in the system; there is no limitation on how many seed nodes the cluster can have.
For the role of each node, the default configuration is the following.
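A sketch of the roles entry as it typically appears by default (the member-1 role name matches the shard names seen later in the logs):

roles = [
  "member-1"   # change to "member-2" or "member-3" on the other nodes
]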
This part has to be modified only when the configured node is not the first node: if it is the second or the third node, the role has to be changed to member-2 or member-3 respectively.
These are all the changes needed in the akka.conf file.
It is then also necessary to modify the second file, module-shards.conf. In this file it must be specified whether a shard will be replicated on other nodes. The case of the inventory shard is shown here; the initial file looks like the following.
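A sketch of what the inventory entry in module-shards.conf looks like by default, with a single replica on the local member (the exact syntax of the file may differ slightly between releases):

{
  name = "inventory"
  shards = [
    {
      name = "inventory"
      replicas = [
        "member-1"
      ]
    }
  ]
}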
It is possible to run the cluster without changing this file at all, but to implement the high availability case it is necessary to replicate all the shards on all the nodes that are going to be part of the cluster, as in the following.
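For the high availability case the replicas list of every shard simply includes all three members, along these lines:

replicas = [
  "member-1",
  "member-2",
  "member-3"
]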
Once the previous configuration has been done on all the nodes, it is possible to restart all the controllers and start the clustering services.
9.1. Testing
After the cluster configuration it is necessary to implement some mechanism to validate that the setup is right. Validating means proving that the cluster is running properly, for example checking that there is a leader for each shard and that the system performs the right operations, such as shard replication and commits.
For this purpose "Postman" is used, an application for making HTTP requests; with this app it is possible to ask the controller for information about a specific shard. The request to do that is the following.
GET http://192.168.56.101:8181/jolokia/read/org.opendaylight.controller:Category=Shards,name=member-1-shard-inventory-config,type=DistributedConfigDatastore
This request returns information about the state of the cluster, as shown in Figure 9-3.
From this answer it is possible to obtain valuable information such as the last log index, the current term, failed transactions, committed transactions and the current leader.
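As an illustration, the same check can be scripted. The sketch below assumes the Jolokia endpoint shown above and the field names exposed by this controller version (RaftState and Leader), which may vary between releases; depending on the setup, HTTP basic authentication (for example the default admin/admin credentials) may also be required.

import base64
import json
import urllib.request

URL = ("http://192.168.56.101:8181/jolokia/read/org.opendaylight.controller:"
       "Category=Shards,name=member-1-shard-inventory-config,"
       "type=DistributedConfigDatastore")

req = urllib.request.Request(URL)
# Add basic auth if the controller requires it (assumed credentials).
req.add_header("Authorization",
               "Basic " + base64.b64encode(b"admin:admin").decode())

with urllib.request.urlopen(req) as resp:
    value = json.load(resp)["value"]

# Print the Raft role of this replica and who the current shard leader is.
print(value.get("RaftState"), value.get("Leader"))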
10. Log Analysis
In this part the actions performed by the controller on every node are analyzed. When a controller starts operating it creates a text file that records all the steps taken; this file is located in the folder "data" and its name is Log.txt. It contains all the information about actions, notifications and modifications made in the controller, and it helps to understand what is really happening in the system.
In order to focus on the main goal, which is the clustering part, only the information related to that feature is analyzed.
Node 1
Let's start with the first node in the seed node list (Node 1). Once the controller starts and executes all the default features, it has to initialize itself as a cluster node. This is an automatic process, as shown in the following log excerpt.
2016-01-19 18:04:11,359 | INFO | ult-dispatcher-2 | Remoting | 266 - com.typesafe.akka.slf4j - 2.3.10 | Remoting started;
listening on addresses :[akka.tcp://[email protected]:2550]
2016-01-19 18:04:11,480 | INFO | ult-dispatcher-2 | kka://opendaylight-cluster-data) | 266 - com.typesafe.akka.slf4j - 2.3.10
| Cluster Node [akka.tcp://[email protected]:2550] - Starting up...
2016-01-19 18:04:11,625 | INFO | ult-dispatcher-3 | kka://opendaylight-cluster-data) | 266 - com.typesafe.akka.slf4j - 2.3.10
| Cluster Node [akka.tcp://[email protected]:2550] - Registered cluster JMX MBean
[akka:type=Cluster]
2016-01-19 18:04:11,626 | INFO | ult-dispatcher-3 | kka://opendaylight-cluster-data) | 266 - com.typesafe.akka.slf4j - 2.3.10
| Cluster Node [akka.tcp://[email protected]:2550] - Started up successfully
Once Node 1 is successfully initialized, the Akka clustering module starts working.
It sends join messages to all the seed nodes defined in the configuration files. At
this point, since it is the only node initialized, it does not get any answer back; this
can be seen in the log, where Node 1 contacts the other nodes but receives no reply
and the connections are refused, as shown in the following excerpt.
With this in mind, its next move is to join itself by sending a "Join message" to its
own address, specifying its configured role "member-1". After that the node is set
as up and operation can start; this is possible only because it is the first node in the
seed-node list.
2016-01-19 18:04:11,754 | INFO | config-pusher | DistributedDataStore | 280 - org.opendaylight.controller.sal-
distributed-datastore - 1.1.4.Helium-SR4 | modules config file exists - reading config from it
2016-01-19 18:04:12,226 | INFO | config-pusher | DistributedDataStore | 280 - org.opendaylight.controller.sal-
distributed-datastore - 1.1.4.Helium-SR4 | module shards config file exists - reading config from it
2016-01-19 18:04:12,241 | INFO | config-pusher | DistributedDataStore | 280 - org.opendaylight.controller.sal-
distributed-datastore - 1.1.4.Helium-SR4 | modules config file exists - reading config from it
If some modules are still missing in the system, the controller creates them. This
normally happens when a node is started for the first time; in this case the controller
has to create the modules because it is its first run.
When the modules are fully operational, meaning that the system has either read them
from disk or created them from scratch, a notification like the following appears in the
log.
Now that all the modules are running in the system, the next step is to read from the
initial configuration files which shards need to be created. In this case these are only
the inventory, topology, toaster and default shards.
2016-01-19 18:04:14,208 | INFO | lt-dispatcher-20 | Shard | 273 - org.opendaylight.controller.sal-akka-raft -
1.1.4.Helium-SR4 | Shard created : member-1-shard-inventory-operational persistent : true
2016-01-19 18:04:14,209 | INFO | lt-dispatcher-20 | InMemoryDataTree | 151 - org.opendaylight.yangtools.yang-
data-impl - 0.6.6.Helium-SR4 | Attempting to install schema contexts
Now that the shards are in the "InMemoryDataTree", it is possible to recover the
information from the journal; after that, a notification announces that the shard is
ready.
When a shard is ready to operate, it always starts in the Follower state, as defined by
the Raft algorithm. In the log it is possible to see how the node switches the state of
the shard to Follower. After this step the system starts the leader-election process.
Node 2
On this node the same procedure as on Node 1 can be observed. It starts and
immediately tries to communicate with the seed nodes; in this case, since Node 1 is
already in operation and is the first one in the seed-node list, there must be an answer.
The communication between the two nodes is the following: first, Node 2 sends a
"Join message" to Node 1, which replies with a "Welcome message". This
automatically creates a connection and allows the cluster deployment to continue.
When all the nodes in the cluster are up and the shards have been created on each
node, it is time to start the leader election for each shard; this is done through the
implementation of the Raft algorithm.
As explained before, all the shards start as followers and wait for a "heartbeat" within
a determinate time; if they do not receive anything, they change their status to
candidate and try to obtain votes from the other nodes. They send RequestVote
messages and, if the majority of the nodes vote for a candidate, that candidate changes
its status to leader.
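The following is a minimal, simplified sketch of this follower-to-candidate-to-leader
transition (Python illustration with hypothetical node and peer objects, not the actual
ODL implementation):

    import random

    def on_election_timeout(node, peers):
        # No heartbeat arrived within the randomized election timeout:
        # the follower becomes a candidate and starts a new term.
        node.state = "Candidate"
        node.current_term += 1
        node.voted_for = node.member_id
        votes = 1  # the candidate votes for itself

        for peer in peers:
            # RequestVote carries the candidate's term and last log position.
            if peer.request_vote(node.current_term, node.member_id,
                                 node.last_log_index, node.last_log_term):
                votes += 1

        if votes > (len(peers) + 1) // 2:
            node.state = "Leader"   # start sending AppendEntries heartbeats
        else:
            # restart with a new randomized timeout to avoid repeated split votes
            node.election_timeout = random.uniform(0.15, 0.30)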
In the log it is possible to observe a shard changing its status from Follower to
Candidate after waiting for the election timeout. In this case it is the shard of Node 1
that changes its status to Candidate.
Once the votes are received, a leader is set for the given shard, as shown in the
following excerpt.
The analysis described above is very important because it is the only way to actually
understand what is happening inside the system, and it also gives an idea of how to
solve problems.
11. Messages between Controllers
It has already been explained how the system works internally, but it remains to be
explained how the nodes interact with each other, that is, the actual data exchange
between controllers. With that in mind, a packet capture was performed in order to
determine the structure of an OpenDayLight packet and the different messages that
are implemented.
This is one of the most important parts of the project because it sets some parameters
for future interactions with different controllers, providing the tools for building a
proxy between the OpenDayLight controller and another kind of controller. This
would allow future clustering architectures in which the cluster is composed of
different kinds of controllers working together.
For this packet capture, Wireshark [20], a packet analyzer, was used; with this tool it
was possible to filter the packets of interest and to isolate the desired information
contained in the payload.
11.1. Capture
Once the controllers start, they try to communicate with the seed nodes as explained
before, first by establishing a TCP channel. If the node being contacted is alive, it
accepts the connection; otherwise the connection is refused. This TCP channel is built
with the three-way handshake mechanism.
During the first part of the clustering, the Akka toolkit is in charge of all the
communication; afterwards the Raft algorithm takes over.
11.1.1. Akka tool Part
After the handshake, with the channel established, the nodes try to join the seed nodes
by sending a message that basically contains their own address, as shown in the
following.
...D.B...>3.opendaylight-cluster-data..192.168.56.102...".tcp.........
After this first contact, the joining node has to wait until one of the seed nodes
answers. This answer is very simple: it contains the seed node's address.
...D.B...>3.opendaylight-cluster-data..192.168.56.101...".tcp.].......
When the first contact is done, the joining node can start the joining process. This is
done by initiating an internal action called "Join Action". The process also has an
identifier number that is included in the message, which looks like the following.
...%.....;9akka.tcp://opendaylight-cluster-
[email protected]:2550/.gc........system......cluster......core.....daemon",akka.cluster.InternalClusterAction$InitJoin$(..."w
uakka.tcp://[email protected]:2550/system/cluster/core/daemon/joinSeedNodeProcess-1#-
2021912680
Once the seed node has received the previous initialization message, it sends back an
acknowledgment of the join process so that the joining can continue. The whole
communication relies on the identifier number, which makes it possible to distinguish
the different processes.
uakka.tcp://[email protected]:2550/system/cluster/core/daemon/joinSeedNodeProcess-1#-
2021912680.l8.opendaylight-cluster-data..192.168.56.101...".akka.tcp....akka.cluster.InternalClusterAction$InitJoinAck"`
^akka.tcp://[email protected]:2550/system/cluster/core/daemon#1239809082
After the acknowledgment, the joining node has to send information about its
configuration, that is, its member role in the cluster. This is done because there are
configurations in which each member has a different role. In this case the role defined
in the initial configuration is "member-2".
...V.....;9akka.tcp://[email protected]:2550/.....K?8.opendaylight-cluster-data..192.168.56.102..."
.akka.tcp....v..member-2.......system......cluster......core.....daemon"'akka.cluster.InternalClusterAction$Join(..."`
^akka.tcp://[email protected]:2550/system/cluster/core/daemon#2118301878
Now that the seed node is aware of the role of the joining node, it can finish the
initiation process by changing the status of the joining node to up. It also has to
inform the joining node of this, which is done with a welcome message.
^akka.tcp://[email protected]:2550/system/cluster/core/daemon#2118301878................r...../
H.KI...L.(.M.)-.I-.MI,I..3.4.34..35.340..&.......W.\ p....|B...(6......PM....L.#757.....2.R00M206JJMKL.L25.HJ.01I2.4M5LLL3KU
..`.`.`TbdP..`.......J..ZL..F...B,@.......&......*akka.cluster.InternalClusterAction$Welcome"`^akka.tcp://opendaylight-cluster -
[email protected]:2550/system/cluster/core/daemon#1239809082
From this point on, the joining node is part of the cluster. The procedure is the same
for every node that wants to join. After the joining process, all the members of the
cluster can start to exchange information; this is achieved with the help of the Akka
clustering module.
Two more messages are part of this module: the heartbeat and the gossip message.
The heartbeat is a very important message used to maintain the up state of the
members of the cluster, which is why it is repeated during the whole operation of the
cluster. If a heartbeat message is not answered, it means that some node is not
reachable and the system marks it as such. That node remains part of the cluster as
long as it does not exceed the Phi accrual failure detector threshold; otherwise it has
to be removed from the system.
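As a rough illustration of how the Phi accrual failure detector works, the sketch below
assumes that heartbeat inter-arrival times follow a normal distribution estimated from
a window of recent samples, as in Akka's implementation; the threshold is
configurable (akka.cluster.failure-detector.threshold, 8.0 by default) and the numbers
used in the example call are placeholders.

    import math

    def phi(time_since_last_heartbeat, mean_interval, stddev_interval):
        # Probability that a heartbeat still arrives later than now, under a
        # normal model of the past inter-arrival times.
        y = (time_since_last_heartbeat - mean_interval) / stddev_interval
        p_later = 0.5 * math.erfc(y / math.sqrt(2.0))
        # phi grows as the silence gets longer; clamp to avoid log10(0).
        return -math.log10(max(p_later, 1e-18))

    # The node is marked unreachable once phi exceeds the configured threshold.
    THRESHOLD = 8.0
    print(phi(time_since_last_heartbeat=2.5, mean_interval=1.0, stddev_interval=0.2) > THRESHOLD)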
A heartbeat message also carries a heartbeat-sender number, which is useful to
identify the different actors. It looks like the following.
9akka.tcp://[email protected]:2550/.....8.opendaylight-cluster-data..192.168.56.101...".akka.tcp...
....system......cluster......heartbeatReceiver"-akka.cluster.ClusterHeartbeatSender$Heartbeat(..."qoakka.tcp://opendaylight -
[email protected]:2550/system/cluster/core/daemon/heartbeatSender#-1807478239
...Y.....qoakka.tcp://[email protected]:2550/system/cluster/core/daemon/heartbeatSender#-180747
8239.u?8.opendaylight-cluster-data..192.168.56.102...".akka.tcp....v...0akka.cluster.ClusterHeartbeatSender $HeartbeatRsp
"geakka.tcp://[email protected]:2550/system/cluster/heartbeatReceiver#-1251842346
Gossip messages are also very important for the system: they carry useful information
about the cluster, such as its state and versioning. They are the key to building a
convergent system. Gossip messages are exchanged every second and they are
random in the sense that they do not have a fixed destination. The message looks like
the following.
9akka.tcp://[email protected]:2550/[email protected]...".akka.tcp
....?8.opendaylight-cluster-data..192.168.56.102...".akka.tcp....v..............r...../H.KI...L.(.M.)-.I-.MI,I..3.4.34..35.340..&.......W.\
p....|\...`[email protected]..'.......#../.... `.`Pbd.b.`0..`.b..`.....T...........system......cluster......core.....da
emon".akka.cluster.GossipEnvelope(..."`^akka.tcp://[email protected]:2550/system/cluster/core/
daemon #1239809082
...,.....`^akka.tcp://[email protected]:2550/system/cluster/core/daemon#1239809082.....?8.open
daylight-cluster-data..192.168.56.102..."[email protected]...".akka.tcp................
....r...../H.KI...L.(.M.)-.I-.MI,I..3.4.34..35.340..&.......W.\p....|\...`[email protected]..'.......#../....
`.`Pbd.b.`.`4..`........>.........akka.cluster.GossipEnvelope"`^akka.tcp://[email protected]:2550
/system/cluster/core/daemon#2118301878
Up to this point the Akka clustering toolkit is acting; from now on the principal role
is taken by the Raft algorithm. The first Raft message observed in the capture is the
RequestVote, which looks like the following.
9akka.tcp://[email protected]:2550/...........sr.=org.opendaylight.controller.cluster.raft.messages.
RequestVotem[}.a.)&...J..lastLogIndexJ..lastLogTermL..candidateIdt..Ljava/lang/String;xr.Aorg.opendaylight.controller.cluste
r.raft.messages.AbstractRaftRPC......l ...J..termxp........................t.$member-1-shard-inventory-operational........user......sha
rdmanager-operational.(...$member-2-shard-inventory-operational(..."....akka.tcp://[email protected]
:2550/user/shardmanager-operational/member-1-shard-inventory-operational#-1784895488
The answer to this message is very similar and, in addition, it carries the information
on whether the vote was granted or not. The message looks like the following.
.............akka.tcp://[email protected]:2550/user/shardmanager-operational/member-1-shard-
inventory-operational#-1784895488.........sr.Borg.opendaylight.controller.cluster.raft.messages.RequestVoteReply
..._.......Z..voteGrantedxr.Aorg.opendaylight.controller.cluster.raft.messages.AbstractRaftRPC......l ...J..termxp..........."..
..akka.tcp://[email protected]:2550/user/shardmanager-operational/member-2-shard-inventory-
operational#-870461665
Once a candidate receives the majority of the votes, it becomes the leader of the given
shard for the given term. From that moment on it starts to replicate data to the other
nodes. In this Akka-based implementation that kind of message is called an
"AppendEntries" message, and it can also be used as a heartbeat, with the aim of
establishing whether a given leader is up or down; if the leader is down, another
election round has to take place. A typical AppendEntries message looks like the
following.
9akka.tcp://[email protected]:2550/.....T...$member-1-shard-inventory-operational...........
..........0..........8..................user......shardmanager-operational.(...$member-2-shard-inventory-operational"_org.opendaylight
.controller.protobuff.messages.cluster.raft.AppendEntriesMessages$AppendEntries(..."....akka.tcp://opendaylight-cluster-
[email protected]:2550/user/shardmanager-operational/member-1-shard-inventory-operational#-1784895488
...`.........akka.tcp://[email protected]:2550/user/shardmanager-operational/member-1-shard-
inventory-operational#-1784895488.........sr.Dorg.opendaylight.controller.cluster.raft.messages.AppendEntriesReply...Y.N.
....J..logLastIndexJ..logLastTermZ..successL.followerIdt..Ljava/lang/String;xr.Aorg.opendaylight.controller.cluster.raft.messag
es .AbstractRaftRPC......l ...J..termxp.........................t.$member-2-shard-inventory-operational.."....akka.tcp://opendaylight-
cluster-data @192.168.56.102:2550/user/shardmanager-operational/member-2-shard-inventory-operational#-870461665
In this case it is easy to note that the "AppendEntries" message is empty: when there
is no information to exchange, the message is used purely as a heartbeat.
In this work a different test was also performed, in which a small network emulated
in Mininet was connected to the cluster of controllers, with the aim of feeding
inventory and topology information to the controller. As a result, the system creates
a transaction and starts to replicate all the data from the shard leaders to the followers.
An example of such a message is the following.
9akka.tcp://[email protected]:2550/..........#member-1-shard-topology-operational.. .*.........iorg.
opendaylight.controller.cluster.raft.protobuff.client.messages.CompositeModificationByteStringPayload.....Rclass org.open
daylight.controller.cluster.datastore.modification.MergeModification........ .0..Y....... .0. .b+urn:TBD:params:xml:ns:yang:
network-topologyb2013-10-21b.network-topology..Rclass org.opendaylight.controller.cluster.datastore.modification.Merge
Modification........ .0....... .0..c....... .0. .b+urn:TBD:params:xml:ns:yang:network-topologyb2013-10-21b.topologyb. network-
topology..Rclass org.opendaylight.controller.cluster.datastore.modification.WriteModification.8...... .0....... .0........ "......
...flow:1..0............ ."...... ...flow:1..0. .2........ .0. .:.flow:1H.b+urn:TBD:params:xml:ns:yang:network-topologyb2013-10-
21b.topologyb.topology-idb.network-topology......50.8..................user......shardmanager-operational.'...#member-2 -shard-
topology-operational"_org.opendaylight.controller.protobuff.messages.cluster.raft.AppendEntries Messages$AppendEntries
(..."....akka.tcp://[email protected]:2550/user/shardmanager-operational/member-1-shard-topology-
operational#-1971300728
...c.....;9akka.tcp://[email protected]:2550/.......member-2-txn-0.... .........user......shardmanager-
operational.'...#member-1-shard-topology-operational"eorg.opendaylight.controller.protobuff.messages.transaction.
ShardTransactionMessages$CreateTransaction(..."[email protected]://[email protected]:2550/temp/$a
@akka.tcp://[email protected]:2550/temp/$a.......akka.tcp://[email protected]
.56.101:2550/user/shardmanager-operational/member-1-shard-topology-operational/shard-member-2-txn-0#1175605301
..member-2-txn-0.....jorg.opendaylight.controller.protobuff.messages.transaction.ShardTransactionMessages$
CreateTransactionReply"....akka.tcp://[email protected]:2550/user/shardmanager-operational
/member-1-shard-topology-operational#-1971300728
Then the leader replies with an acknowledgment of that merge operation, as follows.
@akka.tcp://[email protected]:2550/temp/$b.h....borg.opendaylight.controller.protobuff
.messages.transaction.ShardTransactionMessages$MergeDataReply"....akka.tcp://opendaylight-cluster-data@192
.168.56.101:2550/user/shardmanager-operational/member-1-shard-topology-operational/shard-member-2-txn-
0#1175605301
When the followers have the merged information, it can be written to memory. They
have to tell the leader that they will apply all the changes of the shard to memory; the
message is the following.
This also has to be acknowledged by the leader, and the message is the following.
@akka.tcp://[email protected]:2550/temp/$d.h....borg.opendaylight.controller.protobuff
.messages.transaction.ShardTransactionMessages$WriteDataReply"....akka.tcp://opendaylight-cluster-data@192. 168.56
.101:2550/user/shardmanager-operational/member-1-shard-topology-operational/shard-member-2-txn-0#1175605301
Once all the followers have the information in memory, they notify the leader that the
information can be committed into its data tree. The message looks like the following.
...p.....;9akka.tcp://[email protected]:2550/.......member-2-txn-0........user......shardmanager-
operational.3.../member-1-shard-topology-operational#-971300728"lorg.opendaylight.controller.protobuff.messages.c
ohort3pc.ThreePhaseCommitCohortMessages$CanCommitTransaction(..."[email protected]://opendaylight-cluster-
[email protected]:2550/temp/$f
@akka.tcp://[email protected]:2550/temp/$f.y......qorg.opendaylight.controller.protobuff.messages
.cohort3pc.ThreePhaseCommitCohortMessages$CanCommitTransactionReply"....akka.tcp://opendaylight-cluster-
data@192. 168.56.101:2550/user/shardmanager-operational/member-1-shard-topology-operational#-1971300728
Finally, the leader can establish which of the followers hold a given piece of
information. If that number of followers forms a majority, the leader goes into its data
tree and commits the information. After this point the information will never be
overwritten, which means that it is durable and persistent.
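A minimal sketch of this majority rule (using hypothetical leader and follower objects,
not the actual ODL classes) could look like the following:

    def maybe_commit(leader, index):
        # Count the leader itself plus every follower that has acknowledged
        # (replicated) the entry at 'index'.
        acks = 1 + sum(1 for follower in leader.followers
                       if follower.match_index >= index)

        # Commit only once a majority of the whole cluster holds the entry.
        if acks > (len(leader.followers) + 1) // 2 and index > leader.commit_index:
            leader.commit_index = index
            leader.apply_to_data_tree(index)   # durable and persistent from here on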
12. Bandwidth Usage Analysis
This chapter addresses the main objective of the thesis: the evaluation of the traffic
exchanged between controllers, the "east-west" traffic. In this work a cluster topology
was implemented together with a Mininet network [21]; this virtual network was
varied in size with the aim of analyzing the bandwidth usage of the network under
different situations. The experimental methodology and all the steps of the analysis
are also described here.
12.2. Mininet
As described in [21], Mininet is a network emulator which creates a network of virtual
hosts, switches, controllers, and links. Mininet hosts run standard Linux network
software, and its switches support OpenFlow for highly flexible custom routing and
Software-Defined Networking. It creates a realistic virtual network, running real
kernel, switch and application code, on a single machine (VM, cloud or native), in
seconds, with a single command. It supports research, development, learning,
prototyping, testing, debugging, and any other tasks that could benefit from having
a complete experimental network on a laptop or other PC.
By default Mininet brings its own controller, but it is also possible to use a remote
one. This is exactly the kind of network emulator needed for the experimentation: the
Mininet network is connected to the SDN controller (OpenDayLight) through a
TCP/IP connection. The simplest network in Mininet is composed of a single switch
and two hosts, and it can be started with a command like the one below.
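For illustration, running mn with no arguments creates this default single-switch,
two-host network with Mininet's own controller; pointing the same topology at the
remote OpenDayLight controller (assuming the controller address used elsewhere in
this setup and the default OpenFlow port) is done with the --controller option:

    sudo mn
    sudo mn --controller=remote,ip=192.168.56.101,port=6633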
There are also some other pre-established configurations that can be created with a
single command, such as the single, linear and tree topologies, but it is also possible
to create custom topologies with the help of Python scripts.
In this case the linear topology is used; it can be run in the following way.
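A possible form of this command (the number of switches and the controller address
are examples, not the exact values used in every run) is:

    sudo mn --topo linear,3 --controller=remote,ip=192.168.56.101,port=6633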
The command always starts with "mn", followed by parameters that define the kind
of topology and the controller to be used. Here only the remote controller is used, in
order to test the OpenDayLight controller.
In the linear topology all the switches are connected in a linear fashion and each
switch has its own host, as shown in Figure 12-1.
Figure 12-1 – Linear Topology
After running the command for the network creation, the output shows how the
network is created, and right afterwards the system opens the Mininet terminal,
shown in Figure 12-2, in which commands related to the network can be executed.
Figure 12-2 - Mininet terminal
In this terminal it is possible to perform connectivity tests (pinging) and to obtain
information about the network, such as nodes, links and addresses.
During the whole work the cluster architecture was composed of three nodes, that is,
three instances of the OpenDayLight controller. For simplicity, in this part only the
traffic between two of them is analyzed: node 1 is named controller A and node 2
controller B. With these two nodes the data capture is performed using the Wireshark
tool, and the traffic is captured in both directions, from A → B and from B → A.
The Wireshark capture contains a lot of information, but here the focus is on the
amount of kbps transferred between controllers A and B. With this information it is
possible to make further estimations and analyses; unfortunately, Wireshark does not
have an export format that can be used directly in Matlab. For this reason some Linux
commands are used to extract the required data from the Wireshark file: basically,
the original information is filtered and then only the important part is selected. This
is performed with a command of the following form.
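The original command is not reproduced here; a command along these lines, using
tshark's io,stat statistics with a 1 s interval, would produce the required two-column
file (the exact line offsets for tail and head and the field numbers for awk depend on
the tshark output format and version, so they are illustrative only):

    tshark -r capture.pcapng -q -z io,stat,1 | tail -n +13 | head -n -1 | awk '{print $2, $8}' > traffic.txt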
The command filters the source file and uses a sampling frequency to obtain a matrix
of data that can be differentiated in time; from this matrix the Linux commands tail,
head and awk are used to select the columns of interest, as shown in Figure 12-3.
Figure 12-3 – Data matrix Capture
In this case only the time and bytes columns are selected. The last part of the
command saves a file containing the information needed for the analysis; the result
is a two-column file, containing time and bytes, that can subsequently be loaded into
Matlab.
For the experimentation, a Mininet network was connected to the cluster at t = 60 s
and removed at t = 120 s. The data obtained is then processed in Matlab.
In order to graph the bandwidth usage in a more suitable way, some processing is
needed; for this thesis the sliding-window approach [23] was used. This approach
basically takes all the bytes exchanged between the controllers, sums them over a
window and then normalizes the result by the sample time used before.
As described in [23], the sliding-window approach works as follows (a small sketch
is given after this description).
A window size W is chosen, wider than the sampling interval.
The window is centered on the starting sample of the signal; the mean of the signal
values inside the window is calculated and assigned to the center position.
In the next iteration the window moves one sample to the right and the mean is
computed in the same way in the current window; this continues until the end of the
signal, with the windows overlapping at each iteration.
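A minimal Python sketch of this smoothing step follows; it is an illustration only, as
the thesis implementation was done in Matlab, and the input file name, the window
width and the sampling interval Ts below are placeholders.

    import numpy as np

    def sliding_window_mean(samples, window):
        # Centered moving average: the window is centred on each sample and the
        # mean of the values inside it is assigned to that position.
        half = window // 2
        smoothed = np.empty(len(samples))
        for i in range(len(samples)):
            lo = max(0, i - half)
            hi = min(len(samples), i + half + 1)
            smoothed[i] = np.mean(samples[lo:hi])
        return smoothed

    # time/bytes columns exported earlier; Ts is the sampling interval in seconds.
    t, b = np.loadtxt("traffic.txt", unpack=True)
    Ts = 1.0
    bandwidth_kbps = sliding_window_mean(b, window=11) * 8.0 / 1000.0 / Ts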
At the end of the processing the signal is smoother than the original one, but the data
can still be used to construct the model. Figure 12-4 shows the bandwidth usage for
a Mininet network of 3 nodes, for different values of Ts and W.
Figure 12-4 – Bandwidth Usage Graph A → B
12.4. Modeling
Once the capture is finished, the data can be used to build an experimental model of
the traffic exchanged between OpenDayLight controllers; this model can be used for
the evaluation of scalability in SDN networks. With it, different scenarios in an SDN
network based on the OpenDayLight controller can easily be predicted; the model
makes use of all the experimental captures in order to create an estimation for all the
possible cases.
The procedure is the following. First, the required data has to be obtained; for this
purpose a Mininet network with a linear architecture and varying size was connected
to the controllers.
During the experimentation the procedure was repeated 5 times for each topology
size, in order to obtain a more accurate measurement. The process is the following.
1. Start all the controllers that are going to be part of the network.
2. Set the desired topology and connect it to the controllers.
3. Start the packet capture for 180 s while the virtual network is connected to the controllers.
4. Stop the virtual network after the established time.
5. Stop the packet capture 60 s after stopping the virtual network.
6. Shut down all the controllers in the cluster.
7. Extract the useful data from the packet capture using the Tshark tool. The information is extracted into a .txt file containing two variables: the time and the number of bits used.
8. Open the .txt file in Matlab for processing.
9. Generate the model and the graphs.
The model is generated using the amount of bytes transferred (the "bandwidth") from
A → B and from B → A, together with the topology size. This information is passed
to the Matlab polynomial curve fitting function [24], which uses the least-squares
method to find the best-fit curve. As a result, Matlab gives a curve fitted on the mean
values. Figure 12-6 shows the results of the model from A → B and from B → A.
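The thesis uses Matlab for this fitting step; an equivalent sketch in Python with
NumPy is shown below, where the node counts and bandwidth values are illustrative
placeholders, not measured results.

    import numpy as np

    # Placeholder data: mean bandwidth (kbps) from A to B for each linear-topology size.
    nodes = np.array([3, 6, 9, 12, 15])
    mean_bw_a_to_b = np.array([40.0, 55.0, 70.0, 86.0, 101.0])

    # First-degree least-squares fit, analogous to Matlab's polyfit/polyval pair.
    coefficients = np.polyfit(nodes, mean_bw_a_to_b, 1)
    fitted = np.polyval(coefficients, nodes)
    print("slope = %.2f kbps per node, intercept = %.2f kbps"
          % (coefficients[0], coefficients[1]))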
Figure 12-6 – Bandwidth Usage Modeling
In this graph it is possible to see how the bandwidth varies with respect to the number
of nodes in the network.
In the first case, from A → B, the bandwidth increases linearly with the number of
nodes.
In the second case, from B → A, the same linear increase is observed, but the
bandwidth consumed is much lower than in the previous case.
The explanation of this behavior is that in OpenDayLight clustering the information
flows only from the leader to the followers, which is why there is more bandwidth
usage from A to B than from B to A. The leader has to send all the information about
topologies and inventory as well as the information related to the membership
between controllers, whereas the other nodes only have to exchange membership
information, which is why they consume less bandwidth.
13. Conclusions
The idea of migrating traditional networks to Software Defined Networks is being
supported by most of the major vendors and companies; therefore further research in
the field of SDN controllers is needed in order to determine which one is the most
efficient for the network. SDN networks have demonstrated that, thanks to their
decoupled architecture, they can be more flexible and reliable in shaping traffic along
the network.
This thesis discussed the importance of having a distributed SDN control plane, both
to avoid a single point of failure and because a distributed system makes it possible
to achieve high availability, scalability and persistency of the information. The
behavior of the OpenDayLight controller under the cluster architecture was also
analyzed, with the aim of better understanding how the controller acts. One of the
biggest parts of the work was the investigation of the protocols used for the
communication between controllers; in this part the different messages exchanged
between controllers were also explained. This was performed with the help of
Wireshark, capturing traffic on the interfaces involved and analyzing the packets in
order to understand the communication.
In the last part of the thesis a bandwidth usage analysis was performed on traffic
captured between controllers. Wireshark was again used for the data capture and
Matlab for the data processing. In this part an experimental model of the bandwidth
usage in different scenarios was built. The process consisted in varying the topology
size of the network; this was implemented with a linear topology in Mininet, which
was then connected to the master controller.
With the model it was possible to observe the behavior of the bandwidth usage with
respect to the topology size; the final result is that the bandwidth grows linearly with
the topology size.
Another possible future work is the implementation of the cluster on real equipment,
since the deployment was performed in a virtualized environment and a real
implementation could lead to some differences. This would make it possible to verify
the results obtained and to compare the bandwidth usage model on real networks.
References
[1] Cisco, "Cisco Visual Networking Index: Forecast and Methodology, 2014-2019," White Paper, 2015.
[2] Open Networking Foundation, "Software Defined Networking: The New Norm for Networks," ONF White Paper, 2012.
[5] i. s. & i. group, "Software Defined Networking Primer + Deep Dive into Big Switch Networks," Technology Research, 2012.
[10] J. Ogando, "Distributed control: another choice for multi-station loading systems," Plastics Technology, 1995.
[11] I. Lee, University of Pennsylvania, "Introduction to Distributed Systems, Lecture Notes," 2014. [Online]. Available: http://www.cis.upenn.edu/~lee/00cse380/lectures/ln13-ds.ppt.
[14] M. Raja, "MD-SAL Clustering Internals," Linux Foundation, 2015. [Online]. Available: http://events.linuxfoundation.org/sites/events/files/slides/MD-SAL%20Clustering%20Internals.pdf.
[19] Linux Foundation, "OpenDayLight User Guide - Helium Release," 2015. [Online]. Available: https://www.opendaylight.org/software/release-archives.