Build Your Own Redis with C/C++
James Smith
2023-01-31
01. Introduction
02. Introduction to Sockets
03. Hello Server/Client
04. Protocol Parsing
05. The Event Loop and Nonblocking IO
06. The Event Loop Implementation
07. Basic Server: get, set, del
08. Data Structure: Hashtables
09. Data Serialization
10. The AVL Tree: Implementation & Testing
11. The AVL Tree and the Sorted Set
12. The Event Loop and Timers
13. The Heap Data Structure and the TTL
14. The Thread Pool & Asynchronous Tasks
A1: Hints to Exercises
01. Introduction
What Is This Book About?
This book contains a step-by-step walkthrough of a simple
implementation of a Redis-like server. It is intended as a practical
guide or tutorial to network programming and the implementation and
application of basic data structures in C.
The end result is a mini Redis clone with only about 1200 lines of
code. 1200 LoC seems low, but it is enough to illustrate many of the
important aspects the book attempts to cover.
The techniques and approaches used in the book are not exactly the
same as the real Redis. Some are intentionally simplified, and some
are chosen to illustrate a general topic. Readers can learn even more
by comparing different approaches.
The code used in this book is intended to run on Linux only, and can
be downloaded at this URL:
https://build-your-own.org/redis/src.tgz
The contents and the source code of this book can be browsed online
at:
https://build-your-own.org
02. Introduction to Sockets
This chapter is an introduction to socket programming. Readers are
assumed to have basic knowledge of computer networking but no
experience in network programming. This book does not contain
every detail on how to use the socket APIs; you are advised to read
manpages and other network programming guides while learning from
this book. (https://beej.us/ is a good source for socket APIs.)
The typical workflow of a TCP server:

fd = socket()
bind(fd, address)
listen(fd)
while True:
    conn_fd = accept(fd)
    do_something_with(conn_fd)
    close(conn_fd)

And that of a TCP client:

fd = socket()
connect(fd, address)
do_something_with(fd)
close(fd)
The next chapter will help you get started using real code.
03. Hello Server/Client
This chapter continues the introduction of socket programming. We'll
write 2 simple (incomplete and broken) programs to demonstrate the
syscalls from the last chapter. The first program is a server: it accepts
connections from clients, reads a single message, and writes a single
reply. The second program is a client: it connects to the server, writes
a single message, and reads a single reply. Let's start with the server
first.
AF_INET is for IPv4; use AF_INET6 for IPv6 or a dual-stack socket.
For simplicity, we'll just use AF_INET throughout this book.
SOCK_STREAM is for TCP. We won't use anything other than TCP
in this book. All three parameters of the socket() call are fixed in
this book.
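Creating the socket, as in the accompanying 03_server.cpp (die() is a
small helper that prints the error and aborts):

int fd = socket(AF_INET, SOCK_STREAM, 0);
if (fd < 0) {
    die("socket()");
}

The SO_REUSEADDR option below allows the server to rebind to the same
address after a restart: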
int val = 1;
setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &val, sizeof(val));
The next step is bind() and listen(); we'll bind on the
wildcard address 0.0.0.0:1234:
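A sketch of the bind() call (htons/htonl convert values to network
byte order; INADDR_ANY is the wildcard address 0.0.0.0):

struct sockaddr_in addr = {};
addr.sin_family = AF_INET;
addr.sin_port = htons(1234);                 // port 1234
addr.sin_addr.s_addr = htonl(INADDR_ANY);    // wildcard address 0.0.0.0
int rv = bind(fd, (const struct sockaddr *)&addr, sizeof(addr));
if (rv) {
    die("bind()");
}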
// listen
rv = listen(fd, SOMAXCONN);
if (rv) {
die("listen()");
}
while (true) {
    // accept
    struct sockaddr_in client_addr = {};
    socklen_t socklen = sizeof(client_addr);
    int connfd = accept(fd, (struct sockaddr *)&client_addr, &socklen);
    if (connfd < 0) {
        continue;   // error
    }

    do_something(connfd);
    close(connfd);
}
Note that the read() and write() calls return the number of bytes read
or written. A real program must deal with the return values of these
functions, but in this chapter, I have omitted lots of things for brevity.
And the code in this chapter is not the correct way to do networking
anyway.
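For reference, a do_something() along these lines, with a single read()
and a single write() and the return values mostly ignored, produces the
output below (msg() is assumed to be a simple logging helper):

static void do_something(int connfd) {
    char rbuf[64] = {};
    ssize_t n = read(connfd, rbuf, sizeof(rbuf) - 1);
    if (n < 0) {
        msg("read() error");
        return;
    }
    printf("client says: %s\n", rbuf);

    char wbuf[] = "world";
    write(connfd, wbuf, strlen(wbuf));
}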
$ ./server
client says: hello
$ ./client
server says: world
03_client.cpp
03_server.cpp
04. Protocol Parsing
Our server will be able to process multiple requests from a client; to
do that we need to implement some sort of "protocol", at least to split
requests apart from the TCP byte stream. The easiest way to split
requests apart is by declaring how long the request is at the beginning
of the request. Let's use the following scheme.
+-----+------+-----+------+--------
| len | msg1 | len | msg2 | more...
+-----+------+-----+------+--------
Starting from the code from the last chapter, the loop of the server is
modified to handle multiple requests:
while (true) {
    // accept
    struct sockaddr_in client_addr = {};
    socklen_t socklen = sizeof(client_addr);
    int connfd = accept(fd, (struct sockaddr *)&client_addr, &socklen);
    if (connfd < 0) {
        continue;   // error
    }

    // only serves one client connection at once
    while (true) {
        int32_t err = one_request(connfd);
        if (err) {
            break;
        }
    }
    close(connfd);
}
The one_request function parses only one request and replies, until
something bad happens or the client connection is gone. Our server
can only handle one connection at a time until we introduce the event
loop in later chapters.
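The listings below rely on two helpers, read_full() and write_all(),
which loop until exactly n bytes have been read or written, since
read()/write() may transfer fewer bytes than requested. A sketch
matching their use here:

static int32_t read_full(int fd, char *buf, size_t n) {
    while (n > 0) {
        ssize_t rv = read(fd, buf, n);
        if (rv <= 0) {
            return -1;  // error, or unexpected EOF
        }
        assert((size_t)rv <= n);
        n -= (size_t)rv;
        buf += (size_t)rv;
    }
    return 0;
}

static int32_t write_all(int fd, const char *buf, size_t n) {
    while (n > 0) {
        ssize_t rv = write(fd, buf, n);
        if (rv <= 0) {
            return -1;  // error
        }
        assert((size_t)rv <= n);
        n -= (size_t)rv;
        buf += (size_t)rv;
    }
    return 0;
}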
The body of the one_request function:

uint32_t len = 0;
memcpy(&len, rbuf, 4);  // assume little endian
if (len > k_max_msg) {
    msg("too long");
    return -1;
}

// request body
err = read_full(connfd, &rbuf[4], len);
if (err) {
    msg("read() error");
    return err;
}

// do something
rbuf[4 + len] = '\0';
printf("client says: %s\n", &rbuf[4]);
On the client side, the tail of the query function reads the reply the
same way:

// 4 bytes header
char rbuf[4 + k_max_msg + 1];
errno = 0;
int32_t err = read_full(fd, rbuf, 4);
if (err) {
    if (errno == 0) {
        msg("EOF");
    } else {
        msg("read() error");
    }
    return err;
}

uint32_t len = 0;
memcpy(&len, rbuf, 4);  // assume little endian
if (len > k_max_msg) {
    msg("too long");
    return -1;
}

// reply body
err = read_full(fd, &rbuf[4], len);
if (err) {
    msg("read() error");
    return err;
}

// do something
rbuf[4 + len] = '\0';
printf("server says: %s\n", &rbuf[4]);
return 0;
}
int main() {
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0) {
        die("socket()");
    }

    // code omitted: connect to 127.0.0.1:1234

    // multiple requests
    int32_t err = query(fd, "hello1");
    if (err) {
        goto L_DONE;
    }
    err = query(fd, "hello2");
    if (err) {
        goto L_DONE;
    }
    err = query(fd, "hello3");
    if (err) {
        goto L_DONE;
    }

L_DONE:
    close(fd);
    return 0;
}
$ ./server
client says: hello1
client says: hello2
client says: hello3
EOF
$ ./client
server says: world
server says: world
server says: world
04_client.cpp
04_server.cpp
05. The Event Loop and Nonblocking IO
There are 3 ways to deal with concurrent connections in server-side
network programming. They are: forking, multi-threading, and event
loops. Forking creates new processes for each client connection to
achieve concurrency. Multi-threading uses threads instead of
processes. An event loop uses polling and nonblocking IO and usually
runs on a single thread. Due to the overhead of processes and threads,
most modern production-grade software uses event loops for
networking.
The simplified pseudo-code for the event loop of our server is:
all_fds = [...]
while True:
    active_fds = poll(all_fds)
    for each fd in active_fds:
        do_something_with(fd)

def do_something_with(fd):
    if fd is a listening socket:
        add_new_client(fd)
    elif fd is a client connection:
        while work_not_done(fd):
            do_something_to_client(fd)

def do_something_to_client(fd):
    if should_read_from(fd):
        data = read_until_EAGAIN(fd)
        process_incoming_data(data)
    while should_write_to(fd):
        write_until_EAGAIN(fd)
    if should_close(fd):
        destroy_client(fd)
In blocking mode, read blocks the caller when there is no data in the
kernel, write blocks when the write buffer is full, and accept blocks
when there are no new connections in the kernel queue. In
nonblocking mode, those operations either succeed without blocking
or fail with the errno EAGAIN, which means "not ready". Nonblocking
operations that fail with EAGAIN must be retried after their readiness
is notified by the poll.
The fd_set_nb helper sets an fd to nonblocking mode:

static void fd_set_nb(int fd) {
    errno = 0;
    int flags = fcntl(fd, F_GETFL, 0);
    if (errno) {
        die("fcntl error");
        return;
    }
    flags |= O_NONBLOCK;
    errno = 0;
    (void)fcntl(fd, F_SETFL, flags);
    if (errno) {
        die("fcntl error");
    }
}
On Linux, besides the poll syscall, there are also select and epoll.
The ancient select syscall is basically the same as poll, except
that the maximum fd number is limited to a small value, which
makes it obsolete in modern applications. The epoll API consists of
3 syscalls: epoll_create, epoll_wait, and epoll_ctl. The epoll
API is stateful: instead of supplying a set of fds as a syscall argument,
epoll_ctl is used to manipulate an fd set created by
epoll_create, which epoll_wait operates on.
We'll use the poll syscall in the next chapter since it's slightly less
code than the stateful epoll API. However, the epoll API is
preferable in real-world projects since the argument for poll can
become too large as the number of fds increases.
06. The Event Loop Implementation
This chapter walks through the real C++ code of an echo server.
enum {
    STATE_REQ = 0,
    STATE_RES = 1,
    STATE_END = 2,  // mark the connection for deletion
};

struct Conn {
    int fd = -1;
    uint32_t state = 0;     // either STATE_REQ or STATE_RES
    // buffer for reading
    size_t rbuf_size = 0;
    uint8_t rbuf[4 + k_max_msg];
    // buffer for writing
    size_t wbuf_size = 0;
    size_t wbuf_sent = 0;
    uint8_t wbuf[4 + k_max_msg];
};
We need buffers for reading/writing, since in nonblocking mode, IO
operations are often deferred.
The state is used to decide what to do with the connection. There are
2 states for an ongoing connection. The STATE_REQ is for reading
requests and the STATE_RES is for sending responses.
int main() {
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0) {
        die("socket()");
    }

    // code omitted: bind, listen, fd_set_nb, and the event loop

    return 0;
}
The first thing in our event loop is setting up the arguments of poll. The
listening fd is polled with the POLLIN flag. For a connection fd, the
state of the struct Conn determines the poll flag. In this particular case,
the poll flag is either reading (POLLIN) or writing (POLLOUT), never
both. If using epoll, the first thing in an event loop is usually
updating the fd set with epoll_ctl.
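A sketch of this setup (fd2conn is the fd-indexed vector of client
connections used by this server):

std::vector<struct pollfd> poll_args;
while (true) {
    poll_args.clear();
    // the listening fd is polled with the POLLIN flag
    struct pollfd pfd = {fd, POLLIN, 0};
    poll_args.push_back(pfd);
    // connection fds: the poll flag depends on the state
    for (Conn *conn : fd2conn) {
        if (!conn) {
            continue;
        }
        struct pollfd pfd = {};
        pfd.fd = conn->fd;
        pfd.events = (conn->state == STATE_REQ) ? POLLIN : POLLOUT;
        pfd.events = pfd.events | POLLERR;
        poll_args.push_back(pfd);
    }

    // poll for active fds
    int rv = poll(poll_args.data(), (nfds_t)poll_args.size(), 1000);
    if (rv < 0) {
        die("poll");
    }
    // code omitted: process active connections
}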
    conn->rbuf_size += (size_t)rv;
    assert(conn->rbuf_size <= sizeof(conn->rbuf));

    // Try to process requests one by one.
    // Why is there a loop? Please read the explanation of "pipelining".
    while (try_one_request(conn)) {}
    return (conn->state == STATE_REQ);
}
def do_something_to_client(fd):
    if should_read_from(fd):
        data = read_until_EAGAIN(fd)
        process_incoming_data(data)
    # code omitted...
The read syscall (and any other syscall) needs to be retried after
getting the errno EINTR. EINTR means the syscall was interrupted
by a signal; the retrying is needed even if our application does not
make use of signals.
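A minimal sketch of the retry loop used by the read/write helpers:

ssize_t rv = 0;
do {
    // simply retry when interrupted by a signal
    rv = read(conn->fd, &conn->rbuf[conn->rbuf_size], cap);
} while (rv < 0 && errno == EINTR);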
// change state
conn->state = STATE_RES;
state_res(conn);
The above code flushes the write buffer until it gets EAGAIN, or
transitions back to STATE_REQ when the flushing is done.
To test our server, we can run the client from chapter 04 since the
protocol is identical. We can also modify the client to demonstrate
pipelining:
int main() {
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0) {
        die("socket()");
    }

    // code omitted...

L_DONE:
    close(fd);
    return 0;
}
Exercises:
1. Try to use epoll instead of poll in the event loop. This should
be easy.
2. We are using memmove to reclaim read buffer space. However,
memmove on every request is unnecessary; change the code to
perform memmove only before read().
3. In the state_res function, write was performed for a single
response. In pipelined scenarios, we could buffer multiple
responses and flush them at the end with a single write call.
Note that the write buffer could be full in the middle.
06_client.cpp
06_server.cpp
07. Basic Server: get, set, del
With the event loop code from the last chapter, we can finally start
adding commands to our server.
The “command” in our design is a list of strings, like set key val.
We’ll encode the “command” with the following scheme.
+------+-----+------+-----+------+-----+-----+------+
| nstr | len | str1 | len | str2 | ... | len | strn |
+------+-----+------+-----+------+-----+-----+------+
The nstr is the number of strings and the len is the length of the
following string. Both are 32-bit integers.
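On the client side, a command can be encoded under this scheme as in
the following sketch (write_all is the helper from chapter 04):

static int32_t send_req(int fd, const std::vector<std::string> &cmd) {
    uint32_t len = 4;
    for (const std::string &s : cmd) {
        len += 4 + s.size();
    }
    if (len > k_max_msg) {
        return -1;
    }

    char wbuf[4 + k_max_msg];
    memcpy(&wbuf[0], &len, 4);  // assume little endian
    uint32_t n = (uint32_t)cmd.size();
    memcpy(&wbuf[4], &n, 4);
    size_t cur = 8;
    for (const std::string &s : cmd) {
        uint32_t p = (uint32_t)s.size();
        memcpy(&wbuf[cur], &p, 4);
        memcpy(&wbuf[cur + 4], s.data(), s.size());
        cur += 4 + s.size();
    }
    return write_all(fd, wbuf, 4 + len);
}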
// change state
conn->state = STATE_RES;
state_res(conn);
The request is parsed by parse_req:

static int32_t parse_req(
    const uint8_t *data, size_t len, std::vector<std::string> &out)
{
    if (len < 4) {
        return -1;
    }
    uint32_t n = 0;
    memcpy(&n, &data[0], 4);
    if (n > k_max_args) {
        return -1;  // safety limit on the number of strings
    }

    size_t pos = 4;
    while (n--) {
        if (pos + 4 > len) {
            return -1;
        }
        uint32_t sz = 0;
        memcpy(&sz, &data[pos], 4);
        if (pos + 4 + sz > len) {
            return -1;
        }
        out.push_back(std::string((char *)&data[pos + 4], sz));
        pos += 4 + sz;
    }

    if (pos != len) {
        return -1;  // trailing garbage
    }
    return 0;
}
enum {
    RES_OK = 0,
    RES_ERR = 1,
    RES_NX = 2,
};

// The data structure for the key space. This is just a placeholder
// until we implement a hashtable in the next chapter.
static std::map<std::string, std::string> g_map;
    std::vector<std::string> cmd;
    for (int i = 1; i < argc; ++i) {
        cmd.push_back(argv[i]);
    }
    int32_t err = send_req(fd, cmd);
    if (err) {
        goto L_DONE;
    }
    err = read_res(fd);
    if (err) {
        goto L_DONE;
    }

L_DONE:
    close(fd);
    return 0;
}
Testing commands:
$ ./client get k
server says: [2]
$ ./client set k v
server says: [0]
$ ./client get k
server says: [0] v
$ ./client del k
server says: [0]
$ ./client get k
server says: [2]
$ ./client aaa bbb
server says: [1] Unknown cmd
07_client.cpp
07_server.cpp
08. Data Structure: Hashtables
This chapter fills the placeholder code in the last chapter’s server.
We’ll start by implementing a hashtable. Hashtables are often the
obvious data structure for holding an unknown amount of key-value
data that does not require ordering.
When the size of the hashtable is a power of two, the indexing
operation is a simple bit mask with the hash code.
// n must be a power of 2
static void h_init(HTab *htab, size_t n) {
    assert(n > 0 && ((n - 1) & n) == 0);
    htab->tab = (HNode **)calloc(n, sizeof(HNode *));
    htab->mask = n - 1;
    htab->size = 0;
}

// hashtable insertion
static void h_insert(HTab *htab, HNode *node) {
    size_t pos = node->hcode & htab->mask;
    HNode *next = htab->tab[pos];
    node->next = next;
    htab->tab[pos] = node;
    htab->size++;
}
Deleting is easy. Notice how the use of a pointer-to-pointer enables
succinct code: the from pointer can point either to a slot of the array
or to the next field of a node, yet the code doesn't differentiate.
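A sketch of the lookup and detach subroutines this refers to, along the
lines of the accompanying hashtable.cpp:

// hashtable lookup: returns the address of the incoming pointer to the
// matching node, which is either a slot of the array or a node's next field
static HNode **h_lookup(
    HTab *htab, HNode *key, bool (*cmp)(HNode *, HNode *))
{
    if (!htab->tab) {
        return NULL;
    }
    size_t pos = key->hcode & htab->mask;
    HNode **from = &htab->tab[pos];
    while (*from) {
        if (cmp(*from, key)) {
            return from;
        }
        from = &(*from)->next;
    }
    return NULL;
}

// remove a node from the chain via the address of the incoming pointer
static HNode *h_detach(HTab *htab, HNode **from) {
    HNode *node = *from;
    *from = node->next;
    htab->size--;
    return node;
}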
HNode *hm_lookup(
    HMap *hmap, HNode *key, bool (*cmp)(HNode *, HNode *))
{
    hm_help_resizing(hmap);
    HNode **from = h_lookup(&hmap->ht1, key, cmp);
    if (!from) {
        from = h_lookup(&hmap->ht2, key, cmp);
    }
    return from ? *from : NULL;
}
The resizing work is performed progressively by hm_help_resizing:

static void hm_help_resizing(HMap *hmap) {
    if (hmap->ht2.tab == NULL) {
        return;
    }

    size_t nwork = 0;
    while (nwork < k_resizing_work && hmap->ht2.size > 0) {
        // scan for nodes from ht2 and move them to ht1
        HNode **from = &hmap->ht2.tab[hmap->resizing_pos];
        if (!*from) {
            hmap->resizing_pos++;
            continue;
        }
        h_insert(&hmap->ht1, h_detach(&hmap->ht2, from));
        nwork++;
    }

    if (hmap->ht2.size == 0) {
        // done
        free(hmap->ht2.tab);
        hmap->ht2 = HTab{};
    }
}
The insertion subroutine will trigger resizing should the table become
too full:
void hm_insert(HMap *hmap, HNode *node) {
    if (!hmap->ht1.tab) {
        h_init(&hmap->ht1, 4);  // lazily initialized
    }
    h_insert(&hmap->ht1, node);

    if (!hmap->ht2.tab) {
        // check whether we need to resize
        size_t load_factor = hmap->ht1.size / (hmap->ht1.mask + 1);
        if (load_factor >= k_max_load_factor) {
            hm_start_resizing(hmap);
        }
    }
    hm_help_resizing(hmap);
}
HNode *hm_pop(
    HMap *hmap, HNode *key, bool (*cmp)(HNode *, HNode *))
{
    hm_help_resizing(hmap);
    HNode **from = h_lookup(&hmap->ht1, key, cmp);
    if (from) {
        return h_detach(&hmap->ht1, from);
    }
    from = h_lookup(&hmap->ht2, key, cmp);
    if (from) {
        return h_detach(&hmap->ht2, from);
    }
    return NULL;
}
Instead of making our data structure contain data, the hashtable node
structure is embedded into the payload data. This is the standard way
of creating generic data structures in C. Besides making the data
structure fully generic, this technique also has the advantage of
reducing unnecessary memory management: the node is not
separately allocated but is part of the payload data, and the data
structure code does not own the payload but merely organizes the
data. This may be quite a new idea to you if you learned data
structures from textbooks, which probably use void *, C++
templates, or even macros.
Listing the do_get function to see how the intrusive data structure is
used:
Entry key;
key.key.swap(cmd[1]);
key.node.hcode = str_hash((uint8_t *)key.key.data(), key.key.size());
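For context, a sketch of the intrusive Entry type and the container_of
macro that recovers the payload from the embedded node (the g_data.db
and entry_eq names follow the accompanying 08_server.cpp):

// the payload data with the embedded hashtable node
struct Entry {
    struct HNode node;
    std::string key;
    std::string val;
};

// recover the address of the containing struct from a member pointer
#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

// the rest of do_get: look up the node, then recover the Entry
HNode *node = hm_lookup(&g_data.db, &key.node, &entry_eq);
if (!node) {
    return RES_NX;
}
const std::string &val = container_of(node, Entry, node)->val;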
Exercises:
1. Our hashtable triggers resizing when the load factor is too high,
should we also shrink the hashtable when the load factor is too
low? Can the shrinking be performed automatically?
08_server.cpp
hashtable.cpp
hashtable.h
09. Data Serialization
For now, our server protocol response is an error code plus a string.
What if we need to return more complicated data? For example, we
might add the keys command that returns a list of strings. We have
already encoded the list-of-strings data in the request protocol. In this
chapter, we will generalize the encoding to handle different types of
data. This is often called “serialization”.
enum {
SER_NIL = 0,
SER_ERR = 1,
SER_STR = 2,
SER_INT = 3,
SER_ARR = 4,
};
The SER_NIL is like NULL, the SER_ERR is for returning an error code
and a message, the SER_STR and SER_INT are for strings and int64s, and
the SER_ARR is for arrays.
// code omitted...
}
As we can see, our serialization protocol starts with one byte of data
type, followed by various types of payload data. Arrays come with
their size first, then their possibly nested elements.
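A sketch of the corresponding output helpers under this scheme (the
names mirror the accompanying 09_server.cpp):

static void out_nil(std::string &out) {
    out.push_back(SER_NIL);
}

static void out_str(std::string &out, const std::string &val) {
    out.push_back(SER_STR);
    uint32_t len = (uint32_t)val.size();
    out.append((char *)&len, 4);    // the length prefix
    out.append(val);
}

static void out_int(std::string &out, int64_t val) {
    out.push_back(SER_INT);
    out.append((char *)&val, 8);
}

static void out_arr(std::string &out, uint32_t n) {
    out.push_back(SER_ARR);
    out.append((char *)&n, 4);      // the array size comes first
}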
$ ./client asdf
(err) 1 Unknown cmd
$ ./client get asdf
(nil)
$ ./client set k v
(nil)
$ ./client get k
(str) v
$ ./client keys
(arr) len=1
(str) k
(arr) end
$ ./client del k
(int) 1
$ ./client del k
(int) 0
$ ./client keys
(arr) len=0
(arr) end
09_client.cpp
09_server.cpp
hashtable.cpp
hashtable.h
10. The AVL Tree: Implementation & Testing
While Redis is often referred to as a key-value store, the "value" part
of Redis is not restricted to plain strings: lists, hashmaps, and sorted
sets are quite nice things to have. Redis is also referred to as the "data
structure server" due to its rich set of data structures. Redis is often
used as an in-memory cache, and when storing data in memory, there
is an advantage in freely using data structures. The sorted set data
structure in Redis is quite a unique and useful thing. Not only does it
offer the ability to sort your data in order, but it also has the unique
feature of querying ordered data by rank. If you put 20M records into
a sorted set, you can get the record ranked at 10M without going
through the first 10M records; this is a feat that cannot be emulated
by current SQL databases.
As the name "sorted set" implies, it's a data structure for sorting.
Balanced binary trees are popular data structures for storing
sorted data. Among the various options, the author finds the AVL
tree particularly simple and easy to code, so it will be used in this
book to implement the sorted set. The real Redis project uses a
skiplist, which is also considered easy to code.
The idea of the AVL tree is to restrict the height difference between
the left subtree and the right subtree. The height difference between
subtrees is restricted to be at most one, never reaching two. When
inserting/removing nodes from an AVL tree, the height difference can
temporarily reach two, which is then fixed by the node rotations. The
rotation operation is the basis of balanced binary trees, which is also
used by other balanced trees like the RB tree. After the rotation, a
node with a subtree height difference of two is reduced back to be at
most one.
struct AVLNode {
uint32_t depth = 0;
uint32_t cnt = 0;
AVLNode *left = NULL;
AVLNode *right = NULL;
AVLNode *parent = NULL;
};
This is a regular binary tree node with extra fields. The depth field is
the height of the tree. The cnt field is the size of the tree; this field
is not specific to the AVL tree, and it is used to implement rank-based
queries, which will be explained in the next chapter.
If the right subtree is too deep, a left rotation will fix it. Before the left
rotation, we may need a right rotation on the right subtree to ensure
the right subtree is leaning in the correct direction. Here is the
visualization:
  b              b              d
 / \            / \            / \
a   c   ==>    a   d    ==>   b   c
   /                \        /
  d                  c      a
Insertion for binary trees is easy: just walk down from the root until
you find an empty subtree and place the new node there, then call
avl_fix from the insertion point upward for maintenance, as in the
sketch below.
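A minimal sketch using the Data/Container test types introduced below:

static void add(Container &c, uint32_t val) {
    Data *data = new Data();        // allocate the payload
    avl_init(&data->node);
    data->val = val;

    AVLNode *cur = NULL;            // the current node
    AVLNode **from = &c.root;       // the incoming pointer to the next node
    while (*from) {                 // tree search
        cur = *from;
        uint32_t node_val = container_of(cur, Data, node)->val;
        from = (val < node_val) ? &cur->left : &cur->right;
    }
    *from = &data->node;            // attach the new node
    data->node.parent = cur;
    c.root = avl_fix(&data->node);  // maintenance, from the new node upward
}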
        *victim = *node;
        if (victim->left) {
            victim->left->parent = victim;
        }
        if (victim->right) {
            victim->right->parent = victim;
        }
        AVLNode *parent = node->parent;
        if (parent) {
            (parent->left == node ? parent->left : parent->right) = victim;
            return root;
        } else {
            // removing root?
            return victim;
        }
    }
}
This is the generic function for removing nodes from a binary tree,
with the AVL-tree-specific avl_fix.
Readers with experience with the RB tree may notice how small and
simple the AVL tree implementation is. The maintenance code for RB
tree node deletion is significantly more complicated than that for
insertion, while the AVL tree uses the same avl_fix function for both
insertion and deletion; this symmetry greatly reduces the effort
required to code an AVL tree.
Here are our testing data types. If you are not familiar with intrusive
data structures, read the hashtable chapter.
struct Data {
AVLNode node;
uint32_t val = 0;
};
struct Container {
AVLNode *root = NULL;
};
if (!c.root) {
c.root = &data->node;
return;
}
c.root = avl_del(cur);
delete container_of(cur, Data, node);
return true;
}
Here is the function for verifying the correctness of the tree structure:
uint32_t l = avl_depth(node->left);
uint32_t r = avl_depth(node->right);
assert(l == r || l + 1 == r || l == r + 1);
assert(node->depth == 1 + max(l, r));
Code for comparing the contents of the AVL tree with the expected data:
    Container c;

    // some quick tests
    container_verify(c, {});
    add(c, 123);
    container_verify(c, {123});
    assert(!del(c, 124));
    assert(del(c, 123));
    container_verify(c, {});

    // sequential insertion
    std::multiset<uint32_t> ref;
    for (uint32_t i = 0; i < 1000; i += 3) {
        add(c, i);
        ref.insert(i);
        container_verify(c, ref);
    }

    // random insertion
    for (uint32_t i = 0; i < 100; i++) {
        uint32_t val = (uint32_t)rand() % 1000;
        add(c, val);
        ref.insert(val);
        container_verify(c, ref);
    }

    // random deletion
    for (uint32_t i = 0; i < 200; i++) {
        uint32_t val = (uint32_t)rand() % 1000;
        auto it = ref.find(val);
        if (it == ref.end()) {
            assert(!del(c, val));
        } else {
            assert(del(c, val));
            ref.erase(it);
        }
        container_verify(c, ref);
    }
    // the tail of the positional insertion test
        // code omitted: build a tree and a reference multiset without val
        add(c, val);
        ref.insert(val);
        container_verify(c, ref);
        dispose(c);
    }
}

    // the tail of the positional deletion test
        // code omitted: build a tree and a reference multiset containing val
        assert(del(c, val));
        ref.erase(val);
        container_verify(c, ref);
        dispose(c);
    }
}
Exercises:
1. While there is not much code for our AVL tree, this AVL tree
implementation is probably not a very efficient one. Our code
contains some redundant pointer updates, which might be a source
of optimization. Also, we don't need to store the height value for
balancing; it is possible to store the height difference instead.
Research and explore efficient AVL tree implementations.
2. Can you create more test cases? The test cases presented in this
chapter are unlikely to be sufficient.
avl.cpp
test_avl.cpp
11. The AVL Tree and the Sorted Set
Based on the AVL tree in the last chapter, the sorted set data structure
can be easily added. The structure definition:
struct ZSet {
AVLNode *tree = NULL;
HMap hmap;
};
struct ZNode {
AVLNode tree;
HNode hmap;
double score = 0;
size_t len = 0;
char name[0];
};
The sorted set is a sorted list of (score, name) pairs that supports
queries and updates by the sorting key or by the name. It's a
combination of the AVL tree and the hashtable, and the pair node
belongs to both, which demonstrates the flexibility of intrusive data
structures. The name string is embedded at the end of the pair node,
in the hope of saving some space overhead.
The function for tree insertion is roughly the same as the testing code
seen in the previous chapter, so the listing is omitted here.
Here is the primary use case of sorted sets: the range query.
if (found) {
found = avl_offset(found, offset);
}
return found ? container_of(found, ZNode, tree) : NULL
}
struct AVLNode {
uint32_t depth = 0;
uint32_t cnt = 0;
AVLNode *left = NULL;
AVLNode *right = NULL;
AVLNode *parent = NULL;
};
It has an extra cnt field (the size of the tree), which was not explained
in the previous chapter. It is used by the avl_offset function:
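A sketch of avl_offset along the lines of the accompanying avl.cpp
(avl_cnt(node) returns node ? node->cnt : 0):

// walk to the node that is `offset` positions away in sorted order,
// using the subtree sizes to skip entire subtrees
AVLNode *avl_offset(AVLNode *node, int64_t offset) {
    int64_t pos = 0;    // the rank difference from the starting node
    while (offset != pos) {
        if (pos < offset && pos + avl_cnt(node->right) >= offset) {
            // the target is inside the right subtree
            node = node->right;
            pos += avl_cnt(node->left) + 1;
        } else if (pos > offset && pos - avl_cnt(node->left) <= offset) {
            // the target is inside the left subtree
            node = node->left;
            pos -= avl_cnt(node->right) + 1;
        } else {
            // go to the parent
            AVLNode *parent = node->parent;
            if (!parent) {
                return NULL;    // out of range
            }
            if (parent->right == node) {
                pos -= avl_cnt(node->left) + 1;
            } else {
                pos += avl_cnt(node->right) + 1;
            }
            node = parent;
        }
    }
    return node;
}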
It is a good idea to stop and test the new avl_offset function now.
static void test_case(uint32_t sz) {
    Container c;
    for (uint32_t i = 0; i < sz; ++i) {
        add(c, i);
    }
    // code omitted: for each starting node, verify that avl_offset
    // reaches every other rank and fails for out-of-range offsets
    dispose(c.root);
}
The rest of the code is considered trivial and is omitted from the
code listings.
CASES = r'''
$ ./client zscore asdf n1
(nil)
$ ./client zquery xxx 1 asdf 1 10
(arr) len=0
(arr) end
# more cases...
'''
import shlex
import subprocess
cmds = []
outputs = []
lines = CASES.splitlines()
for x in lines:
    x = x.strip()
    if not x:
        continue
    if x.startswith('$ '):
        cmds.append(x[2:])
        outputs.append('')
    else:
        outputs[-1] = outputs[-1] + x + '\n'
Exercises:
11_client.cpp
11_server.cpp
avl.cpp
avl.h
common.h
hashtable.cpp
hashtable.h
test_cmds.py
test_offset.cpp
zset.cpp
zset.h
12. The Event Loop and Timers
There is one major thing missing in our server: timeouts. Every
networked application needs to handle timeouts since the other side of
the network can just disappear. Not only do ongoing IO operations
like read/write need timeouts, but it is also a good idea to kick out idle
TCP connections. To implement timeouts, the event loop must be
modified since poll is the only blocking operation.
The problem is that we might have more than one timer; the timeout
value of poll should be the timeout value of the nearest timer, so some
data structure is needed for finding the nearest timer. The heap data
structure is a popular choice for finding the min/max value and is
often used for this purpose. Also, any data structure for sorting can
be used. For example, we can use the AVL tree to order timers and
possibly augment the tree to keep track of the minimum value.
Let's start by adding timers to kick out idle TCP connections. For each
connection there is a timer, set to a fixed timeout into the future;
every time there are IO activities on the connection, the timer is
renewed to a fixed timeout. Notice that when we renew a timer, it
becomes the most distant one; therefore, we can exploit this fact to
simplify the data structure: a simple linked list is sufficient to keep
the order of timers, since the new or updated timer simply goes to the
end of the list, and the list maintains sorted order. Also, operations
on linked lists are O(1), which is better than sorting data structures.
struct DList {
DList *prev = NULL;
DList *next = NULL;
};
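The DList is a circular doubly-linked list with a dummy head node; a
sketch of its operations, along the lines of the accompanying list.h:

inline void dlist_init(DList *node) {
    node->prev = node->next = node;     // a self-referencing dummy node
}

inline bool dlist_empty(DList *node) {
    return node->next == node;
}

inline void dlist_detach(DList *node) {
    DList *prev = node->prev;
    DList *next = node->next;
    prev->next = next;
    next->prev = prev;
}

// insert the rookie before the target; appending a renewed timer is
// dlist_insert_before(&g_data.idle_list, &conn->idle_list)
inline void dlist_insert_before(DList *target, DList *rookie) {
    DList *prev = target->prev;
    prev->next = rookie;
    rookie->prev = prev;
    rookie->next = target;
    target->prev = rookie;
}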
The next step is adding the list to the server and the connection struct.
// global variables
static struct {
HMap db;
// a map of all client connections, keyed by fd
std::vector<Conn *> fd2conn;
// timers for idle connections
DList idle_list;
} g_data;
struct Conn {
int fd = -1;
uint32_t state = 0; // either STATE_REQ or STATE_RES
// buffer for reading
size_t rbuf_size = 0;
uint8_t rbuf[4 + k_max_msg];
// buffer for writing
size_t wbuf_size = 0;
size_t wbuf_sent = 0;
uint8_t wbuf[4 + k_max_msg];
uint64_t idle_start = 0;
// timer
DList idle_list;
};
int main() {
    // some initializations
    dlist_init(&g_data.idle_list);
    // code omitted...
    while (true) {
        // code omitted...
        // handle timers
        process_timers();
        // try to accept a new connection if the listening fd is active
        if (poll_args[0].revents) {
            (void)accept_new_conn(fd);
        }
    }
    return 0;
}
The next_timer_ms function takes the first (nearest) timer from the
list and uses it to calculate the timeout value of poll.
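A sketch of next_timer_ms at this stage (get_monotonic_usec reads
CLOCK_MONOTONIC in microseconds; k_idle_timeout_ms is the fixed 5s
idle timeout):

const uint64_t k_idle_timeout_ms = 5 * 1000;

static uint32_t next_timer_ms() {
    if (dlist_empty(&g_data.idle_list)) {
        return 10000;   // no timer, the value doesn't matter
    }

    uint64_t now_us = get_monotonic_usec();
    Conn *next = container_of(g_data.idle_list.next, Conn, idle_list);
    uint64_t next_us = next->idle_start + k_idle_timeout_ms * 1000;
    if (next_us <= now_us) {
        return 0;   // missed? fire the timer immediately
    }
    return (uint32_t)((next_us - now_us) / 1000);
}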
At each iteration of the event loop, the list is checked in order to fire
timers in due time.
    // do the work
    if (conn->state == STATE_REQ) {
        state_req(conn);
    } else if (conn->state == STATE_RES) {
        state_res(conn);
    } else {
        assert(0);  // not expected
    }
}
Don’t forget to remove the connection from the list when done:
$ ./server
removing idle connection: 4
$ socat tcp:127.0.0.1:1234 -
The server should close the connection after around 5 seconds of
inactivity.
Exercises:
12_server.cpp
avl.cpp
avl.h
common.h
hashtable.cpp
hashtable.h
list.h
zset.cpp
zset.h
13. The Heap Data Structure and the TTL
The primary use of Redis is as a cache server, and one way to manage
the size of the cache is through explicitly setting TTLs (time to live).
TTLs can be implemented using timers. Unfortunately, the timers in the
last chapter are of a fixed value (using linked lists); thus, a sorting
data structure is needed for implementing arbitrary and mutable
timeouts, and the heap data structure is a popular choice. Compared
with the AVL tree we used before, the heap data structure has the
advantage of using less space.
1. A heap is a binary tree, packed into an array; and the layout of the
tree is fixed. The parent-child relationship is implicit, pointers are
not included in heap elements.
2. The only constraint on the tree is that parents are no bigger than
their kids.
3. The value of an element can be updated. If the value changes:
       If its value is bigger than before, it may be bigger than its
       kids; if so, swap it with the smallest kid so that the parent-
       child constraint is satisfied again. Now that one of the kids
       is bigger than before, continue this process until reaching a
       leaf.
       If its value is smaller, likewise, swap it with its parent
       until reaching the root.
4. New elements are added to the end of the array as leaves.
Maintain the constraint as above.
5. When removing an element from a heap, replace it with the last
element in the array, then maintain the constraint as if its value
was updated.
struct HeapItem {
uint64_t val = 0;
size_t *ref = NULL;
};
The heap is used to order the timestamps, and the Entry is mutually
linked with the timestamp. The heap_idx is the index of the
corresponding HeapItem, and the ref points to the Entry. We are
using the intrusive data structure again; the ref pointer points to the
heap_idx field.
Swap with the parent when a kid is smaller than its parent. Note the
heap_idx is updated through the ref pointer while swapping.
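A sketch of the upward pass (heap_down is symmetric, going downward;
heap_update dispatches to one of them), along the lines of the
accompanying heap.cpp:

static size_t heap_parent(size_t i) {
    return (i + 1) / 2 - 1;     // the array is a packed binary tree
}

static void heap_up(HeapItem *a, size_t pos) {
    HeapItem t = a[pos];
    while (pos > 0 && a[heap_parent(pos)].val > t.val) {
        // swap with the parent
        a[pos] = a[heap_parent(pos)];
        *a[pos].ref = pos;      // update heap_idx through the ref pointer
        pos = heap_parent(pos);
    }
    a[pos] = t;
    *a[pos].ref = pos;
}

void heap_update(HeapItem *a, size_t pos, size_t len) {
    if (pos > 0 && a[heap_parent(pos)].val > a[pos].val) {
        heap_up(a, pos);
    } else {
        heap_down(a, pos, len);
    }
}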
// global variables
static struct {
HMap db;
// a map of all client connections, keyed by fd
std::vector<Conn *> fd2conn;
// timers for idle connections
DList idle_list;
// timers for TTLs
std::vector<HeapItem> heap;
} g_data;
Timers are updated, added, and removed by modifying the heap array;
just call heap_update after updating an element of the array.
    uint64_t now_us = get_monotonic_usec();
    uint64_t next_us = (uint64_t)-1;
    // idle timers
    if (!dlist_empty(&g_data.idle_list)) {
        Conn *next = container_of(g_data.idle_list.next, Conn, idle_list);
        next_us = next->idle_start + k_idle_timeout_ms * 1000;
    }
    // ttl timers
    if (!g_data.heap.empty() && g_data.heap[0].val < next_us) {
        next_us = g_data.heap[0].val;
    }
    if (next_us == (uint64_t)-1) {
        return 10000;   // no timer, the value doesn't matter
    }
    // idle timers
    while (!dlist_empty(&g_data.idle_list)) {
        // code omitted...
    }

    // TTL timers
    const size_t k_max_works = 2000;
    size_t nworks = 0;
    while (!g_data.heap.empty() && g_data.heap[0].val < now_us) {
        Entry *ent = container_of(g_data.heap[0].ref, Entry, heap_idx);
        HNode *node = hm_pop(&g_data.db, &ent->node, &hnode_same);
        assert(node == &ent->node);
        entry_del(ent);
        if (nworks++ >= k_max_works) {
            // don't stall the server if too many keys are expiring at once
            break;
        }
    }
}
This is just checking the minimal value of the heap and removing
keys. Note that we put a limit on the number of keys expired per event
loop iteration; the limit is needed to prevent the server from stalling
should too many keys expire at once.
Entry key;
key.key.swap(cmd[1]);
key.node.hcode = str_hash((uint8_t *)key.key.data(), key.key.size());
Exercises:
13_server.cpp
avl.cpp
avl.h
common.h
hashtable.cpp
hashtable.h
heap.cpp
heap.h
list.h
test_heap.cpp
zset.cpp
zset.h
14. The Thread Pool & Asynchronous Tasks
There is a flaw in our server since the introduction of the sorted set
data type: the deletion of keys. If the size of a sorted set is huge, it
can take a long time to free its nodes, and the server is stalled during
the destruction of the key. This can be easily fixed by using
multi-threading to move the destruction away from the main thread.
struct Work {
void (*f)(void *) = NULL;
void *arg = NULL;
};
struct ThreadPool {
    std::vector<pthread_t> threads;
    std::deque<Work> queue;
    pthread_mutex_t mu;
    pthread_cond_t not_empty;
};
void thread_pool_init(ThreadPool *tp, size_t num_threads) {
    int rv = pthread_mutex_init(&tp->mu, NULL);
    assert(rv == 0);
    rv = pthread_cond_init(&tp->not_empty, NULL);
    assert(rv == 0);
    tp->threads.resize(num_threads);
    for (size_t i = 0; i < num_threads; ++i) {
        rv = pthread_create(&tp->threads[i], NULL, &worker, tp);
        assert(rv == 0);
    }
}
// do the work
w.f(w.arg);
}
return NULL;
}
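The fragment above is the tail of the consumer; a full sketch of the
worker thread:

static void *worker(void *arg) {
    ThreadPool *tp = (ThreadPool *)arg;
    while (true) {
        pthread_mutex_lock(&tp->mu);
        // wait for the condition: a non-empty queue
        while (tp->queue.empty()) {
            pthread_cond_wait(&tp->not_empty, &tp->mu);
        }

        // got the job
        Work w = tp->queue.front();
        tp->queue.pop_front();
        pthread_mutex_unlock(&tp->mu);

        // do the work
        w.f(w.arg);
    }
    return NULL;
}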
The producer code:
void thread_pool_queue(ThreadPool *tp, void (*f)(void *), void *arg) {
    Work w;
    w.f = f;
    w.arg = arg;

    pthread_mutex_lock(&tp->mu);
    tp->queue.push_back(w);
    pthread_cond_signal(&tp->not_empty);
    pthread_mutex_unlock(&tp->mu);
}
The explanation:
1. For both the producer and consumers, the queue access code is
surrounded by pthread_mutex_lock and pthread_mutex_unlock;
only one thread can access the queue at once.
2. After a consumer acquires the mutex, it checks the queue:
       If the queue is not empty, grab a job from the queue, release
       the mutex, and do the work.
       Otherwise, release the mutex and go to sleep; the sleeping
       thread can be woken up later via the condition variable. This
       is accomplished by a single pthread_cond_wait call.
3. After the producer puts a job into the queue, the producer calls
pthread_cond_signal to wake up a potentially sleeping
consumer.
4. After a consumer wakes up from pthread_cond_wait, the
mutex is held again automatically. The consumer must check the
condition again after waking up; if the condition (a non-empty
queue) is not satisfied, go back to sleep.
The use of the condition variable needs some more explanation: the
pthread_cond_wait function is always called inside a loop checking for
the condition. This is because the condition could be changed by other
consumers before the woken consumer grabs the mutex; the mutex
is not transferred from the signaler to the to-be-woken consumer! It is
probably a mistake if you see a condition variable used without a
loop.
// global variables
static struct {
HMap db;
// a map of all client connections, keyed by fd
std::vector<Conn *> fd2conn;
// timers for idle connections
DList idle_list;
// timers for TTLs
std::vector<HeapItem> heap;
// the thread pool
    ThreadPool tp;
} g_data;
// some initializations
dlist_init(&g_data.idle_list);
thread_pool_init(&g_data.tp, 4);
// dispose the entry after it got detached from the key space
static void entry_del(Entry *ent) {
    entry_set_ttl(ent, -1);
    // `too_big` is computed from the entry's container size (code omitted),
    // e.g. a sorted set holding a large number of elements
    if (too_big) {
        thread_pool_queue(&g_data.tp, &entry_del_async, ent);
    } else {
        entry_destroy(ent);
    }
}
Exercises:
14_server.cpp
avl.cpp
avl.h
common.h
hashtable.cpp
hashtable.h
heap.cpp
heap.h
list.h
thread_pool.cpp
thread_pool.h
zset.cpp
zset.h
A1: Hints to Exercises
08. Data Structure: Hashtables
Q: Our hashtable triggers resizing when the load factor is too high,
should we also shrink the hashtable when the load factor is too
low? Can the shrinking be performed automatically?
Hints:
Q: Another sorted set application: count the number of elements
within a range. (also with a worst-case of O(log(n)).)
Hints:
Q: The real Redis does not use sorting for expiration, find out how
it is done, and list the pros and cons of both approaches.
Hints: