leveldb Skiplist & MemTable
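This post skips the basic theory of skip lists and focuses on how leveldb implements its SkipList.
What leveldb requires from SkipList
- Lock-free concurrent reads
- Inserts that never run concurrently with each other and rely on external locking; an inserted key is never a duplicate
- No delete operation: when leveldb wants to delete a key, it inserts an entry tagged kTypeDeletion (a tombstone) instead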
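To make the tombstone requirement concrete, here is a minimal sketch of a delete-by-insert table. ToyMemTable, EntryKey and Entry are hypothetical names invented for this illustration, and std::map stands in for the skiplist; only the kTypeDeletion / kTypeValue tags come from leveldb. Each write gets a fresh sequence number, so no two inserted keys ever compare equal.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>

// Tags with the same names and values as leveldb's ValueType enum.
enum ValueType : uint8_t { kTypeDeletion = 0x0, kTypeValue = 0x1 };

// Hypothetical composite key: user key ascending, then sequence number
// descending, so the newest entry for a user key sorts first.
struct EntryKey {
  std::string user_key;
  uint64_t sequence;
  bool operator<(const EntryKey& other) const {
    if (user_key != other.user_key) return user_key < other.user_key;
    return sequence > other.sequence;  // newer entries sort first
  }
};

struct Entry {
  ValueType type;
  std::string value;
};

// Hypothetical stand-in for the memtable (assumption: std::map, not a skiplist).
class ToyMemTable {
 public:
  void Put(const std::string& key, const std::string& value) {
    table_.emplace(EntryKey{key, next_sequence_++}, Entry{kTypeValue, value});
  }
  // Deleting never removes anything: it inserts a tombstone entry.
  void Delete(const std::string& key) {
    table_.emplace(EntryKey{key, next_sequence_++}, Entry{kTypeDeletion, ""});
  }
  // Returns true and fills *value if the newest entry for key is a live value.
  bool Get(const std::string& key, std::string* value) const {
    // The largest sequence number sorts before every real entry of this key.
    auto it = table_.lower_bound(EntryKey{key, UINT64_MAX});
    if (it == table_.end() || it->first.user_key != key) return false;
    if (it->second.type == kTypeDeletion) return false;  // hidden by tombstone
    *value = it->second.value;
    return true;
  }
  std::size_t EntryCount() const { return table_.size(); }

 private:
  std::map<EntryKey, Entry> table_;
  uint64_t next_sequence_ = 1;
};
```

After a Put followed by a Delete of the same key, the table holds two entries, and Get reports the key as missing because the tombstone is the newest entry.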
Definition of SkipList
Note that this is a class template: Key is the type of the stored data, and Comparator is used to order Keys.
template <typename Key, class Comparator>
class SkipList {
 private:
  struct Node;
 public:
  // Create a new SkipList object that will use "cmp" for comparing keys,
  // and will allocate memory using "*arena". Objects allocated in the arena
  // must remain allocated for the lifetime of the skiplist object.
  explicit SkipList(Comparator cmp, Arena* arena);
  SkipList(const SkipList&) = delete;
  SkipList& operator=(const SkipList&) = delete;
  // Insert key into the list.
  // REQUIRES: nothing that compares equal to key is currently in the list.
  void Insert(const Key& key);
  // Returns true iff an entry that compares equal to key is in the list.
  bool Contains(const Key& key) const;
  // Iteration over the contents of a skip list
  class Iterator;
 private:
  enum { kMaxHeight = 12 };
  inline int GetMaxHeight() const {
    return max_height_.load(std::memory_order_relaxed);
  }
  Node* NewNode(const Key& key, int height);
  int RandomHeight();
  bool Equal(const Key& a, const Key& b) const { return (compare_(a, b) == 0); }
  // Return true if key is greater than the data stored in "n"
  bool KeyIsAfterNode(const Key& key, Node* n) const;
  // Return the earliest node that comes at or after key.
  // Return nullptr if there is no such node.
  //
  // If prev is non-null, fills prev[level] with pointer to previous
  // node at "level" for every level in [0..max_height_-1].
  Node* FindGreaterOrEqual(const Key& key, Node** prev) const;
  // Return the latest node with a key < key.
  // Return head_ if there is no such node.
  Node* FindLessThan(const Key& key) const;
  // Return the last node in the list.
  // Return head_ if list is empty.
  Node* FindLast() const;
  // Immutable after construction
  Comparator const compare_;
  Arena* const arena_;  // Arena used for allocations of nodes
  Node* const head_;
  // Modified only by Insert(). Read racily by readers, but stale
  // values are ok.
  std::atomic<int> max_height_;  // Height of the entire list
  // Read/written only by Insert().
  Random rnd_;
};
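Here arena_ is the skiplist's memory pool, used to allocate memory for nodes; rnd_ is the random number generator that implements the skiplist's "coin flips"; max_height_ records the current maximum height; compare_ compares keys; and head_ is the head node.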
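Since Equal tests compare_(a, b) == 0, the Comparator is expected to be a callable that returns a three-way result rather than a bool. A minimal sketch of such a functor follows; the uint64_t key type and the names U64Comparator and KeysEqual are assumptions made for this illustration, not types taken from the memtable.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical key type, chosen only for illustration.
typedef uint64_t Key;

// Three-way comparator: negative if a < b, zero if equal, positive if a > b.
struct U64Comparator {
  int operator()(const Key& a, const Key& b) const {
    if (a < b) return -1;
    if (a > b) return +1;
    return 0;
  }
};

// Mirrors how SkipList::Equal consumes the comparator result.
template <typename K, class Comparator>
bool KeysEqual(const Comparator& cmp, const K& a, const K& b) {
  return cmp(a, b) == 0;
}
```

A bool-returning "less than" functor, as used by std::map, would not fit here, because the skiplist needs to distinguish "equal" from "greater" with a single call.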
Let's start with the constructor.
template <typename Key, class Comparator>
SkipList<Key, Comparator>::SkipList(Comparator cmp, Arena* arena)
    : compare_(cmp),
      arena_(arena),
      head_(NewNode(0 /* any key will do */, kMaxHeight)),
      max_height_(1),
      rnd_(0xdeadbeef) {
  for (int i = 0; i < kMaxHeight; i++) {
    head_->SetNext(i, nullptr);
  }
}
max_height_ is initialized to 1. The head node head_ is created with key 0 and height kMaxHeight, and its successor at every level is set to nullptr. Here enum { kMaxHeight = 12 }; corresponds to MaxLevel in the skip list paper.
Node & NewNode
template <typename Key, class Comparator>
struct SkipList<Key, Comparator>::Node {
  explicit Node(const Key& k) : key(k) {}
  Key const key;
  // Accessors/mutators for links. Wrapped in methods so we can
  // add the appropriate barriers as necessary.
  Node* Next(int n);
  void SetNext(int n, Node* x);
  // No-barrier variants that can be safely used in a few locations.
  Node* NoBarrier_Next(int n);
  void NoBarrier_SetNext(int n, Node* x);
 private:
  // Array of length equal to the node height. next_[0] is lowest level link.
  std::atomic<Node*> next_[1];
};
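The "appropriate barriers" in the comment above are what make lock-free reads safe. The sketch below shows the publish pattern on a single-level list, assuming Next is an acquire load, SetNext is a release store, and the NoBarrier_ variants are relaxed accesses. ListNode, InsertAfter and SumKeys are hypothetical names for this illustration.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical single-level node; a real skiplist node has one link per level.
struct ListNode {
  explicit ListNode(int k) : key(k), next(nullptr) {}
  int const key;
  std::atomic<ListNode*> next;

  // Acquire load: a reader that sees the pointer also sees the node it points to.
  ListNode* Next() { return next.load(std::memory_order_acquire); }
  // Release store: everything written before this becomes visible with the pointer.
  void SetNext(ListNode* x) { next.store(x, std::memory_order_release); }
  // Relaxed variants, safe only where no reader can observe the access yet
  // or where the caller is the single, externally synchronized writer.
  ListNode* NoBarrier_Next() { return next.load(std::memory_order_relaxed); }
  void NoBarrier_SetNext(ListNode* x) { next.store(x, std::memory_order_relaxed); }
};

// Writer side (assumed to be externally synchronized): link node after prev.
void InsertAfter(ListNode* prev, ListNode* node) {
  // The new node is not reachable yet, so a relaxed store is enough here.
  node->NoBarrier_SetNext(prev->NoBarrier_Next());
  // Publishing the node is the one store that needs release semantics.
  prev->SetNext(node);
}

// Reader side: walks the list without taking any lock.
int SumKeys(ListNode* head) {
  int sum = 0;
  for (ListNode* n = head->Next(); n != nullptr; n = n->Next()) {
    sum += n->key;
  }
  return sum;
}
```

The ordering inside InsertAfter is the point: the new node's own link is filled in first, and only then is the node made reachable, so a concurrent reader either misses the node entirely or sees it fully initialized.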
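Every Node object is built by NewNode: it first allocates memory from arena_, then invokes the Node constructor on that memory via placement new. Node already contains one link (next_[1]), so only height - 1 additional links have to be allocated.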
template <typename Key, class Comparator>
typename SkipList<Key, Comparator>::Node* SkipList<Key, Comparator>::NewNode(
    const Key& key, int height) {
  char* const node_memory = arena_->AllocateAligned(
      sizeof(Node) + sizeof(std::atomic<Node*>) * (height - 1));
  return new (node_memory) Node(key);
}
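The size computation is easier to follow in isolation. The following sketch reproduces the same "trailing array plus placement new" layout with std::malloc standing in for arena_->AllocateAligned; VarNode, NodeBytes and NewVarNode are hypothetical names. Like the original, it indexes past the declared bound of next_, an idiom that works on mainstream compilers but is not sanctioned by the C++ standard.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <new>

// Hypothetical node with an int key; links for levels >= 1 live in the
// extra memory allocated directly behind the struct.
struct VarNode {
  explicit VarNode(int k) : key(k) {}
  int const key;
  std::atomic<VarNode*>& Link(int level) { return next_[level]; }

 private:
  // Declared with length 1; the real length equals the node height.
  std::atomic<VarNode*> next_[1];
};

// One link is already part of sizeof(VarNode), so add height - 1 more.
std::size_t NodeBytes(int height) {
  return sizeof(VarNode) + sizeof(std::atomic<VarNode*>) * (height - 1);
}

// Assumption: malloc replaces the arena; the caller releases with std::free.
VarNode* NewVarNode(int key, int height) {
  void* memory = std::malloc(NodeBytes(height));
  if (memory == nullptr) return nullptr;
  VarNode* node = new (memory) VarNode(key);  // placement new, no allocation
  for (int i = 0; i < height; i++) {
    node->Link(i).store(nullptr, std::memory_order_relaxed);
  }
  return node;
}
```

With an arena there is no per-node free: all nodes are released together when the arena is destroyed, which fits a skiplist that never deletes nodes.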
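The figure below shows the structure of leveldb's skiplist at some moment, assuming kMaxHeight = 4 (it is actually 12); the yellow cells are the keys stored in each Node.
It also reflects what the SkipList constructor does: head_ is created with height kMaxHeight, and its successor at every level is set to nullptr.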
Insert & Contains
Insert
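Insert calls must not run concurrently with each other (which is why the TODO in the comment below exists); the caller has to synchronize them externally.
The same key must not be inserted more than once.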
template <typename Key, class Comparator>
void SkipList<Key, Comparator>::Insert(const Key& key) {
  // TODO(opt): We can use a barrier-free variant of FindGreaterOrEqual()
  // here since Insert() is externally synchronized.
  // prev records, for each level, the last node smaller than key,
  // i.e. the predecessor of the node about to be inserted.
  Node* prev[kMaxHeight];
  Node* x = FindGreaterOrEqual(key, prev);
  // Our data structure does not allow duplicate insertion
  assert(x == nullptr || !Equal(key, x->key));
  int height = RandomHeight();  // pick the node height at random
  if (height > GetMaxHeight()) {
    // If the new node is taller than every existing node, fill the higher
    // levels of prev with head_ (the new node is now the tallest, so above
    // the old maximum height its only possible predecessor is head_),
    // and update max_height_.
    for (int i = GetMaxHeight(); i < height; i++) {
      prev[i] = head_;
    }
    // It is ok to mutate max_height_ without any synchronization
    // with concurrent readers. A concurrent reader that observes
    // the new value of max_height_ will see either the old value of
    // new level pointers from head_ (nullptr), or a new value set in
    // the loop below. In the former case the reader will
    // immediately drop to the next level since nullptr sorts after all
    // keys. In the latter case the reader will use the new node.
max_height_.store(height