Bigtable - A Distributed Storage System For Structured Data
Bigtable is a distributed storage system developed by Google to store structured data at scale. It is used by more than
sixty Google products, such as Personalized Search, Google Analytics, and Orkut, for their storage needs.
Although Bigtable serves as the storage system for many products, its interface differs from that of a typical database in
the sense that it does not provide a full relational data model.
Various components of Bigtable:
Data Model & API : Bigtable's data model resembles a multidimensional sorted map indexed by three attributes:
a) Row keys in Bigtable are arbitrary strings, and every read or write under a single row key is atomic, regardless of the
number of columns involved, which makes it easier for clients to reason about concurrent updates to the same row. Data is
maintained in lexicographic order by row key, and each table is dynamically partitioned into contiguous row ranges called tablets.
b) Column keys are grouped into sets called column families, which are the basic unit of access control. A column family
must be created before data can be stored under any column key in that family. Column keys are named using the syntax
family:qualifier
c) Timestamp: Each cell in Bigtable can hold multiple versions of the same data, indexed by timestamp and stored in
decreasing timestamp order, so that the most recent version is read first.
Putting the three indexes together, the Bigtable map has the signature (row:string, column:string, time:int64) → string.
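To make the shape of this map concrete, here is a small, purely illustrative Python sketch of the (row, column, timestamp) → value mapping. The class and method names are hypothetical (Bigtable's real client API is in C++), and the row and column values are loosely modeled on the paper's webtable example.

```python
# Toy in-memory model of the (row:string, column:string, time:int64) -> string map.
from collections import defaultdict

class ToyBigtableMap:
    def __init__(self):
        # (row, "family:qualifier") -> list of (timestamp, value), newest first
        self._cells = defaultdict(list)

    def write(self, row: str, column: str, timestamp: int, value: str) -> None:
        versions = self._cells[(row, column)]
        versions.append((timestamp, value))
        versions.sort(reverse=True)  # keep versions in decreasing timestamp order

    def read_latest(self, row: str, column: str):
        versions = self._cells.get((row, column))
        return versions[0][1] if versions else None

    def scan_rows(self, start_row: str, end_row: str):
        # Rows are kept in lexicographic order; a contiguous row range like
        # this one is what a tablet holds.
        for (row, column), versions in sorted(self._cells.items()):
            if start_row <= row < end_row:
                yield row, column, versions[0]

t = ToyBigtableMap()
t.write("com.cnn.www", "anchor:cnnsi.com", 9, "CNN")
t.write("com.cnn.www", "contents:", 5, "<html>...")
print(t.read_latest("com.cnn.www", "anchor:cnnsi.com"))  # -> CNN
```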
Bigtable provides an API for creating and deleting tables and column families, as well as for changing cluster, table, and
column-family metadata. The API also includes abstractions for reading and writing data. Additionally, the paper notes that
wrappers are built on top of Bigtable so that it can serve as both an input source and an output target for MapReduce,
enabling large-scale computations.
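Since the paper's API examples are written in C++, the following Python sketch only gestures at the shape of a single-row mutation abstraction; RowMutation and apply_mutation here are illustrative stand-ins, not Bigtable's actual API, and the example ties back to the per-row atomicity described above.

```python
class RowMutation:
    """Collects several changes to one row so they can be applied together."""
    def __init__(self, row: str):
        self.row = row
        self.ops = []

    def set(self, column: str, value: str) -> None:
        self.ops.append(("set", column, value))

    def delete(self, column: str) -> None:
        self.ops.append(("delete", column, None))

def apply_mutation(table: dict, mutation: RowMutation) -> None:
    # All operations in one RowMutation take effect together for that row,
    # mirroring the per-row atomicity guarantee described earlier.
    row_data = table.setdefault(mutation.row, {})
    for op, column, value in mutation.ops:
        if op == "set":
            row_data[column] = value
        else:
            row_data.pop(column, None)

table = {}
m = RowMutation("com.cnn.www")
m.set("anchor:www.c-span.org", "CNN")
m.delete("anchor:www.abc.com")
apply_mutation(table, m)
```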
Technologies that power Bigtable: Bigtable leverages several core Google technologies, including GFS for log and data
storage, and Google's cluster management system for resource management and monitoring. It uses the SSTable file
format for storage and relies on Chubby, a distributed locking service based on Paxos, for leader election, tablet server
discovery, and failure detection. Chubby is critical to Bigtable's availability: if Chubby becomes unavailable for an
extended period, Bigtable becomes unavailable as well.
Underlying implementation :
Bigtable has three major components: a client library that is linked into every client, a single master server, and many
tablet servers.
Tablet servers can be added or removed dynamically to accommodate changes in workload. The master assigns tablets
to tablet servers, detects the addition and expiration of tablet servers, balances tablet-server load, garbage-collects files
in GFS, and handles schema changes such as table and column family creation. Each tablet server manages a set of
tablets, handles read and write requests to the tablets it has loaded, and splits tablets that have grown too large. Clients
communicate directly with tablet servers for reads and writes and do not rely on the master for tablet location information,
which keeps the master lightly loaded and highly available. A Bigtable cluster stores a number of tables; each table starts
as a single tablet and splits into more tablets as the data grows.
Tablet Location: Bigtable uses a three-level hierarchy to store tablet location information. The three levels are:
1) The first level is a file stored in Chubby that contains the location of the root tablet.
2) The root tablet contains the locations of all tablets in a special METADATA table. The root tablet is the first tablet in the
METADATA table and is treated specially: it is never split, regardless of size, which guarantees that the tablet location
hierarchy never has more than three levels.
3) Each METADATA tablet contains the locations of a set of user tablets.
The client library caches tablet locations. If a client does not know a tablet's location, or discovers that its cached location
is incorrect, it moves back up the hierarchy one level at a time until it finds the correct location, so it rarely needs to re-read
all three levels. Stale cache entries are discovered only upon a miss, i.e., when a request sent to the cached location fails
to find the tablet.
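The Python sketch below walks through this lookup under heavy simplifications: plain dictionaries stand in for the Chubby file, the root tablet, and the METADATA tablets, and all server and tablet names are made up for illustration.

```python
CHUBBY_FILE = "root-tablet @ tabletserver-A"            # level 1: Chubby file
ROOT_TABLET = {"metadata-tablet-1": "tabletserver-B"}   # level 2: root tablet
METADATA_TABLETS = {                                    # level 3: METADATA tablets
    "metadata-tablet-1": {"usertable/rows[a-m)": "tabletserver-C"},
}

class TabletLocator:
    def __init__(self):
        self._cache = {}  # user tablet -> tablet server location

    def locate(self, user_tablet: str) -> str:
        # Fast path: a cached location. A stale entry is noticed only when a
        # request to that server misses; the client then invalidates it and
        # walks back up the hierarchy.
        if user_tablet in self._cache:
            return self._cache[user_tablet]
        root_server = CHUBBY_FILE.split(" @ ")[1]   # level 1: read the Chubby file
        for meta_tablet in ROOT_TABLET:             # level 2: root tablet (on root_server)
            server = METADATA_TABLETS[meta_tablet].get(user_tablet)  # level 3
            if server is not None:
                self._cache[user_tablet] = server
                return server
        raise KeyError(user_tablet)

    def invalidate(self, user_tablet: str) -> None:
        # Called after a request to a cached location misses.
        self._cache.pop(user_tablet, None)

locator = TabletLocator()
print(locator.locate("usertable/rows[a-m)"))   # -> tabletserver-C
```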
Tablet Assignment : The master keeps track of the set of live tablet servers, of which tablets are assigned to which
servers, and of which tablets are currently unassigned. Chubby is used to track tablet servers: when a tablet server starts,
it creates and acquires an exclusive lock on a uniquely named file in a specific Chubby directory. The master monitors this
directory to discover newly added tablet servers. A tablet server can serve its assigned tablets only as long as it holds its
lock in that Chubby directory.
The master periodically asks each tablet server for the status of its lock. If a tablet server reports that it has lost its lock, or
if the master cannot reach the server after several attempts, the master tries to acquire an exclusive lock on the server's
Chubby file. If it succeeds, Chubby is alive and the problem lies with the tablet server, so the master deletes the server's
file to ensure the server can never serve its tablets again if it comes back online. Once the file is deleted, the master
moves the tablets previously assigned to that server into the set of unassigned tablets.
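A rough sketch of this failure-detection step follows, using a tiny in-memory stand-in for Chubby; the lock operations shown are assumptions made for illustration and do not mirror the real Chubby client API.

```python
class FakeChubby:
    """In-memory stand-in for Chubby's lock files (illustrative only)."""
    def __init__(self):
        self.locks = {}   # file path -> current lock holder

    def try_acquire(self, path: str, owner: str) -> bool:
        if path in self.locks and self.locks[path] != owner:
            return False      # someone else still holds the lock
        self.locks[path] = owner
        return True

    def delete(self, path: str) -> None:
        self.locks.pop(path, None)

def handle_unresponsive_server(chubby, server_file, assignments, unassigned):
    """Called when a tablet server reports losing its lock or stops responding."""
    # If the master cannot take the lock, Chubby itself may be the problem,
    # so nothing can safely be concluded about the tablet server.
    if not chubby.try_acquire(server_file, owner="master"):
        return
    # Chubby is reachable and the lock was free, so the tablet server is at
    # fault. Delete its file so it can never reacquire the lock and serve
    # stale data, then mark its tablets as unassigned.
    chubby.delete(server_file)
    unassigned.extend(assignments.pop(server_file, []))

chubby = FakeChubby()
assignments = {"servers/ts-A": ["tablet-1", "tablet-2"]}
unassigned = []
chubby.locks["servers/ts-A"] = "ts-A"                    # server still holds its lock
handle_unresponsive_server(chubby, "servers/ts-A", assignments, unassigned)
print(unassigned)                                        # -> []
del chubby.locks["servers/ts-A"]                         # server's Chubby session expires
handle_unresponsive_server(chubby, "servers/ts-A", assignments, unassigned)
print(unassigned)                                        # -> ['tablet-1', 'tablet-2']
```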
If the master loses its Chubby session, it kills itself so a new master can take over. When a new master starts, it first
acquires the unique master lock in Chubby, which prevents multiple masters from running simultaneously. It then scans
the servers directory in Chubby to find live tablet servers and contacts each one to learn which tablets it already serves.
Finally, it scans the METADATA table to learn the full set of tablets; any tablet not already assigned to a server is added
to the set of unassigned tablets, making it eligible for future assignment.
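The condensed sketch below illustrates this startup scan, again with plain in-memory stand-ins for the Chubby servers directory, the tablet servers, and the METADATA table; all names are invented for the example.

```python
def bootstrap_master(chubby_servers_dir, ask_server_for_tablets, metadata_tablets):
    """Returns (assignments, unassigned) after a new master takes over."""
    assignments = {}
    assigned = set()
    # 1) Discover live tablet servers and ask each which tablets it serves.
    for server in chubby_servers_dir:
        tablets = ask_server_for_tablets(server)
        assignments[server] = tablets
        assigned.update(tablets)
    # 2) Scan the METADATA table for the full set of tablets; anything not
    #    already assigned becomes eligible for assignment.
    unassigned = [t for t in metadata_tablets if t not in assigned]
    return assignments, unassigned

# Example with toy data:
servers = ["tabletserver-A", "tabletserver-B"]
live_tablets = {"tabletserver-A": ["t1", "t2"], "tabletserver-B": ["t3"]}
assignments, unassigned = bootstrap_master(
    servers, lambda s: live_tablets[s], metadata_tablets=["t1", "t2", "t3", "t4"])
print(unassigned)   # -> ['t4']
```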
The set of tablets changes only when a tablet is created, deleted, merged, or split. The master initiates all of these except
splits, so it can keep track of the set of tablets. Splits are initiated by a tablet server, which notifies the master once the
split has been committed. If that notification is lost (because either the master or the tablet server failed), the master still
learns of the new tablet lazily: when it later asks a tablet server to load the now-split tablet, the server notices that the
METADATA entry covers only part of it and reports the split to the master.
Tablet Serving : The persistent state of a tablet is stored in GFS: updates are committed to a commit log that stores redo
records, following a typical LSM-tree architecture. Recent updates are kept in a sorted in-memory buffer called a
memtable; once the memtable reaches a threshold size, it is frozen and written to GFS as an immutable file known as an
SSTable. To recover a tablet, a tablet server reads its metadata from the METADATA table, which lists the SSTables that
make up the tablet along with a set of redo points into the commit log. The server reads the SSTable indices into memory
and reconstructs the memtable by replaying the updates committed since the redo points.
To process a write, the tablet server first checks that the request is well-formed and that the sender is authorized, then
appends a redo record to the commit log (using group commit to improve throughput for many small mutations), and finally
inserts the mutation into the memtable. Reads go through the same well-formedness and authorization checks; a valid
read is then executed over a merged view of the memtable and the SSTables. Because both are lexicographically sorted,
this merged view can be formed efficiently.
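A compact, toy-sized sketch of this write and read path follows; the commit log, memtable, and SSTable stand-ins are plain Python structures, and the flush threshold and names are arbitrary choices for illustration.

```python
class ToyTablet:
    def __init__(self, memtable_limit=4):
        self.commit_log = []      # redo records (stand-in for the GFS commit log)
        self.memtable = {}        # recent writes (a real memtable is kept sorted)
        self.sstables = []        # immutable, sorted SSTable stand-ins, newest first
        self.memtable_limit = memtable_limit

    def write(self, key, value):
        self.commit_log.append((key, value))   # 1) append a redo record
        self.memtable[key] = value             # 2) insert into the memtable
        if len(self.memtable) >= self.memtable_limit:
            self._minor_compaction()

    def _minor_compaction(self):
        # Freeze the memtable into an immutable, sorted SSTable-like object.
        frozen = dict(sorted(self.memtable.items()))
        self.sstables.insert(0, frozen)
        self.memtable = {}

    def read(self, key):
        # Reads see a merged view: the memtable first, then SSTables newest-first.
        if key in self.memtable:
            return self.memtable[key]
        for sstable in self.sstables:
            if key in sstable:
                return sstable[key]
        return None
```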
Compactions: Every memtable flush writes out a new SSTable, so over time a tablet accumulates many of them. This
creates two problems: a read may have to consult many SSTables, increasing latency, and recovering a tablet becomes
slower because there are more files to process. To keep the number of SSTables bounded, a merging compaction
periodically reads the contents of a few SSTables and the memtable and writes them out as a single new SSTable; the
input SSTables can be discarded as soon as the compaction finishes, keeping both reads and recovery fast.
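Continuing the toy sketch above, a merging compaction might look like the following, where the newest value for each key wins and a single sorted output replaces the inputs.

```python
def merging_compaction(sstables_newest_first):
    merged = {}
    # Iterate oldest to newest so newer values overwrite older ones.
    for sstable in reversed(sstables_newest_first):
        merged.update(sstable)
    return dict(sorted(merged.items()))   # one new sorted SSTable

old = [{"b": "2", "c": "3"}, {"a": "1", "b": "old"}]   # newest first
print(merging_compaction(old))   # -> {'a': '1', 'b': '2', 'c': '3'}
```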
Improvements in the original design
1) Locality groups: Clients can group column families that are typically accessed together into a locality group. A separate
SSTable is generated for each locality group in each tablet, so segregating column families that are not usually read
together results in more efficient reads. In addition, a locality group can be declared in-memory; its SSTables are then
lazily loaded into the tablet server's memory, after which reads in that group avoid disk access entirely. This is ideal for
small pieces of data that are read frequently.
2) Compression: Clients control whether the SSTables for a locality group are compressed and, if so, which compression
format is used. Skipping compression for a group can reduce read latency, since its SSTable blocks do not need to be
decompressed when they are read into memory.
3) Caching for read performance: Tablet servers use two levels of caching. The higher-level Scan Cache stores the
key-value pairs returned by the SSTable interface and helps applications that read the same data repeatedly. The
lower-level Block Cache stores SSTable blocks read from GFS and helps applications that read data close to data they
recently read, such as sequential scans over a range of keys.
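A minimal sketch of such a two-level cache follows; the LRU policy, capacities, and function names are assumptions made purely for illustration.

```python
from collections import OrderedDict

class LRUCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.items = OrderedDict()

    def get(self, key):
        if key not in self.items:
            return None
        self.items.move_to_end(key)
        return self.items[key]

    def put(self, key, value):
        self.items[key] = value
        self.items.move_to_end(key)
        if len(self.items) > self.capacity:
            self.items.popitem(last=False)

scan_cache = LRUCache(capacity=1024)    # higher level: key -> value pairs
block_cache = LRUCache(capacity=256)    # lower level: SSTable blocks

def read(key, load_block, block_for_key):
    value = scan_cache.get(key)
    if value is not None:
        return value                     # repeated reads of the same key
    block_id = block_for_key(key)
    block = block_cache.get(block_id)
    if block is None:
        block = load_block(block_id)     # stand-in for reading a block from GFS
        block_cache.put(block_id, block) # nearby keys will now hit this block
    value = block[key]
    scan_cache.put(key, value)
    return value

blocks = {"block-0": {"rowA": "1", "rowB": "2"}}
print(read("rowA", load_block=blocks.get, block_for_key=lambda k: "block-0"))  # GFS read
print(read("rowB", load_block=blocks.get, block_for_key=lambda k: "block-0"))  # Block Cache hit
```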
4) Bloom filters: A read may have to consult every SSTable that makes up a tablet, so a lookup for a row or column that
does not exist can trigger many wasted disk reads. To avoid this, clients can request that Bloom filters be created for the
SSTables in a particular locality group. A Bloom filter lets the server ask whether an SSTable might contain any data for a
given row/column pair, so most lookups for non-existent rows or columns never touch disk.
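The small Bloom-filter sketch below shows why a lookup for an absent row/column pair can skip an SSTable entirely; the bit-array size and hashing scheme are arbitrary choices, not Bigtable's.

```python
import hashlib

class BloomFilter:
    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0

    def _positions(self, key: str):
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).hexdigest()
            yield int(digest, 16) % self.num_bits

    def add(self, key: str) -> None:
        for pos in self._positions(key):
            self.bits |= (1 << pos)

    def might_contain(self, key: str) -> bool:
        # False means the SSTable definitely has no data for this key; True
        # means it might (false positives are possible, false negatives are not).
        return all(self.bits & (1 << pos) for pos in self._positions(key))

f = BloomFilter()
f.add("com.cnn.www/anchor:cnnsi.com")
print(f.might_contain("com.cnn.www/anchor:cnnsi.com"))  # True
print(f.might_contain("com.example/contents:"))         # almost certainly False
```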
5) Commit-log implementation: Keeping a separate commit log per tablet would cause a very large number of files to be
written concurrently to GFS, resulting in many disk seeks. Instead, all mutations from a tablet server's tablets are
appended to a single commit log. This complicates recovery, because the log of a failed server mixes mutations for many
tablets that are being reassigned to many different servers. To avoid having every recovering server read the full log, the
log entries are first sorted by the key (table, row name, log sequence number), which makes the mutations for any
particular tablet contiguous and allows them to be read efficiently with one seek followed by a sequential read.
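A tiny illustration of the sorting step appears below; the record format is a made-up stand-in for the real log entries.

```python
log = [
    ("tableA", "row9", 3, "mutation-3"),
    ("tableB", "row1", 1, "mutation-1"),
    ("tableA", "row2", 2, "mutation-2"),
    ("tableA", "row2", 4, "mutation-4"),
]

# Sorting by (table, row name, log sequence number) groups each tablet's
# mutations together while preserving their original order within a row.
for table, row, seq, mutation in sorted(log):
    print(table, row, seq, mutation)
```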
6) Exploiting immutability: The fact that SSTables are immutable simplifies the system considerably. Reads from
SSTables require no synchronization, since the files never change once written. The only mutable data structure
accessed by both reads and writes is the memtable; to reduce contention there, each memtable row is copy-on-write,
which allows reads and writes to proceed in parallel. Immutability also helps during tablet splits: instead of generating a
new set of SSTables for each child tablet, the children simply share the SSTables of the parent tablet.