Hypertable
● Hypertable is an open source, high performance, distributed database modeled after Google’s Bigtable*. It
differs from traditional relational database technology in that the emphasis is on scalability as opposed to
transaction support and table joining. Tables in Hypertable are sorted by a single primary key. However, tables
can smoothly and cost-effectively scale to petabytes in size by leveraging a large cluster of commodity
hardware.
● Hypertable is designed to run on top of an existing distributed file system such as the Hadoop DFS,
GlusterFS, or the Kosmos File System (KFS). One of the top design objectives for this project has been
optimum performance. To that end, the system is written almost entirely in C++, which differentiates it from
other Bigtable-like efforts, such as HBase. We expect Hypertable to replace MySQL for much of Web 2.0
backend technology.
● Hypertable is a high performance distributed data storage system designed to support applications requiring
maximum performance, scalability, and reliability. Hypertable will be particularly invaluable to any
organization that needs to manage rapidly evolving data to support demanding real-time applications.
Modeled after Google’s well known Bigtable project, Hypertable is designed to manage the storage and
processing of information on a large cluster of commodity servers, providing resilience to machine and
component failures. Hypertable seeks to set the open source standard for highly available, petabyte scale,
database systems. ”.The Hypertable Lead Developer spoke on "Architecting Hypertable-a massively parallel
high performance database".
Hypertable Overview
Hypertable is an Open Source database system designed to deal with the massive scale of data that is found in web
applications such as processing the data returned by web crawlers as they crawl the entire internet. It is also designed
to run on the massive commodity computer farms, which can consist of thousands of systems, that are employed to
process such data. In particular, Hypertable is designed so that its performance scales with the number of computers used and so that it handles the unreliability problems that inevitably ensue from using large computer arrays.
From a user perspective, the data model has a database that contains tables. Each table consists of a set of rows. Each
row has a primary key value and a set of columns. Each column contains a set of key value pairs commonly known as
a map. A timestamp is associated with each key value pair. The number of columns in a table is limited to 256; otherwise there are no tight constraints on the size of keys or values. The only query method is a table scan. Tables are
stored in primary key order, so a query easily accesses a row or group of rows by constraining on the row key. The
query can specify which columns are returned, and the time range for key value pairs in each column.
Hypertable is neither relational nor transactional. Its purpose is to store vast amounts of structured data and make
that data easily available. For example, while Hypertable does have logging to ensure that information does not
get lost, it does not support transactions whose purpose is to make sure that multiple related changes either all
happen together or none of them happen. Interestingly, many database systems switch off transactional behavior
for large bulk loads. There is no mechanism for combining data from different tables as tables are expected to be
so large that there is little point in trying to combine them.
The current status is that Hypertable is in alpha release. The code is there and works, as Doug showed us in a demonstration. However, it relies on a distributed file system such as Hadoop to store its data, and in addition to their own ongoing development, the team is waiting for Hadoop to implement a consistency feature before declaring a beta. Even then there are a number of places with a single point of failure, so there is plenty of work left to make it a complete and resilient system.
System Summary
Initial releases of Hypertable provide a C++ API and an HQL (Hypertable Query Language, very similar to SQL)
interpreter for client access to the data store. Hypertable is not meant as a direct replacement for traditional
Database Management Systems such as MySQL or Oracle DB, but rather an alternative for storing and processing
very large datasets. Traditional RDBMSs are transaction-oriented and offer many advanced features for dealing with
well structured data. Hypertable trades off features like joins and advanced querying for massive scalability and
higher throughput. Where row-oriented systems (like MySQL) are in general better for workloads with a lot of
write operations, column-oriented systems (like Hypertable) are better for "read-mostly" situations.
Hypertable is built on top of a Distributed File System (DFS). There are currently a number of DFS projects, but
Hypertable is designed to work with any of them. A DFS lets multiple physical machines act as a single virtual disk
and generally includes transparent replication and fault tolerance. These systems are designed to scale to very large
clusters of machines, allowing massive amounts of data storage and very fast read-write access. Hypertable layers
on top of a DFS to provide a very high availability, very fast and very scalable database to hold structured or
unstructured data.
In addition, large performance increases can be realized by performing computations in parallel, pushing them to
where the data is physically stored. The parallel structure of the system allows large sets of data to be worked on
very quickly.
1. Hypertable is memory (malloc) intensive. Hypertable caches all updates in an in-memory data structure (e.g. an STL map*). Periodically, these in-memory data structures get spilled to disk, and the spilled disk files get merged together into larger files when their number reaches a certain threshold. The performance of the system is, in large part, dictated by how much memory it has available to it. Less memory means more spilling and merging, which increases load on the network and the underlying DFS. It also increases the CPU work required of the system, in the form of extra heap-merge operations (a sketch of this cache-and-spill pattern appears after this list). Java is a poor choice for memory hungry applications. In particular, in managing a large in-memory map of key/value pairs, Java's memory performance is poor in comparison with C++. It's on the order of two to three times worse (if you don't believe me, try it).
2. Hypertable is CPU intensive in several places. The first is the in-memory maps of key/value pairs. Traversing and managing those maps can consume a lot of CPU. Plus, given Java's inefficient use of memory with regard to these maps, the processor caches become much
less effective. A recent run of the tool Calibrator (http://monetdb.cwi.nl/Calibrator/) on one of our 2GHz Opterons yields the following statistics:
caches:
level  size    linesize   miss-latency       replace-time
1      64 KB   64 bytes   6.06 ns  =  12 cy  5.60 ns  =  11 cy
2      768 KB  128 bytes  74.26 ns = 149 cy  75.90 ns = 152 cy
You can pack a fair amount of work into 150 clock cycles. Another place where Hypertable is CPU intensive is compression. All of the data that is inserted into a Hypertable gets compressed at least twice, and on average three times: once when writing data to the commit log, once during minor compaction, and then once for every merging or major compaction. The amount of decompression that happens can be considerably more, depending on the query workload the table sees. It's arguable that the native Java implementation of zlib* is comparable to the C implementation, but as soon as you start experimenting with different compression techniques (e.g. Bentley-McIlroy long common strings), you need to either implement them in Java, which yields unacceptable performance (if you don't believe me, try it), or implement them in C/C++ and use JNI (the Java Native Interface). With this second option, all of the benefits of Java get thrown out the window, and there is significant overhead in invoking a method via JNI.
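To make point 1 concrete, here is a minimal C++ sketch of the cache-and-spill pattern it describes. The class name, the threshold handling, and the on-disk format are all illustrative assumptions, not Hypertable's actual code:

#include <cstddef>
#include <fstream>
#include <map>
#include <string>

// Buffer updates in a sorted in-memory map and spill the map to a file once
// it exceeds a memory threshold (illustrative sketch only).
class CellCache {
public:
  explicit CellCache(std::size_t limit_bytes) : limit_(limit_bytes) {}

  void insert(const std::string &key, const std::string &value) {
    bytes_ += key.size() + value.size();
    cache_[key] = value;            // std::map keeps keys sorted
    if (bytes_ >= limit_)
      spill("cellstore.tmp");       // hypothetical spill file name
  }

private:
  void spill(const std::string &path) {
    std::ofstream out(path, std::ios::app);
    for (const auto &kv : cache_)   // written out in sorted key order
      out << kv.first << '\t' << kv.second << '\n';
    cache_.clear();                 // release the memory after spilling
    bytes_ = 0;
  }

  std::map<std::string, std::string> cache_;
  std::size_t bytes_ = 0;
  std::size_t limit_;
};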
For an example of what data in Hypertable looks like, let's say we wanted to store tags and reviews for a non-homogeneous set of data. Let's say we had a lot of entries for stuffed animals, furniture, and types of hats for an online store. The row identifier would be the name of the animal, furniture, or hat, and there could be columns for each tag or review. We'll also throw price in there. This is, of course, a very limited example; there could easily be hundreds of columns of data for each row. With that in mind, let's look at some entries in this hypothetical database.
In the tag columns, the column family "tag" indicates that the cell is a tag, the qualifier of the column is the name of the tag, and the value is the number of people who have tagged the item that way. In the review columns, the column family indicates that the cell is a review, the qualifier indicates who wrote the review, and the content is the actual review.
key      tag:striped  tag:fourlegs  tag:male  tag:comfortable  tag:velvet  review:joey        review:susan                    price:US  price:EU
Zebra    803          212           0         6                9           "I like my zebra"                                  25.00     20.00
Ottoman  1            343           298       0                0           ""                 "this is an excellent ottoman"  200.00    175.00
Schemas in Hypertable are very flexible, with only the column families needing to be defined beforehand. To create our example table, the HQL command would look something like the following (the table name "Items" is illustrative):
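CREATE TABLE Items (
  tag,
  review,
  price
);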
Data is stored as key/value pairs. All revisions of the data are stored in Hypertable, so timestamps are an important part of the keys. A typical key for a single cell is:
<row>
<column-family>
<column-qualifier>
<timestamp>
Timestamps in select operations are generally passed in as a range, and any values within that range are returned. This makes it easy to look at older versions of data and to look at changes over time, as well as ensuring that all previous states of the data are saved rather than overwritten. This default behavior can be overridden to store a fixed number of recent versions and allow older ones to be lazily garbage collected.
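For example, a select constraining both the row key and the timestamp range might look something like the following in HQL; the table name is the illustrative one from above, and the exact predicate syntax may differ between Hypertable versions:

SELECT price FROM Items
WHERE ROW = 'Zebra'
AND '2008-01-01 00:00:00' < TIMESTAMP < '2008-02-01 00:00:00';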
Timestamped versions of the Zebra row might look something like this:
key    timestamp  tag:striped  tag:fourlegs  tag:male  tag:comfortable  tag:velvet  review:joey        price:US  price:EU
Zebra  t=1        803          212           0         6                9           "I like my zebra"  25.00     20.00
Zebra  t=2        804
Zebra  t=4                                                                                             24.99     19.99
Random updates are handled efficiently in Hypertable through the use of Cell Caches and Cell Stores. A Range is actually made up of many Cell Stores. All the rows within a Cell Store are sorted by the row identifier. When a write occurs in Hypertable, the information is written to a commit log in the DFS and then stored in memory in a Cell Cache. When the Cell Cache hits its size limit, it is compacted and written out to disk as a Cell Store. Cell Stores end up being non-contiguous, so a Heap Merge scanner is responsible for aggregating the key/value pairs in the Cell Cache and Cell Stores and returning them in sorted order. When the Range hits its Cell Store limit, a heap merge runs and compacts many Cell Stores into a single one. In the ideal case, each Range ends up holding only a single Cell Store.
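Here is a small, self-contained C++ sketch of that heap-merge idea; it is illustrative only, not the actual Hypertable scanner. Several already-sorted runs, standing in for the Cell Cache and the Cell Stores, are merged into one sorted stream with a min-heap:

#include <cstddef>
#include <functional>
#include <iostream>
#include <queue>
#include <string>
#include <tuple>
#include <utility>
#include <vector>

using KV = std::pair<std::string, std::string>;

// Merge several sorted runs into one sorted stream. Heap entries are
// (key, run index, offset within run); the smallest key is always on top.
void heap_merge(const std::vector<std::vector<KV>> &runs) {
  using Entry = std::tuple<std::string, std::size_t, std::size_t>;
  std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry>> heap;
  for (std::size_t r = 0; r < runs.size(); ++r)
    if (!runs[r].empty())
      heap.emplace(runs[r][0].first, r, 0);
  while (!heap.empty()) {
    auto [key, r, i] = heap.top();   // copy out the smallest remaining entry
    heap.pop();
    std::cout << key << " = " << runs[r][i].second << '\n';
    if (i + 1 < runs[r].size())      // advance within the run it came from
      heap.emplace(runs[r][i + 1].first, r, i + 1);
  }
}

int main() {
  // Two sorted "cell stores"; the merged output is fully sorted.
  heap_merge({{{"ant", "1"}, {"cat", "3"}},
              {{"bat", "2"}, {"dog", "4"}}});
}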
Hypertable provides the user some control over how column data is physically stored in the DFS (e.g. on disk) through the Access Group mechanism. Each column family belongs to a single Access Group. The data for all of the column families in an Access Group is physically stored together on disk. This can improve read and write performance for column families that are accessed frequently. For example, let's say you have a table with 100 column families, but only two of the 100 column families are heavily accessed. If these two heavily accessed column families are stored together within their own access group, then the system will do disk I/O for just the data in these two columns in order to access them. Without access groups, data for all 100 columns would have to be physically read and written, even when just two of the 100 columns are being accessed.
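In HQL, such a grouping could be expressed along these lines; the table, column family, and group names are illustrative, and the exact syntax may vary between versions:

CREATE TABLE Items (
  tag,
  review,
  price,
  ACCESS GROUP hot ( tag, price ),
  ACCESS GROUP cold ( review )
);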
Data Model
The Hypertable data model consists of a multi-dimensional table of information that can be queried using a single
primary key. The first dimension of the table is the row key. The row key is the primary key and defines the order in
which the table data is physically stored. The second dimension is the column family. This dimension is somewhat
analogous to a traditional database column. The third dimension is the column qualifier. Within each column family,
there can be a theoretically infinite number of qualified instances. For example, if we were building a URL tagging
service, we might define column families content, url, and tag. Within the "tag" column family there could be an
infinite number of qualified instances, such as tag:science, tag:theater, tag:good, etc. The fourth and final dimension
is the time dimension. This dimension consists of a timestamp that is usually auto-assigned by the system and
represents the insertion time of the cell in nanoseconds since the epoch. Conceptually, a table in Hypertable can be
thought of as a three dimensional Excel spreadsheet with timestamped versions of each cell.
The following diagram graphically depicts a crawl database table called crawldb. The row key is the URL of a page
to be crawled and the column families include: title, content, and anchor. The "anchor" column family illustrates the
use of column qualifiers.
Under the hood, this multi-dimensional table of information is represented as a flat, sorted list of key/value pairs. The key is essentially the concatenation of the four dimension keys (row, column family, column qualifier, and timestamp). The following diagram depicts the flattened key. One thing to note is that the timestamp is stored one's complement, big-endian, so that the most recent cell sorts ahead of older versions of that cell.
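A small C++ sketch may make this encoding concrete; it is an illustrative approximation, not Hypertable's actual key format:

#include <cstdint>
#include <string>

// Build a flat key by concatenating row, column family, column qualifier,
// and a timestamp stored one's complement, big-endian, so that plain
// byte-wise comparison places the most recent revision of a cell first.
std::string flatten_key(const std::string &row,
                        uint8_t column_family,
                        const std::string &qualifier,
                        uint64_t timestamp_ns) {
  std::string key;
  key += row;
  key += '\0';                                  // keep fields from bleeding together
  key += static_cast<char>(column_family);
  key += qualifier;
  key += '\0';
  uint64_t complemented = ~timestamp_ns;        // one's complement reverses time order
  for (int shift = 56; shift >= 0; shift -= 8)  // emit big-endian bytes
    key += static_cast<char>((complemented >> shift) & 0xFF);
  return key;
}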
So the above crawldb table would have a flattened representation that looks something like the following:
Access Groups - Traditional databases are considered to be either row-oriented or column-oriented, depending on how data is physically stored. With a row-oriented database, all the data for a given row is stored contiguously on disk. With a column-oriented database, all data for a given column is stored contiguously on disk. Access groups in Hypertable provide a way to group columns together physically. All of the columns in an access group will have their data stored physically together in the same CellStore. This is essentially a hybrid approach. A row-oriented datastore can be simulated by putting all of the columns into a single access group. A column-oriented datastore can be simulated by putting each column into its own access group.
System Components
The following diagram illustrates all of the processes in the system and how they relate to one another.
Hyperspace
This is our system's equivalent of Chubby. Hyperspace (or Chubby) is a service that provides a filesystem for storing small amounts of metadata. It also acts as a lock manager, in that either exclusive or shared locks can be acquired on any of the files or directories. Currently it is implemented as just a single server, but it will be made distributed and highly available at some point in the near future. Google refers to Chubby as "the root of all distributed data structures," which is a good way to think of this system. (Hyperspace C++ API)
Range Server
Tables are broken into a set of contiguous row ranges, each of which is managed by a range server. Initially each
table consists of a single range that spans the entire row key space. As the table fills with data, the range will
eventually exceed a size threshold (default is 200MB) and will split into two ranges using the middle row key as a
split point. One of the ranges will stay on the same range server that held the original range and the other will get
reassigned to another range server by the Master. This splitting process continues for all of the ranges as they
continue to grow. (Range Server C++ API) Each range server handles all of the reading and writing of table data for
the ranges that it is responsible for. Range servers cache updates in memory (after writing them to a Commit Log) in
what's called a CellCache. Periodically the CellCache will get flushed to disk (e.g. the DFS) in a specially formatted
file called a CellStore. To scan over the data in an access group, the range server must do a heap merge of the
CellCache and all of the CellStores for the access group. The following diagram illustrates this process.
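As a toy illustration of the range-split rule described at the start of this section, the C++ sketch below picks a split point at the size midpoint of a range. The names and the per-row size bookkeeping are assumptions for illustration only, not Hypertable's implementation:

#include <cstddef>
#include <map>
#include <string>

// Given the byte size of each row in a range, return the middle row key:
// the first key at or past half of the range's total size.
std::string choose_split_point(const std::map<std::string, std::size_t> &row_sizes) {
  std::size_t total = 0;
  for (const auto &row : row_sizes)
    total += row.second;
  std::size_t seen = 0;
  for (const auto &row : row_sizes) {
    seen += row.second;
    if (seen >= total / 2)
      return row.first;             // split here; one half stays, one moves
  }
  return std::string();             // empty range: nothing to split
}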
Master
The Master handles all meta operations such as creating and deleting tables. Client data does not move through the Master, so the Master can be down for short periods of time without clients being aware. The Master is also responsible for detecting range server failures, re-assigning ranges when necessary, and balancing load across the range servers. Currently there is only a single Master process, but the system is designed in such a way as to allow hot standby masters. (Master C++ API)
DFS Broker
Hypertable is designed to run on top of any filesystem. To achieve this, the system abstracts the interface to the filesystem through something called the DFS broker. The DFS broker is a process that translates standardized filesystem protocol messages into the system calls that are unique to the specific filesystem. DFS brokers have been developed for HDFS (Hadoop), KFS, and local (for running on top of a local filesystem). (DFS Broker C++ API)
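The following C++ sketch conveys the idea behind the broker abstraction; the interface shown here is a hypothetical simplification, not the actual DFS Broker API:

#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical simplification of the broker idea: Hypertable codes against
// one abstract filesystem interface, and each concrete broker (HDFS, KFS,
// local) implements it by translating calls into that filesystem's own API.
class DfsBroker {
public:
  virtual ~DfsBroker() = default;
  virtual int32_t open(const std::string &path) = 0;                     // open a file, return a descriptor
  virtual void write(int32_t fd, const std::vector<uint8_t> &data) = 0;  // append data to an open file
  virtual std::vector<uint8_t> read(int32_t fd, std::size_t nbytes) = 0; // read up to nbytes
  virtual void close(int32_t fd) = 0;                                    // release the descriptor
};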
The best way to get started hands-on with Hypertable is to first download the source code, build it, and get the regression tests to pass. Once the regression tests are all passing, you can then start the servers running on top of the local filesystem with the following command (assuming the installation directory is ~/hypertable/0.9.0.8):
$ ~/hypertable/0.9.0.8/bin/start-all-servers.sh local
Successfully started DFSBroker (local)
Successfully started Hyperspace
Successfully started Hypertable.Master
Successfully started Hypertable.RangeServer
$
You can create tables, load data, and issue queries with the Hypertable command interpreter "hypertable":
$ ~/hypertable/0.9.0.8/bin/hypertable
Welcome to the hypertable command interpreter.
For information about Hypertable, visit http://www.hypertable.org/
Type 'help' for a list of commands, or 'help shell' for a
list of shell meta commands.
Hypertable>
hypertable> help
CREATE TABLE ....... Creates a table
DELETE ............. Deletes all or part of a row from a table
DESCRIBE TABLE ..... Displays a table's schema
DROP TABLE ......... Removes a table
INSERT ............. Inserts data into a table
LOAD DATA INFILE ... Loads data from a tab delimited input file into a table
SELECT ............. Selects (and display) cells from a table
SHOW CREATE TABLE .. Displays CREATE TABLE command used to create table
SHOW TABLES ........ Displays the list of tables
SHUTDOWN ........... Shuts servers down gracefully
Statements must be terminated with ';' to execute. For more information on
a specific statement, type 'help <statement>', where <statement> is one from
the preceeding list.
hypertable>
Bigtable:
Bigtable is a distributed storage system for managing structured data that is designed to scale to a very large size:
petabytes of data across thousands of commodity servers. Many projects at Google store data in Bigtable, including
web indexing, Google Earth, and Google Finance. These applications place very different demands on Bigtable, both
in terms of data size (from URLs to web pages to satellite imagery) and latency requirements (from backend bulk
processing to real-time data serving). Despite these varied demands, Bigtable has successfully provided a flexible,
high-performance solution for all of these Google products. In this paper we describe the simple data model provided
by Bigtable, which gives clients dynamic control over data layout and format, and we describe the design and
implementation of Bigtable.
Over the last two and a half years we have designed, implemented, and deployed a distributed storage system for
managing structured data at Google called Bigtable. Bigtable is designed to reliably scale to petabytes of data and
thousands of machines. Bigtable has achieved several goals: wide applicability, scalability, high performance, and
high availability. Bigtable is used by more than sixty Google products and projects, including Google Analytics,
Google Finance, Orkut, Personalized Search, Writely, and Google Earth. These products use Bigtable for a variety of
demanding workloads, which range from throughput-oriented batch-processing jobs to latency-sensitive serving of
data to end users. The Bigtable clusters used by these products span a wide range of configurations, from a handful to
thousands of servers, and store up to several hundred terabytes of data.
In many ways, Bigtable resembles a database: it shares many implementation strategies with databases. Parallel
databases and main-memory databases have achieved scalability and high performance, but Bigtable provides a
different interface than such systems. Bigtable does not support a full relational data model; instead, it provides
clients with a simple data model that supports dynamic control over data layout and format, and allows clients to
reason about the locality properties of the data represented in the underlying storage. Data is indexed using row and
column names that can be arbitrary strings. Bigtable also treats data as uninterpreted strings, although clients often
serialize various forms of structured and semi-structured data into these strings. Clients can control the locality of
their data through careful choices in their schemas. Finally, Bigtable schema parameters let clients dynamically
control whether to serve data out of memory or from disk.
Google is the king of scalability. Everyone knows Google for their large, sophisticated, and fast searching, but they don't just shine in search. Their platform approach to building scalable applications allows them to roll out internet-scale applications at an alarmingly high, competition-crushing rate. Their goal is always to build a higher performing, higher scaling infrastructure to support their products. How do they do that?
Thrift
Thrift is a software framework for scalable cross-language services development. Thrift allows you to define data
types and service interfaces in a simple definition file. Taking that file as input, the compiler generates code to be
used to easily build RPC clients and servers that communicate seamlessly across programming languages.
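For a flavor of what such a definition file looks like, here is a small, hypothetical Thrift IDL sketch; the struct, service, and method names are illustrative, not taken from any real project:

// Illustrative only: feeding this file to the Thrift compiler generates
// client and server stubs in each supported language.
struct Cell {
  1: string row,
  2: string column,
  3: binary value
}

service TableService {
  void insert(1: string table, 2: Cell cell),
  list<Cell> scan(1: string table, 2: string start_row, 3: string end_row)
}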
A map is a sorted, unique associative container that maintains a collection of key/value pairs. These collections are sorted by the key. These collections are unique in that only one value is allowed in the collection for a given key. Fundamentally, the most frequently used member of the STL map API is the [] operator. This operator allows convenient access and modification of a key's associated value. If no value exists for the specified key, the key is inserted with a default-constructed value and a reference to that new value is returned. If a value is already associated with the specified key, a reference to that value is returned. Maps are, therefore, useful for implementing collections of one-to-one mappings. A map contains elements that are key and value pairs; each element has a key, which is the basis for the sorting criterion, and a value.
- Each key may occur only once, thus duplicate keys are not allowed.
- A map can also be used as an associative array, which is an array that has an arbitrary index type.
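A minimal C++ example of the operator[] behavior described above (the variable names are illustrative):

#include <iostream>
#include <map>
#include <string>

int main() {
  std::map<std::string, int> tag_counts;  // sorted associative container, unique keys
  tag_counts["striped"] = 803;            // insert, then assign through the returned reference
  ++tag_counts["fourlegs"];               // absent key: value default-constructed to 0, then incremented
  for (const auto &entry : tag_counts)    // iteration visits keys in sorted order
    std::cout << entry.first << " -> " << entry.second << '\n';
  return 0;
}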