DBMS Module-5 2024 Chap 2
Even keyword-based search engines store large amounts of data with fast search
access, so the stored data can be considered as large NOSQL big data stores.
The rest of this chapter is organized as follows. In each of Sections 24.3 through
24.6, we will discuss one of the four main categories of NOSQL systems, and elabo-
rate further on which characteristics each category focuses on. Before that, in Sec-
tion 24.2, we discuss in more detail the concept of eventual consistency, and we
discuss the associated CAP theorem.
We saw in Section 23.3 that there are distributed concurrency control methods that
do not allow this inconsistency among copies of the same data item, thus enforcing
serializability and hence the isolation property in the presence of replication. How-
ever, these techniques often come with high overhead, which would defeat the pur-
pose of creating multiple copies to improve performance and availability in
distributed database systems such as NOSQL. In the field of distributed systems,
there are various levels of consistency among replicated data items, from weak con-
sistency to strong consistency. Enforcing serializability is considered the strongest
form of consistency, but it has high overhead so it can reduce performance of read
and write operations and hence adversely affect system performance.
The CAP theorem, which was originally introduced as the CAP principle, can be
used to explain some of the competing requirements in a distributed system with
replication. The three letters in CAP refer to three desirable properties of distributed
systems with replicated data: consistency (among replicated copies), availability (of
the system for read and write operations) and partition tolerance (in the face of the
nodes in the system being partitioned by a network fault). Availability means that
each read or write request for a data item will either be processed successfully or will
receive a message that the operation cannot be completed. Partition tolerance means
that the system can continue operating if the network connecting the nodes has a
fault that results in two or more partitions, where the nodes in each partition can
only communicate among each other. Consistency means that the nodes will have
the same copies of a replicated data item visible for various transactions.
It is important to note here that the use of the word consistency in CAP and its use
in ACID do not refer to the same identical concept. In CAP, the term consistency
refers to the consistency of the values in different copies of the same data item in a
replicated distributed system. In ACID, it refers to the fact that a transaction will
not violate the integrity constraints specified on the database schema. However, if
we consider that the consistency of replicated copies is a specified constraint, then
the two uses of the term consistency would be related.
The CAP theorem states that it is not possible to guarantee all three of the desirable
properties—consistency, availability, and partition tolerance—at the same time in a
distributed system with data replication. If this is the case, then the distributed sys-
tem designer would have to choose two properties out of the three to guarantee. It
is generally assumed that in many traditional (SQL) applications, guaranteeing
consistency through the ACID properties is important. On the other hand, in a
NOSQL distributed data store, a weaker consistency level is often acceptable, and
guaranteeing the other two properties (availability, partition tolerance) is impor-
tant. Hence, weaker consistency levels are often used in NOSQL systems instead of
guaranteeing serializability. In particular, a form of consistency known as eventual
consistency is often adopted in NOSQL systems. In Sections 24.3 through 24.6, we
will discuss some of the consistency models used in specific NOSQL systems.
The next four sections of this chapter discuss the characteristics of the four main cat-
egories of NOSQL systems. We discuss document-based NOSQL systems in Sec-
tion 24.3, and we use MongoDB as a representative system. In Section 24.4, we discuss NOSQL key-value stores.
For our example, we will create another document collection called worker to
hold information about the EMPLOYEEs who work on each project; for
example:
db.createCollection("worker", { capped : true, size : 5242880, max : 2000 } )
Each document in a collection has a unique ObjectId field, called _id, which is
automatically indexed in the collection unless the user explicitly requests no index
for the _id field. The value of ObjectId can be specified by the user, or it can be
system-generated if the user does not specify an _id field for a particular document.
System-generated ObjectIds have a specific format, which combines the timestamp
when the object is created (4 bytes, in an internal MongoDB format), the node id
(3 bytes), the process id (2 bytes), and a counter (3 bytes) into a 12-byte Id value.
User-generated ObjectIds can have any value specified by the user as long as it
uniquely identifies the document, so these Ids are similar to primary keys in
relational systems.
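As an illustration of this byte layout, the following sketch (in Python, which is not used elsewhere in this chapter and serves only as illustration) splits a hypothetical 12-byte ObjectId, given as a 24-character hex string, into the four components described above. It is not MongoDB's actual implementation.

import datetime

def decode_object_id(oid_hex):
    # Illustrative only; layout follows the description above:
    # 4-byte timestamp, 3-byte node id, 2-byte process id, 3-byte counter.
    raw = bytes.fromhex(oid_hex)
    assert len(raw) == 12, "an ObjectId is 12 bytes"
    seconds = int.from_bytes(raw[0:4], "big")
    return {
        "created": datetime.datetime.fromtimestamp(seconds, tz=datetime.timezone.utc),
        "node_id": raw[4:7].hex(),
        "process_id": int.from_bytes(raw[7:9], "big"),
        "counter": int.from_bytes(raw[9:12], "big"),
    }

# Hypothetical ObjectId value, used only to show the decomposition.
print(decode_object_id("507f1f77bcf86cd799439011"))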
A collection does not have a schema. The structure of the data fields in documents
is chosen based on how documents will be accessed and used, and the user can
choose a normalized design (similar to normalized relational tuples) or a denor-
malized design (similar to XML documents or complex objects). Interdocument
references can be specified by storing in one document the ObjectId or ObjectIds of
other related documents. Figure 24.1(a) shows a simplified MongoDB document
with some of the data from Figure 5.6 of the COMPANY database example
that is used throughout the book. In our example, the _id values are user-defined,
and the documents whose _id starts with P (for project) will be stored in the “project”
collection, whereas those whose _id starts with W (for worker) will be stored in the
“worker” collection.
In Figure 24.1(a), the worker information is embedded in the project document, so
there is no need for the “worker” collection. This is known as the denormalized pat-
tern, which is similar to creating a complex object (see Chapter 12) or an XML
document (see Chapter 13). A list of values that is enclosed in square brackets [ … ]
within a document represents a field whose value is an array.
Another option is to use the design in Figure 24.1(b), where worker references are
embedded in the project document, but the worker documents themselves are
stored in a separate “worker” collection. A third option in Figure 24.1(c) would
use a normalized design, similar to First Normal Form relations (see Sec-
tion 14.3.4). The choice of which design option to use depends on how the data
will be accessed.
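Figures 24.1(a) and (b) themselves are not reproduced in this excerpt; the following rough sketch, written as Python dictionaries purely for illustration, shows the two shapes described above. The worker names and hours follow the COMPANY example, but field names such as Workers and WorkerIds are made up for this sketch rather than taken from the figure.

# Denormalized (embedded) design in the spirit of Figure 24.1(a):
# the workers are nested inside the project document.
project_embedded = {
    "_id": "P1",
    "Pname": "ProductX",
    "Plocation": "Bellaire",
    "Workers": [                       # array-valued field
        {"Ename": "John Smith", "Hours": 32.5},
        {"Ename": "Joyce English", "Hours": 20.0},
    ],
}

# Design in the spirit of Figure 24.1(b): only worker *references* are embedded;
# the worker documents themselves are stored in a separate "worker" collection.
project_with_references = {
    "_id": "P1",
    "Pname": "ProductX",
    "Plocation": "Bellaire",
    "WorkerIds": ["W1", "W2"],         # _id values of related worker documents
}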
It is important to note that the simple design in Figure 24.1(c) is not the general nor-
malized design for a many-to-many relationship, such as the one between employees
and projects; rather, we would need three collections for “project”, “employee”, and
“works_on”, as we discussed in detail in Section 9.1. Many of the design tradeoffs
that were discussed in Chapters 9 and 14 (for first normal form relations and for ER-
to-relational mapping options), and Chapters 12 and 13 (for complex objects and
XML) are applicable for choosing the appropriate design for document structures
Figure 24.1(c): normalized project and worker documents (not a fully normalized
design for M:N relationships):

{ _id:        "P1",
  Pname:      "ProductX",
  Plocation:  "Bellaire"
}

{ _id:        "W1",
  Ename:      "John Smith",
  ProjectId:  "P1",
  Hours:      32.5
}
and document collections, so we will not repeat the discussions here. In the design
in Figure 24.1(c), an EMPLOYEE who works on several projects would be repre-
sented by multiple worker documents with different _id values; each document
would represent the employee as worker for a particular project. This is similar to
the design decisions for XML schema design (see Section 13.6). However, it is again
important to note that the typical document-based system does not have a schema,
so the design rules would have to be followed whenever individual documents are
inserted into a collection.
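For instance, the project and worker documents of Figure 24.1(c) could be inserted and queried from an application program. The sketch below uses the pymongo Python driver; the driver choice, connection string, and database name are assumptions made for illustration (the chapter itself shows only MongoDB shell commands).

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")   # assumed local server
db = client["company"]                              # hypothetical database name

# Normalized design of Figure 24.1(c): separate project and worker documents.
db.project.insert_one({"_id": "P1", "Pname": "ProductX", "Plocation": "Bellaire"})
db.worker.insert_one({"_id": "W1", "Ename": "John Smith",
                      "ProjectId": "P1", "Hours": 32.5})

# Follow the interdocument reference: retrieve the workers of project P1.
for w in db.worker.find({"ProjectId": "P1"}):
    print(w["Ename"], w["Hours"])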
When a collection is distributed (sharded) over multiple nodes, each shard will
contain fewer documents than if the entire collection were stored at one node,
thus further improving performance.
There are two ways to partition a collection into shards in MongoDB—range
partitioning and hash partitioning. Both require that the user specify a particular
document field to be used as the basis for partitioning the documents into shards.
The partitioning field—known as the shard key in MongoDB—must have two
characteristics: it must exist in every document in the collection, and it must have an
index. The ObjectId can be used, but any other field possessing these two character-
istics can also be used as the basis for sharding. The values of the shard key are
divided into chunks either through range partitioning or hash partitioning, and the
documents are partitioned based on the chunks of shard key values.
Range partitioning creates the chunks by specifying a range of key values; for example,
if the shard key values ranged from one to ten million, it is possible to create ten
ranges—1 to 1,000,000; 1,000,001 to 2,000,000; … ; 9,000,001 to 10,000,000—and
each chunk would contain the key values in one range. Hash partitioning applies a
hash function h(K) to each shard key K, and the partitioning of keys into chunks is
based on the hash values (we discussed hashing and its advantages and disadvantages
in Section 16.8). In general, if range queries are commonly applied to a collection (for
example, retrieving all documents whose shard key value is between 200 and 400),
then range partitioning is preferred because each range query will typically be submit-
ted to a single node that contains all the required documents in one shard. If most
searches retrieve one document at a time, hash partitioning may be preferable because
it randomizes the distribution of shard key values into chunks.
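The difference between the two chunking strategies can be sketched in a few lines. This toy example is not MongoDB's internal algorithm: the chunk size, number of chunks, and hash function are made up for illustration.

import hashlib

def range_chunk(shard_key, chunk_size=1_000_000):
    # Range partitioning: chunk 0 holds keys 1..1,000,000, chunk 1 holds
    # keys 1,000,001..2,000,000, and so on.
    return (shard_key - 1) // chunk_size

def hash_chunk(shard_key, num_chunks=10):
    # Hash partitioning: the chunk is chosen from a hash of the key, so
    # nearby key values are spread (pseudo)randomly across chunks.
    digest = hashlib.md5(str(shard_key).encode()).hexdigest()
    return int(digest, 16) % num_chunks

print(range_chunk(1_500_000))   # -> 1: the chunk covering 1,000,001..2,000,000
print(hash_chunk(1_500_000))    # some chunk in 0..9; nearby keys usually differ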
When sharding is used, MongoDB queries are submitted to a module called the query
router, which keeps track of which nodes contain which shards based on the particu-
lar partitioning method used on the shard keys. The query (CRUD operation) will be
routed to the nodes that contain the shards that hold the documents that the query is
requesting. If the system cannot determine which shards hold the required docu-
ments, the query will be submitted to all the nodes that hold shards of the collection.
Sharding and replication are used together; sharding focuses on improving perfor-
mance via load balancing and horizontal scalability, whereas replication focuses on
ensuring system availability when certain nodes fail in the distributed system.
There are many additional details about the distributed system architecture and com-
ponents of MongoDB, but a full discussion is outside the scope of our presentation.
MongoDB also provides many other services in areas such as system administration,
indexing, security, and data aggregation, but we will not discuss these features here.
Full documentation of MongoDB is available online (see the bibliographic notes).
A key-value store provides a simple data model and a basic set of operations that can be used by the application programmers. The key is a
unique identifier associated with a data item and is used to locate this data item
rapidly. The value is the data item itself, and it can have very different formats for
different key-value storage systems. In some cases, the value is just a string of bytes
or an array of bytes, and the application using the key-value store has to interpret
the structure of the data value. In other cases, some standard formatted data is
allowed; for example, structured data rows (tuples) similar to relational data, or
semistructured data using JSON or some other self-describing data format. Differ-
ent key-value stores can thus store unstructured, semistructured, or structured data
items (see Section 13.1). The main characteristic of key-value stores is the fact that
every value (data item) must be associated with a unique key, and that retrieving the
value by supplying the key must be very fast.
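The basic interface implied above, where a unique key identifies a value and lookup by key must be fast, can be sketched as a toy in-memory store (illustrative only, not modeled on any particular product):

class TinyKeyValueStore:
    """Toy in-memory key-value store: put/get/delete by unique key."""

    def __init__(self):
        self._data = {}          # a Python dict gives fast lookup by key

    def put(self, key, value):
        self._data[key] = value  # the store does not interpret the value

    def get(self, key):
        return self._data.get(key)

    def delete(self, key):
        self._data.pop(key, None)

store = TinyKeyValueStore()
store.put("user:42", b'{"name": "John Smith"}')   # here the value is just bytes
print(store.get("user:42"))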
There are many systems that fall under the key-value store label, so rather than pro-
vide a lot of details on one particular system, we will give a brief introductory over-
view for some of these systems and their characteristics.
DynamoDB Overview
The DynamoDB system is an Amazon product and is available as part of Amazon’s
AWS/SDK platforms (Amazon Web Services/Software Development Kit). It can be
used as the data storage component of Amazon’s cloud computing services.
DynamoDB data model. The basic data model in DynamoDB uses the concepts
of tables, items, and attributes. A table in DynamoDB does not have a schema; it
holds a collection of self-describing items. Each item will consist of a number of
(attribute, value) pairs, and attribute values can be single-valued or multivalued. So
basically, a table will hold a collection of items, and each item is a self-describing
record (or object). DynamoDB also allows the user to specify the items in JSON for-
mat, and the system will convert them to the internal storage format of DynamoDB.
When a table is created, it is required to specify a table name and a primary key;
the primary key will be used to rapidly locate the items in the table. Thus, the pri-
mary key is the key and the item is the value for the DynamoDB key-value store.
The primary key attribute must exist in every item in the table. The primary key can
be one of the following two types:
■ A single attribute. The DynamoDB system will use this attribute to build a
hash index on the items in the table. This is called a hash type primary key.
The items are not ordered in storage by the value of the hash attribute.
■ A pair of attributes. This is called a hash and range type primary key. The
primary key will be a pair of attributes (A, B): attribute A will be used for hash-
ing, and because there will be multiple items with the same value of A, the B
values will be used for ordering the records with the same A value. A table
with this type of key can have additional secondary indexes defined on its
attributes. For example, if we want to store multiple versions of some type of
items in a table, we could use ItemID as hash and Date or Timestamp (when
the version was created) as range in a hash and range type primary key.
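As a concrete illustration of the hash and range type primary key described above, the following sketch creates such a table with the boto3 Python SDK. The library choice, region, table name, and attribute names are assumptions for illustration; the chapter does not prescribe a particular client interface.

import boto3

dynamodb = boto3.resource("dynamodb", region_name="us-east-1")  # assumed region

# ItemID is the hash attribute; CreatedAt orders the versions of each item.
table = dynamodb.create_table(
    TableName="ItemVersions",                      # hypothetical table name
    KeySchema=[
        {"AttributeName": "ItemID", "KeyType": "HASH"},
        {"AttributeName": "CreatedAt", "KeyType": "RANGE"},
    ],
    AttributeDefinitions=[
        {"AttributeName": "ItemID", "AttributeType": "S"},
        {"AttributeName": "CreatedAt", "AttributeType": "S"},
    ],
    BillingMode="PAY_PER_REQUEST",
)
table.wait_until_exists()

# Each item is a self-describing set of (attribute, value) pairs.
table.put_item(Item={"ItemID": "doc-1", "CreatedAt": "2024-01-15T10:00:00Z",
                     "Title": "Draft", "Tags": ["nosql", "demo"]})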
■ Consistent hashing. An item (k, v) will be stored on the node whose position in the ring follows
the position of h(k) on the ring in a clockwise direction. In Figure 24.2(a), we
assume there are three nodes in the distributed cluster labeled A, B, and C,
where node C has a bigger capacity than nodes A and B. In a typical system,
there will be many more nodes. On the circle, two instances each of A and B
are placed, and three instances of C (because of its higher capacity), in a
pseudorandom manner to cover the circle. Figure 24.2(a) indicates which
(k, v) items are placed in which nodes based on the h(k) values.
Figure 24.2: Example of consistent hashing. (a) Ring with two placements each of
nodes A and B and three placements of node C; items whose h(k) values fall in
ranges 1, 2, and 3 are assigned to nodes A, B, and C, respectively. (b) The ring
after node D is added with two placements; items whose keys hash to range 4
migrate to node D, reducing the ranges covered by the existing nodes.
■ The h(k) values that fall in the parts of the circle marked as range 1 in Fig-
ure 24.2(a) will have their (k, v) items stored in node A because that is the node
whose label follows h(k) on the ring in a clockwise direction; those in range 2
are stored in node B; and those in range 3 are stored in node C. This scheme
allows horizontal scalability because when a new node is added to the distrib-
uted system, it can be added in one or more locations on the ring depending
on the node capacity. Only a limited percentage of the (k, v) items will be reas-
signed to the new node from the existing nodes based on the consistent hash-
ing placement algorithm. Also, those items assigned to the new node may not
all come from only one of the existing nodes because the new node can have
multiple locations on the ring. For example, if a node D is added and it has two
placements on the ring as shown in Figure 24.2(b), then some of the items
from nodes B and C would be moved to node D. The items whose keys hash to
range 4 on the circle (see Figure 24.2(b)) would be migrated to node D. This
scheme also allows replication by placing the number of specified replicas of
an item on successive nodes on the ring in a clockwise direction. The sharding
is built into the method, and different items in the store (file) are located on
different nodes in the distributed cluster, which means the items are horizon-
tally partitioned (sharded) among the nodes in the distributed system. When
a node fails, its load of data items can be distributed to the other existing nodes
whose labels follow the labels of the failed node in the ring. And nodes with
higher capacity can have more locations on the ring, as illustrated by node C
in Figure 24.2(a), and thus store more items than smaller-capacity nodes.
■ Consistency and versioning. Voldemort uses a method similar to the one
developed for DynamoDB for consistency in the presence of replicas. Basi-
cally, concurrent write operations are allowed by different processes so there
could exist two or more different values associated with the same key at dif-
ferent nodes when items are replicated. Consistency is achieved when the
item is read by using a technique known as versioning and read repair. Con-
current writes are allowed, but each write is associated with a vector clock
value. When a read occurs, it is possible that different versions of the same
value (associated with the same key) are read from different nodes. If the
system can reconcile to a single final value, it will pass that value to the read;
otherwise, more than one version can be passed back to the application,
which will reconcile the various versions into one version based on the
application semantics and give this reconciled value back to the nodes.
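The ring placement described in the consistent hashing discussion above can be sketched compactly. This is a toy version: a real store such as Voldemort adds replication on successive nodes, capacity-based placement counts, and rebalancing, and its actual hash function differs.

import bisect
import hashlib

def h(value):
    # Map a string to a position on the ring (0 .. 2**32 - 1).
    return int(hashlib.md5(value.encode()).hexdigest(), 16) % (2**32)

class HashRing:
    def __init__(self, placements):
        # placements: {node_name: number_of_positions}; higher-capacity nodes
        # (like node C in Figure 24.2(a)) get more positions on the ring.
        self._ring = sorted(
            (h(node + "#" + str(i)), node)
            for node, count in placements.items()
            for i in range(count)
        )
        self._positions = [pos for pos, _ in self._ring]

    def node_for(self, key):
        # Item (k, v) goes to the first node position clockwise from h(k).
        idx = bisect.bisect_right(self._positions, h(key)) % len(self._ring)
        return self._ring[idx][1]

ring = HashRing({"A": 2, "B": 2, "C": 3})           # mirrors Figure 24.2(a)
print(ring.node_for("user:42"))
# Adding node D with two placements (Figure 24.2(b)) moves only the keys
# whose hash values fall in the ranges now covered by D:
ring_with_d = HashRing({"A": 2, "B": 2, "C": 3, "D": 2})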
Redis key-value cache and store. Redis differs from the other systems dis-
cussed here because it caches its data in main memory to further improve perfor-
mance. It offers master-slave replication and high availability, and it also offers
persistence by backing up the cache to disk.
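A minimal sketch of using Redis as a key-value cache through the redis-py client follows; the client library, host, and key names are assumptions for illustration.

import redis

r = redis.Redis(host="localhost", port=6379)   # assumed local Redis server

r.set("session:42", "John Smith", ex=3600)     # cached in memory, expires in 1 hour
print(r.get("session:42"))                     # b'John Smith'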
Keeping multiple versions of stored data items, each with an associated timestamp, is
part of the Hbase data model (this is similar to the concept of attribute versioning in
temporal databases, which we shall discuss in Section 26.2). As with other NOSQL
systems, unique keys are associated with stored data items for fast access, but the
keys identify cells in the storage system. Because the focus is on high performance
when storing huge amounts of data, the data model includes some storage-related
concepts. We discuss the Hbase data modeling concepts and define the terminol-
ogy next. It is important to note that the use of the words table, row, and column is
not identical to their use in relational databases, but the uses are related.
■ Tables and Rows. Data in Hbase is stored in tables, and each table has a
table name. Data in a table is stored as self-describing rows. Each row has a
unique row key, and row keys are strings that must have the property that
they can be lexicographically ordered, so characters that do not have a lexi-
cographic order in the character set cannot be used as part of a row key.
■ Column Families, Column Qualifiers, and Columns. A table is associated
with one or more column families. Each column family will have a name,
and the column families associated with a table must be specified when the
table is created and cannot be changed later. Figure 24.3(a) shows how a table
may be created; the table name is followed by the names of the column fami-
lies associated with the table. When the data is loaded into a table, each col-
umn family can be associated with many column qualifiers, but the column
qualifiers are not specified as part of creating a table. So the column qualifiers
make the model a self-describing data model because the qualifiers can be
dynamically specified as new rows are created and inserted into the table. A
column is specified by a combination of ColumnFamily:ColumnQualifier.
Basically, column families are a way of grouping together related columns
(attributes in relational terminology) for storage purposes, except that the
column qualifier names are not specified during table creation. Rather, they
are specified when the data is created and stored in rows, so the data is self-
describing since any column qualifier name can be used in a new row of data
(see Figure 24.3(b)). However, it is important that the application program-
mers know which column qualifiers belong to each column family, even
though they have the flexibility to create new column qualifiers on the fly
when new data rows are created. The concept of column family is somewhat
similar to vertical partitioning (see Section 23.2), because columns (attri-
butes) that are accessed together because they belong to the same column
family are stored in the same files. Each column family of a table is stored in
its own files using the HDFS file system.
■ Versions and Timestamps. Hbase can keep several versions of a data item,
along with the timestamp associated with each version. The timestamp is a
long integer number that represents the system time when the version was
created, so newer versions have larger timestamp values. Hbase uses mid-
night ‘January 1, 1970 UTC’ as timestamp value zero, and uses a long integer
that measures the number of milliseconds since that time as the system
timestamp value (this is similar to the value returned by the Java utility
java.util.Date.getTime() and is also used in MongoDB). It is also possible for
Figure 24.3
Examples in Hbase. (a) Creating a table called EMPLOYEE with three column families: Name, Address, and Details.
(b) Inserting some row data in the EMPLOYEE table; different rows can have different self-describing column qualifiers
(Fname, Lname, Nickname, Mname, Minit, Suffix, … for column family Name; Job, Review, Supervisor, Salary
for column family Details). (c) Some CRUD operations of Hbase.
(a) creating a table:
create 'EMPLOYEE', 'Name', 'Address', 'Details'
(b) inserting some row data in the EMPLOYEE table:
put 'EMPLOYEE', 'row1', 'Name:Fname', 'John'
put 'EMPLOYEE', 'row1', 'Name:Lname', 'Smith'
put 'EMPLOYEE', 'row1', 'Name:Nickname', 'Johnny'
put 'EMPLOYEE', 'row1', 'Details:Job', 'Engineer'
put 'EMPLOYEE', 'row1', 'Details:Review', 'Good'
put 'EMPLOYEE', 'row2', 'Name:Fname', 'Alicia'
put 'EMPLOYEE', 'row2', 'Name:Lname', 'Zelaya'
put 'EMPLOYEE', 'row2', 'Name:MName', 'Jennifer'
put 'EMPLOYEE', 'row2', 'Details:Job', 'DBA'
put 'EMPLOYEE', 'row2', 'Details:Supervisor', 'James Borg'
put 'EMPLOYEE', 'row3', 'Name:Fname', 'James'
put 'EMPLOYEE', 'row3', 'Name:Minit', 'E'
put 'EMPLOYEE', 'row3', 'Name:Lname', 'Borg'
put 'EMPLOYEE', 'row3', 'Name:Suffix', 'Jr.'
put 'EMPLOYEE', 'row3', 'Details:Job', 'CEO'
put 'EMPLOYEE', 'row3', 'Details:Salary', '1,000,000'
the user to define the timestamp value explicitly in a Date format rather than
using the system-generated timestamp.
■ Cells. A cell holds a basic data item in Hbase. The key (address) of a cell is
specified by a combination of (table, rowid, columnfamily, columnqualifier,
timestamp). If timestamp is left out, the latest version of the item is retrieved
unless a default number of versions is specified, say the latest three versions.
The default number of versions to be retrieved, as well as the default number
of versions that the system needs to keep, are parameters that can be speci-
fied during table creation.
■ Namespaces. A namespace is a collection of tables. A namespace basically
specifies a collection of one or more tables that are typically used together by
user applications, and it corresponds to a database that contains a collection
of tables in relational terminology.
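To tie together the cell addressing and versioning described above, here is a toy in-memory model (illustrative only, not Hbase's implementation): each cell is addressed by row key and ColumnFamily:ColumnQualifier, holds timestamped versions, and a read without a timestamp returns the latest version.

import time
from collections import defaultdict

class ToyHbaseTable:
    def __init__(self, name, column_families, max_versions=3):
        self.name = name
        self.families = set(column_families)    # fixed when the table is created
        self.max_versions = max_versions
        # cells[(row_key, "Family:Qualifier")] -> list of (timestamp_ms, value)
        self.cells = defaultdict(list)

    def put(self, row_key, column, value, timestamp=None):
        family = column.split(":", 1)[0]
        assert family in self.families, "unknown column family"
        ts = timestamp or int(time.time() * 1000)   # ms since Jan 1, 1970 UTC
        versions = self.cells[(row_key, column)]
        versions.append((ts, value))
        versions.sort(reverse=True)                 # newest version first
        del versions[self.max_versions:]            # keep only N versions

    def get(self, row_key, column, timestamp=None):
        versions = self.cells[(row_key, column)]
        if timestamp is None:
            return versions[0][1] if versions else None   # latest version
        return next((v for ts, v in versions if ts <= timestamp), None)

emp = ToyHbaseTable("EMPLOYEE", ["Name", "Address", "Details"])
emp.put("row1", "Name:Fname", "John")
emp.put("row1", "Details:Job", "Engineer")
print(emp.get("row1", "Name:Fname"))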
NOSQL Graph Databases and Neo4j
Figure 24.4
Examples in Neo4j using the Cypher language. (a) Creating some nodes. (b) Creating some relationships.
(a) creating some nodes for the COMPANY data (from Figure 5.6):
CREATE (e1:EMPLOYEE {Empid: '1', Lname: 'Smith', Fname: 'John', Minit: 'B'})
CREATE (e2:EMPLOYEE {Empid: '2', Lname: 'Wong', Fname: 'Franklin'})
CREATE (e3:EMPLOYEE {Empid: '3', Lname: 'Zelaya', Fname: 'Alicia'})
CREATE (e4:EMPLOYEE {Empid: '4', Lname: 'Wallace', Fname: 'Jennifer', Minit: 'S'})
…
CREATE (d1:DEPARTMENT {Dno: '5', Dname: 'Research'})
CREATE (d2:DEPARTMENT {Dno: '4', Dname: 'Administration'})
…
CREATE (p1:PROJECT {Pno: '1', Pname: 'ProductX'})
CREATE (p2:PROJECT {Pno: '2', Pname: 'ProductY'})
CREATE (p3:PROJECT {Pno: '10', Pname: 'Computerization'})
CREATE (p4:PROJECT {Pno: '20', Pname: 'Reorganization'})
…
CREATE (loc1:LOCATION {Lname: 'Houston'})
CREATE (loc2:LOCATION {Lname: 'Stafford'})
CREATE (loc3:LOCATION {Lname: 'Bellaire'})
CREATE (loc4:LOCATION {Lname: 'Sugarland'})
…
(b) creating some relationships for the COMPANY data (from Figure 5.6):
CREATE (e1)-[:WorksFor]->(d1)
CREATE (e3)-[:WorksFor]->(d2)
…
CREATE (d1)-[:Manager]->(e2)
CREATE (d2)-[:Manager]->(e4)
…
CREATE (d1)-[:LocatedIn]->(loc1)
CREATE (d1)-[:LocatedIn]->(loc3)
CREATE (d1)-[:LocatedIn]->(loc4)
CREATE (d2)-[:LocatedIn]->(loc2)
…
CREATE (e1)-[:WorksOn {Hours: 32.5}]->(p1)
CREATE (e1)-[:WorksOn {Hours: 7.5}]->(p2)
CREATE (e2)-[:WorksOn {Hours: 10.0}]->(p1)
CREATE (e2)-[:WorksOn {Hours: 10.0}]->(p2)
CREATE (e2)-[:WorksOn {Hours: 10.0}]->(p3)
CREATE (e2)-[:WorksOn {Hours: 10.0}]->(p4)
…
Query 1 in Figure 24.4(d) shows how to use the MATCH and RETURN clauses in a
query, and the query retrieves the locations for department number 5. MATCH specifies
the pattern and the query variables (d and loc), and RETURN specifies the query
result to be retrieved by referring to the query variables. Query 2 has three variables
(e, w, and p), and returns the projects and hours per week that the employee with
Empid = 2 works on. Query 3, on the other hand, returns the employees and hours
per week who work on the project with Pno = 2. Query 4 illustrates the ORDER BY
clause and returns all employees and the projects they work on, sorted by Ename. It
is also possible to limit the number of returned results by using the LIMIT clause as
in query 5, which only returns the first 10 answers.
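Figure 24.4(d) itself is not reproduced in this excerpt, but a query in the spirit of Query 1, the locations of department number 5, can be sketched against the nodes and relationships created in Figure 24.4(a) and (b). Running it through the Neo4j Python driver, and the connection URI and credentials shown, are assumptions for illustration; the chapter presents the Cypher directly.

from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687",
                              auth=("neo4j", "password"))   # assumed credentials

query1 = """
MATCH (d:DEPARTMENT {Dno: '5'})-[:LocatedIn]->(loc:LOCATION)
RETURN d.Dname, loc.Lname
"""

with driver.session() as session:
    for record in session.run(query1):
        print(record["d.Dname"], record["loc.Lname"])

driver.close()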
Query 6 illustrates the use of WITH and aggregation, although the WITH clause can
be used to separate clauses in a query even if there is no aggregation. Query 6 also illus-
trates the WHERE clause to specify additional conditions, and the query returns the
employees who work on more than two projects, as well as the number of projects each
employee works on. It is also common to return the nodes and relationships them-
selves in the query result, rather than the property values of the nodes as in the previ-
ous queries. Query 7 is similar to query 5 but returns the nodes and relationships only,
and so the query result can be displayed as a graph using Neo4j’s visualization tool. It is
also possible to add or remove labels and properties from nodes. Query 8 shows how to
add more properties to a node by adding a Job property to an employee node.
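Similarly, a query in the spirit of Query 6, the employees who work on more than two projects together with the project count, might look as follows (again a sketch, using the same assumed driver setup as above):

from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687",
                              auth=("neo4j", "password"))   # assumed credentials

query6 = """
MATCH (e:EMPLOYEE)-[:WorksOn]->(p:PROJECT)
WITH e, count(p) AS NumProjects
WHERE NumProjects > 2
RETURN e.Lname, NumProjects
"""

with driver.session() as session:
    for record in session.run(query6):
        print(record["e.Lname"], record["NumProjects"])

driver.close()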
The above gives a brief flavor of the Cypher query language of Neo4j. The full lan-
guage manual is available online (see the bibliographic notes).