adsu4

Uploaded by

Anisha Patil
Copyright
© © All Rights Reserved
Unit 4 – NoSQL Database Part 1

Mrs. Deepali Jaadhav 1


Unit 4: NoSQL Database Part 1
 Introduction to NoSQL Database
 History of NoSQL Database
 Relational databases vs new NoSQL databases
 Data Management with Distributed Databases
 ACID and BASE
 Types of consistency
 CAP Theorem
 Replication and Sharding
 NoSQL Types:
 Key-Value Database,
 Document Database,
 Column Family Database and
 Graph Database



Unit 4: NoSQL Database Part 1
 Document Databases with MongoDB
 What Is a Document
 Differences Between Document and Relational Databases
 Managing Multiple Documents in Collections
 Basic Operations on Document Databases
 • Inserting • Deleting • Updating • Retrieving
 Document Database with MongoDB
 Key-Value Databases
 Essential Features of Key-Value Databases
 Key-Value Database Data Modeling Terms
 Key-Value Architecture Terms
 Limitations of Key-Value Databases
 Key-value Database with Riak



Introduction to NoSQL Databases
 A database management system (DBMS) provides the mechanism to store
and retrieve data.
 There are different kinds of database management systems:
1. RDBMS (Relational Database Management Systems)
2. OLAP (Online Analytical Processing)
3. NoSQL (Not only SQL)



What are NoSQL databases?
 “NoSQL refers to an ill-defined set of mostly open-source databases,
mostly developed in the early 21st century, and mostly not using
SQL”



What is a NoSQL database?
 NoSQL databases differ from relational databases such as MySQL.
 In a relational database you must create the table, define the schema, and set
the data types of the fields before you can insert any data.
 In most NoSQL databases you do not have to define a schema first; you can insert and update
data on the fly.
 NoSQL databases are generally easier to scale out, and they are often faster for
simple read and write operations.
 When you are dealing with huge amounts of data, a NoSQL
database is often a good choice.
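
A minimal Python sketch of this contrast. It assumes the standard-library sqlite3 module as a stand-in for a relational database and a plain list of dictionaries as a stand-in for a schema-less collection; the table, field names, and values are made up for illustration.

```python
import sqlite3

# Relational side: the table and its column types must exist before any insert.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.execute("INSERT INTO users (id, name) VALUES (?, ?)", (1, "Asha"))

def insert_unknown_column():
    """Try to insert a field the schema does not define; SQLite rejects it."""
    try:
        conn.execute("INSERT INTO users (id, name, city) VALUES (2, 'Ravi', 'Pune')")
        return True
    except sqlite3.OperationalError:
        return False

# Schema-less side (illustrative stand-in, not a real NoSQL product):
# documents in one collection may carry different fields,
# and nothing has to be declared up front.
collection = []
collection.append({"_id": 1, "name": "Asha"})
collection.append({"_id": 2, "name": "Ravi", "city": "Pune", "tags": ["new"]})
```

The relational insert fails until the schema is altered, while the second document simply carries its extra fields.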



Common Characteristics
 Not using the relational model
 Run well on clusters
 Can handle huge amount of data
 Open Source
 Built for 21st century web access
 Schema-less / schema-free
 BASE (not ACID)



Limitations of Relational databases
 In a relational database we must define the structure and schema of the data
first; only then can we store and process the data.
 Relational database systems provide consistency and integrity of
data by enforcing the ACID properties (Atomicity, Consistency,
Isolation and Durability).
 However, for many workloads enforcing these properties adds significant
performance overhead and can make database responses
slow.
 An RDBMS does not offer a convenient way to create, insert, update, or
delete data whose structure is not known in advance.
 On the other hand, many NoSQL databases store their data in JSON or a JSON-like format, which is
compatible with most modern applications.



History of NoSQL Databases
 In the mid-1990s, the internet became extremely popular, and
relational databases could not keep up with the flow of
information demanded by users,
or with the larger variety of data types that emerged from this
evolution.
 This led to the development of non-relational databases, often
referred to as NoSQL.
 NoSQL databases can handle irregular data quickly and avoid the
rigidity of SQL by replacing “organized” storage with more flexibility.



History of NoSQL Databases
 1998- Carlo Strozzi uses the term NoSQL for his lightweight, open-source relational database
 2000- Development of the graph database Neo4j begins
 2004- Google begins developing BigTable
 2005- CouchDB is launched
 2007- The research paper on Amazon Dynamo is released
 2008- Facebook open-sources the Cassandra project
 2009- The term NoSQL is reintroduced



Features of NoSQL
 Non-relational:
 NoSQL databases do not follow the relational model
 Do not provide tables with flat, fixed-column records
 Work with self-contained aggregates or BLOBs (binary large
objects)
 Do not require object-relational mapping or data normalization
 Often omit complex features such as query languages, query planners,
referential integrity, joins, and ACID transactions



Features of NoSQL
 Schema-free
 NoSQL databases are either schema-free or have relaxed schemas
 Do not require any sort of definition of the schema of the data
 Offers heterogeneous structures of data in the same domain
 Simple API
 Offer easy-to-use interfaces for storing and querying
data
 APIs allow low-level data manipulation & selection methods
 Text-based protocols, mostly HTTP REST with JSON
 Mostly no standards-based NoSQL query language is used
 Web-enabled databases running as internet-facing services



Features of NoSQL
 Distributed
 NoSQL databases can run across multiple nodes in a distributed fashion
 Offer auto-scaling and fail-over capabilities
 ACID guarantees are often sacrificed for scalability and throughput
 Mostly no synchronous replication between distributed nodes; instead
asynchronous multi-master replication, peer-to-peer replication, or HDFS
replication
 Often provide only eventual consistency
 Shared-nothing architecture, which requires less coordination and allows
higher distribution.



RDBMS Vs NoSQL
 RDBMS: structured data; more functionality but
lower performance.
NoSQL: structured or semi-structured data; less functionality but higher
performance.

1. Constraints are generally not supported in NoSQL
2. Joins are generally not supported in NoSQL
Supporting these features limits the scalability of a database, so when using a NoSQL
database like MongoDB, you implement this functionality at the
application level.



RDBMS Vs NoSQL

RDBMS | NoSQL
integrity is mission-critical | OK as long as most data is correct
data format consistent, well-defined | data format unknown or inconsistent
data is of long-term value | data are expected to be replaced
data updates are frequent | write-once, read multiple (no updates, or at least not often)
predictable, linear growth | unpredictable growth (exponential)
non-programmers writing queries | only programmers writing queries
regular backup | replication
access through master server | sharding across multiple nodes

Data Management with Distributed Databases
 Databases are designed to do two things: store data and retrieve data.
 To meet these objectives, the database management systems must do three
things:
 Store data persistently
 Maintain data consistency
 Ensure data availability



Store Data Persistently
 Data must be stored persistently; that is, it must be stored in a way that data is
not lost when the database server is shut down.
 If data were only stored in memory—that is, RAM —then it would be lost when
power to the memory is lost.
 Only data that is stored on disk, flash, tape, or other long-term storage is
considered persistently stored.
 Indices are used for faster retrieval of data.



Maintain Data Consistency
 It is important to ensure that the correct data is written to a persistent storage
device.
 If the write or read operation does not accurately record or retrieve data, the
database will not be of much use.
 Concurrency control mechanism is used for data consistency.



Ensure Data Availability
 Data should be available whenever it is needed. This is difficult to
guarantee.
 One way to avoid the problem of an unavailable database server is to
have two database servers.
 One is used for updating data and responding to queries while the other is kept
as a backup in case the first server fails. The server that is used for updating and
responding to queries is called the primary server, and the other is the backup
server.



Ensure Data Availability
 With data consistent on two database servers, you can be sure that if
the primary database fails, you can switch to using the backup database
and know that you have the same data on both.
 When the primary database is back online, the first thing it does is to
update itself so that all changes made to the backup database while the
primary database was down are made to the primary database.
 The advantage of using two database servers is that it enables the
database to remain available even if one of the servers fails.



Ensure Data Availability
 In a two-phase commit, a write operation is not
complete until both databases are updated successfully, so the speed of
the updates depends on the amount of data written, the speed of the
disks, the speed of the network between the two servers, and other
design factors.

 You can have both consistent data and a high-availability
database, but transactions will then take longer to execute.
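
A minimal sketch of the two-phase commit idea in Python. The `Server` class and `two_phase_write` function are hypothetical names used only for illustration; a real database also handles timeouts, logging, and recovery.

```python
class Server:
    """Hypothetical database server that can vote on and apply a write."""
    def __init__(self, name, healthy=True):
        self.name = name
        self.healthy = healthy
        self.data = {}
        self.pending = None

    def prepare(self, key, value):
        # Phase 1: stage the write and vote yes only if the server can commit.
        if not self.healthy:
            return False
        self.pending = (key, value)
        return True

    def commit(self):
        # Phase 2: make the staged write visible.
        key, value = self.pending
        self.data[key] = value
        self.pending = None

    def abort(self):
        # Discard any staged write.
        self.pending = None


def two_phase_write(servers, key, value):
    """Return True only if every server prepared and then committed."""
    if all(s.prepare(key, value) for s in servers):
        for s in servers:
            s.commit()
        return True
    for s in servers:
        s.abort()
    return False
```

The write waits for every server in both phases, which is why consistency plus availability costs response time.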



Availability and Consistency in Distributed Databases
 When two database servers must keep consistent copies of data, they
incur longer times to complete a transaction.
 This is acceptable in applications that require both consistency and high
availability at all times. Financial systems at a bank, for example, fall into
this category.
 There are applications in which fast database operations are more
important than maintaining consistency at all times, for example, an
e-commerce site.
 Imagine you are programming the user interface for an e-commerce site.
How long should the customer wait after clicking on an “Add to My Cart”
button? Ideally, the interface would respond immediately so the
customer could keep shopping.
 In this case, speed is more important than having consistent data at all
times.



Availability and Consistency in Distributed Databases
 One way to deal with this problem is to write the updates to one
database and then copy the data to another server. There is a brief
period of time when the customer’s cart on the two servers is not
consistent, but the customer is able to continue shopping anyway.
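
The write-then-copy approach can be sketched as follows. This is an illustration only: the two dictionaries stand in for the two servers, and `replicate_pending` would run in the background in a real system.

```python
from collections import deque

primary = {}                 # server that accepts writes
backup = {}                  # server that receives copies later
replication_log = deque()    # writes not yet copied to the backup

def add_to_cart(customer, item):
    """Write to the primary only and return immediately."""
    primary.setdefault(customer, []).append(item)
    replication_log.append((customer, item))

def replicate_pending():
    """Copy queued writes to the backup."""
    while replication_log:
        customer, item = replication_log.popleft()
        backup.setdefault(customer, []).append(item)
```

Between `add_to_cart` and `replicate_pending` the two copies differ, which is the brief inconsistent period described above.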



Balancing Response Times, Consistency, and Durability
 NoSQL databases often implement eventual consistency; that is, there
might be a period of time where copies of data have different values,
but eventually all copies will have the same value.
 This raises the possibility of a user querying the database and getting
different results from different servers in a cluster.
 NoSQL databases often use the concept of quorums when working
with reads and writes.
 A quorum is the number of servers that must respond to a read or write
operation for the operation to be considered complete.
 When a read is performed, the NoSQL database reads data from,
potentially, multiple servers. Most of the time, all of the servers will
have consistent data.
 While the database copies data from one of the servers to the other
servers storing replicas, the replica servers may have inconsistent data.



Balancing Response Times, Consistency, and Durability
 One way to determine the correct response to any read operation is to
query all servers storing that data. The database counts the number of
distinct response values and returns the one that meets or exceeds a
configurable threshold.
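
A small sketch of this counting rule, assuming the responses from all replicas have already been collected into a list. The function name `quorum_read` is made up for this example.

```python
from collections import Counter

def quorum_read(replica_values, threshold):
    """Return the value reported by at least `threshold` replicas, else None."""
    value, count = Counter(replica_values).most_common(1)[0]
    return value if count >= threshold else None
```

For example, with responses `[1500, 1500, 1000]` and a threshold of 2, the read returns 1500; if no value reaches the threshold, this sketch returns `None`.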



Balancing Response Times, Consistency, and Durability
 You can vary the threshold to improve response time or consistency. If
the read threshold is set to 1, you get a fast response. The lower the
threshold, the faster the response but the higher the risk of returning
inconsistent data.
 Just as you can adjust a read threshold to balance response time and
consistency, you can also alter a write threshold to balance response
time and durability.
 Durability is the property of maintaining a correct copy of data for
long periods of time.
 A write operation is considered complete when a minimum number of
replicas have been written to persistent storage.
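
The write threshold can be sketched in the same spirit. Here each replica is a plain dictionary standing in for persistent storage, and `None` simulates a server that is down; both are assumptions of this example.

```python
def write_with_threshold(replicas, key, value, threshold):
    """Write to replicas in turn; succeed once `threshold` have stored the value."""
    acknowledged = 0
    for replica in replicas:
        if replica is None:       # this server is unavailable
            continue
        replica[key] = value
        acknowledged += 1
        if acknowledged >= threshold:
            return True           # remaining replicas would be updated later
    return False
```

A lower threshold returns sooner but leaves fewer durable copies at the moment the write is reported complete.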



ACID and BASE
ACID: Atomicity, Consistency, Isolation, and Durability
 A is for atomicity. Atomicity, as the name implies, describes a unit that cannot
be further divided. The set of steps is indivisible. You have to complete all of
them as a single indivisible unit, or you complete none of them.
 C is for consistency. In relational databases, this is known as strict
consistency. In other words, a transaction does not leave a database in a state
that violates the integrity of data.
 I is for isolation. Isolated transactions are not visible to other users until
transactions are complete. For example, in the case of a bank transfer from a
savings account to a checking account, someone could not read your account
balances while the funds are being deducted from your savings account but
before they are added to your checking account.
 D is for durability. This means that once a transaction or operation is
completed, it will remain even in the event of a power loss. In effect, this means
that data is stored on disk, flash, or other persistent media.

Relational database management systems are designed to support ACID
transactions.
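
A minimal sketch of atomicity, using Python's standard-library sqlite3 module as the relational database. The table, account names, and amounts are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts (name TEXT PRIMARY KEY, "
    "balance INTEGER CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)",
                 [("savings", 100), ("checking", 0)])
conn.commit()

def transfer(amount):
    """Move `amount` from savings to checking as one all-or-nothing unit."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute("UPDATE accounts SET balance = balance + ? "
                         "WHERE name = 'checking'", (amount,))
            conn.execute("UPDATE accounts SET balance = balance - ? "
                         "WHERE name = 'savings'", (amount,))
        return True
    except sqlite3.IntegrityError:
        return False

def balances():
    """Return the current balances as a dictionary."""
    return dict(conn.execute("SELECT name, balance FROM accounts"))
```

If the second update would overdraw the savings account, the first update is rolled back as well, so the transfer either happens completely or not at all.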



ACID and BASE
BASE: Basically Available, Soft State, Eventually Consistent
 BA is for basically available. This means that there can be a partial
failure in some parts of the distributed system and the rest of the
system continues to function.
 For example, if a NoSQL database is running on 10 servers without
replicating data and one of the servers fails, then 10% of the users’ queries
would fail, but 90% would succeed.
 NoSQL databases often keep multiple copies of data on different servers.
This allows the database to respond to queries even if one of the servers has
failed.
 S is for soft state. Usually in computer science, the term soft state
means data will expire if it is not refreshed. In NoSQL operations, it
refers to the fact that data may eventually be overwritten with more
recent data. This property overlaps with the third property of BASE
transactions, eventually consistent.
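
The 10-server example can be checked with a few lines of Python. The sketch assumes users are spread evenly over the servers by user ID modulo the server count, which is an assumption of this example and not something the slide states.

```python
def query_success_rate(user_ids, server_count, failed_servers):
    """Fraction of users whose (unreplicated) data lives on a healthy server."""
    served = sum(1 for uid in user_ids
                 if uid % server_count not in failed_servers)
    return served / len(user_ids)
```

With 1,000 users, 10 servers, and one failed server, `query_success_rate(range(1000), 10, {3})` gives 0.9, matching the 90% figure above.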



ACID and BASE
 E is for eventually consistent. This means that there may be times
when the database is in an inconsistent state. For example, some
NoSQL databases keep multiple copies of data on multiple servers.
There is, however, a possibility that the multiple copies may not be
consistent for a short period of time.

 The time it takes to update all copies depends on several factors, such
as the load on the system and the speed of the network.



ACID and BASE
 Types of Eventual Consistency:
1. Causal consistency
2. Read-your-writes consistency
3. Session consistency
4. Monotonic read consistency
5. Monotonic write consistency



ACID and BASE
1. Causal consistency
 Causal consistency ensures that the database reflects the order in which
operations were performed.
 For example, if Alice changes a customer’s outstanding balance to $1,000 and
one minute later Bob changes it to $2,000, all copies of the customer’s
outstanding balance will be updated to $1,000 before they are updated to
$2,000.

2. Read-Your-Writes Consistency
 Once you have updated a record, all of your reads of that record will return the
updated value. You would never retrieve a value inconsistent with the value you
had written.
 For example, Alice updates a customer’s outstanding balance to $1,500. The
update is written to one server and the replication process begins updating
other copies. During the replication process, Alice queries the customer’s
balance. She is guaranteed to see $1,500 when the database supports
read-your-writes consistency.



ACID and BASE
3. Session Consistency
 Session consistency ensures read-your-writes consistency during a session. You
can think of a session as a conversation between a client and a server or a user
and the database.
 As long as the conversation continues, the database “remembers” all writes you
have done during the conversation.
 If the session ends and you start another session with the same server, there is
no guarantee it will “remember” the writes you made in the previous session.

4. Monotonic Read Consistency
 Monotonic read consistency ensures that if you issue a query and see a result,
you will never see an earlier version of the value.
 For example, Alice updates a customer’s balance from $1,500 to
$2,500. Once Bob has queried the database and seen $2,500, his later queries will never return $1,500,
even if not all the servers with copies of that customer’s balance have been updated
to the latest value.
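
One way to picture monotonic reads is a client that remembers the newest version it has seen. This is only a sketch: the `MonotonicReader` class and the (version, value) pairs are assumptions of the example, not how any particular database implements the guarantee.

```python
class MonotonicReader:
    """Client-side wrapper that never returns an older version than one already seen."""
    def __init__(self):
        self.last_version = -1
        self.last_value = None

    def read(self, replica):
        # Each replica is a (version, value) pair; a higher version is newer.
        version, value = replica
        if version >= self.last_version:
            self.last_version, self.last_value = version, value
        return self.last_value
```

If Bob first reads from an up-to-date replica `(2, 2500)` and then from a stale one `(1, 1500)`, both reads return 2500.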



ACID and BASE
5. Monotonic Write Consistency
 Monotonic write consistency ensures that if you were to issue several update
commands, they would be executed in the order you issued them.
 Let’s consider a variation on the outstanding balance example. Alice decides to
reduce all customers’ outstanding balances by 10%.
 Charlie, one of her customers, has a $1,000 outstanding balance. After the
reduction, Charlie would have a $900 balance.
 Charlie has just ordered $1,100 worth of material. His outstanding balance is
now the sum of the previous outstanding balance ($900) and the amount of the
new order ($1,100) or $2,000.



ACID and BASE
 5. Monotonic Write Consistency (cont…)
 Now consider what would happen if the NoSQL database performed Alice’s
operations in a different order.
 Charlie started with a $1,000 outstanding balance. Next, instead of having the
discount applied, his record was first updated with the new order ($1,100).
 His outstanding balance becomes $2,100. Now, the 10% discount operation is
executed and his outstanding balance is set to $2,100 − $210, or $1,890, instead of the intended $2,000.
 Monotonic write consistency is obviously an important feature. If you cannot
guarantee the order of operations in the database, you would have to build
features into your program to guarantee operations execute in the order you
expect.
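
The arithmetic of the two orderings, written out in Python. The function names are chosen for this example, and whole-dollar integer arithmetic is assumed.

```python
def apply_discount(balance):
    """Alice's 10% reduction, in whole dollars."""
    return balance * 90 // 100

def add_order(balance, amount=1100):
    """Charlie's new order."""
    return balance + amount

intended = add_order(apply_discount(1000))     # discount first, then order
reordered = apply_discount(add_order(1000))    # order first, then discount
```

Here `intended` is 2000 and `reordered` is 1890, so the order of the two writes changes the final balance by $110.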



Brewer’s CAP Theorem
A distributed system can guarantee at most two of the following
characteristics at the same time:
 Consistency
 Availability
 Partition tolerance
Conjectured by Eric Brewer (2000); proven by Seth Gilbert and Nancy Lynch of MIT (2002).

Consistency
 Consistency: all clients should read the same data.
 There are many levels of consistency.
 Strict Consistency – RDBMS.
 Tunable Consistency – Cassandra.
 Eventual Consistency – Amazon Dynamo.
 The client perceives that a set of operations has occurred all at
once – Pritchett
 This is more like atomicity in the ACID transaction properties

Consistency
 Strict consistency is the strongest consistency model. Under this model, a
write to a variable by any processor needs to be seen instantaneously by all
processors.
 Tunable consistency means that you can set the consistency level for each
read and write request. So, Cassandra gives you a lot of control over how
consistent your data is.
 You can allow some queries to be immediately consistent and other queries
to be eventually consistent.
 That means that, in your application, you can write queries that demand immediate
consistency for the data that requires it, and for data that does not,
you can optimize for performance
and choose eventual consistency.
 Eventual consistency is a guarantee that when an update is made in a
distributed database, that update will eventually be reflected in all nodes that
store the data, so that every query eventually returns the same response.
The data will become consistent at some point in time,
with no guarantee of when.



Consistency - Example
 Consistent databases should be used when the value of the
information returned needs to be accurate.
 Financial data is a good example. When a user logs in to their banking
application, they do not want to see an error that no data is returned,
or that the value is higher or lower than it actually is. Banking apps
should return the exact value of a user’s account information. In this
case, banks would rely on consistent databases.
 Examples of data that call for a consistent database include:
 Bank account balances
 Text messages
 Database options for consistency:
 MongoDB
 Redis
 HBase



Availability
 Availability: data must be available whenever it is requested.
 Node failures do not prevent surviving nodes from continuing to operate
 Every operation must terminate in an intended response

Availability - Example
 Highly available databases should be used when the service is more
important than the information.
 An example of a highly available database can be seen in
e-commerce businesses. Online stores want to make their store and the
functions of the shopping cart available 24/7 so shoppers can make
purchases exactly when they need to.
 Database options for availability:
 Cassandra
 DynamoDB
 Cosmos DB



Partition Tolerance
 A partition is a communications break within a distributed system—
a lost or temporarily delayed connection between two nodes.
 Partition tolerance means that the cluster must continue to work
despite any number of communication breakdowns between nodes in
the system.
 Partition tolerance: the system keeps working when network failures split
the nodes into separate network segments.
 The system continues to operate despite arbitrary message loss.
 Operations will complete, even if individual components are
unavailable

Partition Tolerance
 In normal operations, your data store provides all three functions.
 But the CAP theorem maintains that when a distributed database
experiences a network failure, you can provide either consistency or
availability.
 All other times, all three can be provided. But, in the event of a network
failure, a choice must be made.
 In the theorem, partition tolerance is a must. The assumption is that the
system runs on a distributed data store and therefore has to operate despite network
partitions.
 Network failures will happen, so to offer any kind of reliable service,
partition tolerance is necessary—the P of CAP. That leaves a decision
between the other two, C and A.
 When a network failure happens, one can choose to guarantee
consistency or availability:
 High consistency comes at the cost of lower availability.
 High availability comes at the cost of lower consistency.
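
The choice during a partition can be sketched as a single read function. The mode labels "CP" and "AP" and the returned status strings are assumptions of this example.

```python
def read(value, partitioned, mode):
    """Answer a read on a replica holding `value`.

    mode "CP" favors consistency, mode "AP" favors availability.
    """
    if partitioned and mode == "CP":
        return ("unavailable", None)   # refuse rather than risk stale data
    return ("ok", value)               # answer, even though value may be stale
```

Without a partition both modes answer normally; the two modes differ only while the replica is cut off from the others.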



Example
 In the e-commerce shopping cart example, it is possible to have a backup
copy of the cart data that is out of sync with the primary copy. The data
would still be available if the primary server failed, but the data on the
backup server would be inconsistent with data on the primary server if
the primary server failed prior to updating the backup server.



CAP theorem NoSQL database types



CAP Theorem



Replication and Sharding

 Distribution Models-
 Single Server,
 Sharding,
 Master-Slave Replication,
 Peer-to-Peer Replication,
 Combining Sharding and Replication
Data Distribution
➢ NoSQL systems: data distributed over large clusters
➢ As data volumes increase, it becomes more difficult and expensive to scale
up, that is, to buy a bigger server to run the database on.
➢ A more appealing option is to scale out, that is, to run the database on a cluster of
servers.
➢ An aggregate is a natural unit to use for data distribution
➢ Depending on the distribution model, the data store can give us the ability:
1. To handle larger quantities of data,
2. To process greater read or write traffic
3. To have more availability in the case of network slowdowns or breakages
Data Distribution
➢ Data distribution models:
  ➢ Single server (is an option for some applications)
  ➢ Multiple servers
➢ Orthogonal aspects of data distribution – two paths:
  ➢ Replication: the same data copied over multiple nodes
    ➢ master-slave and
    ➢ peer-to-peer
  ➢ Sharding: different data on different nodes

Single Server
 The first and the simplest distribution option is the one we
would most often recommend—no distribution at all.
 Run the database on a single machine that handles all the
reads and writes to the data store.
 It eliminates all the complexities.
 It’s easy for operations people to manage and easy for
application developers to handle.
 Graph databases are the obvious category here—these work
best in a single-server configuration.
 If your data usage is mostly about processing aggregates, then
a single-server document or key-value store may well be
worthwhile because it’s easier on application developers.
Sharding
➢ Often different people access different parts of the dataset, so we can put
different parts of the data onto different servers, a technique called sharding.
➢ This allows larger datasets to be split into smaller chunks and stored on
multiple data nodes, increasing the total storage capacity of the system.
➢ Horizontal scalability: additional nodes are brought on to share the load.
Horizontal scaling allows near-limitless scalability for handling big data
and intense workloads.
➢ Ideal case: different users all talking to different server nodes. Each user only has
to talk to one server, so gets rapid responses from that server.
➢ Data accessed together should be kept on the same node, which makes the aggregate the unit of distribution.
➢ Sharding puts different data on separate nodes,
each of which does its own reads and writes.


Sharding - Improving performance
Main rules of sharding:
1. Place the data close to where it’s accessed
➢ Orders for Boston: data in your eastern US data center
2. Try to keep the load even
➢ All nodes should get equal amounts of the load
3. Put together data that may be read in sequence
➢ Same order, same node

➢ Many NoSQL databases offer auto-sharding
➢ The database takes on the responsibility of allocating data to shards
and ensuring that data access goes to the right shard. This can
make it much easier to use sharding in an application.
Sharding
 Advantages:
 Sharding is particularly valuable for performance
because it can improve both read and write
performance.
 Using replication, particularly with caching, can
greatly improve read performance but does little
for applications that have a lot of writes.
 Sharding provides a way to horizontally scale
writes.
 Sharding is made much easier with aggregates.
Sharding
 Advantages:
Sharding allows you to scale your database to handle increased load to a nearly
unlimited degree by providing increased read/write throughput, storage
capacity, and high availability.
 Increased Read/Write Throughput — By distributing the dataset across
multiple shards, both read and write operation capacity is increased as long
as read and write operations are confined to a single shard.
 Increased Storage Capacity — Similarly, by increasing the number of
shards, you can also increase overall total storage capacity, allowing near-
infinite scalability.
 High Availability — Finally, shards provide high availability in two ways.
First, since each shard is a replica set, every piece of data is replicated.
Second, even if an entire shard becomes unavailable since the data is
distributed, the database as a whole still remains partially functional, with
part of the schema on different shards.
Sharding
 Disadvantages:
 Although the data is on different nodes, a node failure makes that
shard’s data unavailable just as surely as it does for a single-server
solution.
 The resilience benefit it does provide is that only the users of the
data on that shard will suffer;
 It’s not good to have a database with part of its data missing.

 With a single server it’s easier to justify the effort and cost of keeping that
server up and running;
 Clusters usually try to use less reliable machines, and you’re more
likely to get a node failure.
 So in practice, sharding alone is likely to decrease resilience
(the ability to recover quickly from failure).
Sharding
 Disadvantages:
 Sharding does come with several drawbacks, namely overhead in
query result compilation, complexity of
administration, and increased infrastructure costs.
 Query Overhead — Each sharded database must have a separate machine or
service which understands how to route a querying operation to the appropriate
shard. This introduces additional latency on every operation. Furthermore, if the
data required for the query is horizontally partitioned across multiple shards, the
router must then query each shard and merge the result together. This can make an
otherwise simple operation quite expensive and slow down response times.
 Complexity of Administration — With a single unsharded database, only the
database server itself requires upkeep and maintenance. With every sharded
database, on top of managing the shards themselves, there are additional service
nodes to maintain. Plus, in cases where replication is being used, any data updates
must be mirrored across each replicated node. Overall, a sharded database is a
more complex system which requires more administration.
 Increased Infrastructure Costs — Sharding by its nature requires additional
machines and compute power over a single database server. While this allows your
database to grow beyond the limits of a single machine, each additional shard
comes with higher costs. The cost of a distributed database system, especially if it is
missing the proper optimization, can be significant.
Sharding Architectures and Types

Ranged/Dynamic Sharding

Algorithmic/Hashed Sharding

Entity-/Relationship-Based Sharding

Geography-Based Sharding
1. Ranged/Dynamic Sharding

 Ranged sharding, or dynamic sharding, takes a field on the record as an
input and, based on a predefined range, allocates that record to the
appropriate shard.
 Ranged sharding requires there to be a lookup table or service available for
all queries or writes.
 For example, consider a set of data with IDs that range from 0-50. A simple
lookup table might look like the following:

Range | Shard ID
[0, 20) | A
[20, 40) | B
[40, 50] | C

 The field on which the range is based is also known as the shard key.
 A poor choice of shard key will lead to unbalanced
shards, which leads to decreased performance.
 An effective shard key will allow for queries to be
targeted to a minimum number of shards.
 In this example, if we query for all records with IDs 10-30,
then only shards A and B will need to be queried.
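
A minimal Python sketch of the lookup table from this example. The function names are made up; the last range is stored with an exclusive upper bound of 51 so that ID 50 is included, as in [40, 50].

```python
# Lookup table from the example: (low, high, shard), with high exclusive.
RANGES = [(0, 20, "A"), (20, 40, "B"), (40, 51, "C")]

def shard_for(record_id):
    """Return the shard that stores the record with this ID."""
    for low, high, shard in RANGES:
        if low <= record_id < high:
            return shard
    raise KeyError(f"no shard covers id {record_id}")

def shards_for_query(first_id, last_id):
    """Shards that must be queried for IDs first_id..last_id inclusive."""
    return sorted({shard_for(i) for i in range(first_id, last_id + 1)})
```

`shards_for_query(10, 30)` returns `["A", "B"]`, so shard C is never contacted for that query.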
2. Algorithmic/Hashed Sharding
 Algorithmic sharding or hashed sharding, takes a record as an input and
applies a hash function or algorithm to it which generates an output or hash
value. This output is then used to allocate each record to the appropriate
shard.
 The function can take any subset of values on the record as inputs.
 The simplest example of a hash function is to use the modulus operator with
the number of shards, as follows:
Hash Value=ID % Number of Shards
 Hashing the inputs allows more even distribution across shards even when
there is not a suitable shard key, and no lookup table needs to be
maintained.
 Drawbacks:
 First, query operations for multiple records are more likely to get distributed across multiple shards. This shows up as more frequent broadcast operations.
 Second, resharding can be expensive. Any change to the number of shards is likely to require rebalancing all shards, which means moving records around.
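The resharding cost can be made concrete with a small JavaScript sketch of the modulus formula above. The helper names `shardFor` and `countMoved` are hypothetical, and the numbers (1000 records, growing from 4 to 5 shards) are chosen only for illustration.

```javascript
// Modulus-based shard selection: Hash Value = ID % Number of Shards.
function shardFor(id, numberOfShards) {
  return id % numberOfShards;
}

// Count how many of the IDs 0..n-1 land on a different shard
// when the shard count changes from `before` to `after`.
function countMoved(n, before, after) {
  let moved = 0;
  for (let id = 0; id < n; id++) {
    if (shardFor(id, before) !== shardFor(id, after)) moved++;
  }
  return moved;
}

console.log(shardFor(17, 4));        // 1
console.log(countMoved(1000, 4, 5)); // 800
```

Adding a single shard moves 800 of the 1000 records under this simple scheme, which is why resharding with a plain modulus hash is considered expensive.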
3. Entity-/Relationship-Based Sharding

 Entity-based sharding keeps related data together on a single physical shard.
 In a relational database (such as PostgreSQL, MySQL, or SQL
Server), related data is often spread across several different tables.
 For instance, consider the case of a shopping database with Users
and Payment Methods.
 Each user has a set of payment methods that is tied tightly with
that user.
 As such, keeping related data together on the same shard can
reduce the need for broadcast operations, increasing performance.
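One way to picture this co-location, as a rough JavaScript sketch: route both users and their payment methods by the owning user's ID. The shard count, the `shardForUser` helper, and the sample records are all assumptions made for this illustration.

```javascript
const NUMBER_OF_SHARDS = 3; // hypothetical cluster size

// Both entity types are routed by the owning user's ID, not by their own ID.
function shardForUser(userId) {
  return userId % NUMBER_OF_SHARDS;
}

// Made-up sample data.
const user = { userId: 42, name: "Alice" };
const paymentMethods = [
  { paymentId: 900, userId: 42, type: "card" },
  { paymentId: 901, userId: 42, type: "bank" },
];

const userShard = shardForUser(user.userId);
const paymentShards = paymentMethods.map((p) => shardForUser(p.userId));

console.log(userShard);     // 0
console.log(paymentShards); // [ 0, 0 ]
```

Because every payment method lands on the same shard as its user, a query for a user and their payment methods needs only one shard.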
4. Geography-Based Sharding

 Geography-based sharding, or geosharding, also keeps related data together on a single shard, but in this case, the data is related by geography.
 This is essentially ranged sharding where the shard key contains geographic information and the shards themselves are geo-located.
 For example, consider a dataset where each record contains a
“country” field.
 In this case, we can both increase overall performance and
decrease system latency by creating a shard for each country or
region and storing the appropriate data on that shard.
Replication

 Replication: Replication copies data across multiple servers, so each bit of data can be found in multiple places. Replication comes in two forms:
 Master-slave replication makes one node the
authoritative copy that handles writes while slaves
synchronize with the master and may handle reads.
 Peer-to-peer replication allows writes to any node;
the nodes coordinate to synchronize their copies of the
data.
Master-Slave Replication
With master-slave distribution, you replicate data across multiple nodes.
Master: One node is designated as the master, or primary. The master
➢ is the authoritative source for the data
➢ is responsible for processing any updates to that data
➢ can be appointed manually or automatically
Slaves: The other nodes are slaves, or secondaries.
➢ A replication process synchronizes the slaves with the master
➢ After a failure of the master, a slave can be appointed as new master very quickly

Master-Slave Replication
 Advantages:
 Most helpful for scaling when you have a read-intensive dataset.
 More read requests can be handled by:
 Adding more slave nodes
 Ensuring that all read requests are routed to the slaves
 Should the master fail, the slaves can still handle read requests.
 The ability to appoint a slave to replace a failed master means that master-slave replication is useful even if you don’t need to scale out.
 Good for read-intensive datasets
Master-Slave Replication
➢ Disadvantages:
➢ The master is a bottleneck
➢ Limited by its ability to process updates and to pass those updates on to the slaves.
➢ Its failure eliminates the ability to handle writes until:
➢ the master is restored or
➢ a new master is appointed
➢ Inconsistency due to slow propagation of changes to the slaves
➢ Bad for datasets with heavy write traffic
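The trade-offs listed above can be seen in a toy in-memory model. This JavaScript sketch is an assumption-laden illustration, not a real replication protocol: the class `MasterSlaveCluster` and its methods are invented here, and replication is triggered by hand to make the delay visible.

```javascript
// One master accepts writes; slaves receive copies and serve reads.
class MasterSlaveCluster {
  constructor(slaveCount) {
    this.master = new Map();
    this.slaves = Array.from({ length: slaveCount }, () => new Map());
    this.pending = [];  // updates not yet propagated to the slaves
    this.nextSlave = 0; // round-robin pointer for reads
  }

  // All writes go to the master only.
  write(key, value) {
    this.master.set(key, value);
    this.pending.push([key, value]);
  }

  // Reads are routed to the slaves in round-robin order.
  read(key) {
    const slave = this.slaves[this.nextSlave];
    this.nextSlave = (this.nextSlave + 1) % this.slaves.length;
    return slave.get(key);
  }

  // The replication process: push pending updates to every slave.
  replicate() {
    for (const [key, value] of this.pending) {
      for (const slave of this.slaves) slave.set(key, value);
    }
    this.pending = [];
  }
}

const cluster = new MasterSlaveCluster(2);
cluster.write("stock:1298747", 10);
console.log(cluster.read("stock:1298747")); // undefined (change not yet propagated)
cluster.replicate();
console.log(cluster.read("stock:1298747")); // 10
```

The first read returns a stale result because the slaves have not yet caught up, which is the inconsistency caused by slow propagation; every write still funnels through the single master, which is the bottleneck.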
Replication in MongoDB
 MongoDB achieves replication through the use of replica sets.
 A replica set is a group of mongod instances that host the same data set.
 In a replica set, one node is the primary node, which receives all write operations. All other instances, the secondaries, apply operations from the primary so that they have the same data set.
 The secondaries copy data from the primary and are typically read-only unless one of them is elected to be the new primary.
 A replica set can have up to 50 members, of which at most 7 can be voting members.
Peer-to-Peer Replication
 Master-slave replication helps with read
scalability but doesn’t help with
scalability of writes.
 It provides resilience against failure of a
slave, but not of a master.
 Essentially, the master is still a
bottleneck and a single point of failure.
 Peer-to-peer replication addresses these problems by not having a master.
 All the replicas have equal weight, they can all
accept writes
 The loss of any of them doesn’t prevent access
to the data store
Peer-to-Peer Replication
 Peer-to-peer replication is built on the concept of transactional replication, which propagates consistent transactional data.
 Peer-to-peer replication involves multiple servers, which are called nodes.
 Below are the terms used in replication:
 Publisher: The source database, which contains the data to replicate.
 Subscriber: The destination database; there may be many subscribers for a single publisher.
 Distributor: Distributes the transactions to the subscriber database.
 In peer-to-peer replication, each node acts as both a publisher and a subscriber: it sends transactions to and receives transactions from the other nodes, so data is synchronized across the nodes.
2 Node Peer To Peer Replication
 In this architecture, the application server sends requests to both database servers.
 Nodes A and B replicate their data to each other.
 If any DML statement is executed on node A, it is then replicated to node B, and vice versa.
 Because replication is bi-directional, each database is both a publisher and a subscriber. The databases are therefore called nodes, and the replication is called peer-to-peer replication.
 If either node goes down, the application will still be functional.
 Later, once the node is back up, it can be brought in sync again. This way we can achieve both high availability and fault tolerance.
3 Node Peer To Peer Replication

 In the three-node architecture, each node replicates transactions to the other nodes, and all nodes hold the same copy of the data.
Peer-to-Peer Replication
Advantages:
• You can ride over node failures without losing access to data
• You can easily add nodes to improve your performance
Disadvantages: Inconsistency
• Slow propagation of changes to copies on different nodes
• Inconsistencies on read lead to problems but are relatively transient
• Two people can update different copies of the same record stored on different nodes at the same time - a write-write conflict.
• Inconsistent writes are forever.
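A write-write conflict can be illustrated with a deliberately simplified JavaScript model. Everything here is an assumption for teaching purposes: the `PeerNode` class, the per-key version counter, and the `compareCopies` helper are invented, and real systems typically use richer mechanisms than a single counter.

```javascript
// Each node stores { value, version } per key and accepts writes locally.
class PeerNode {
  constructor(name) {
    this.name = name;
    this.data = new Map();
  }

  write(key, value) {
    const current = this.data.get(key);
    const version = current ? current.version + 1 : 1;
    this.data.set(key, { value, version });
  }
}

// Compare two peers' copies of a key before synchronizing them.
// Equal versions with different values mean both peers updated the
// same starting copy independently: a write-write conflict.
function compareCopies(a, b, key) {
  const x = a.data.get(key);
  const y = b.data.get(key);
  if (x.version === y.version) {
    return x.value === y.value ? "in sync" : "write-write conflict";
  }
  return x.version > y.version ? `${a.name} is newer` : `${b.name} is newer`;
}

const nodeA = new PeerNode("A");
const nodeB = new PeerNode("B");
nodeA.write("room:101", "free");
nodeB.write("room:101", "free");          // both peers start in sync
nodeA.write("room:101", "booked by Ann");
console.log(compareCopies(nodeA, nodeB, "room:101")); // A is newer
nodeB.write("room:101", "booked by Bob"); // B has not yet seen A's update
console.log(compareCopies(nodeA, nodeB, "room:101")); // write-write conflict
```

Neither copy can simply overwrite the other without losing one person's update, which is why inconsistent writes are harder to repair than stale reads.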
Combining Sharding and Replication

 Master-slave replication and sharding are strategies that can be combined.
 If we use both master-slave replication and sharding, we have multiple masters, but each data item has only a single master.
 Depending on your configuration, you may choose a node to be a master for some data and a slave for other data, or you may dedicate nodes to master or slave duties.
Combining Sharding and Replication

 Using peer-to-peer replication and sharding is a common strategy for column-family databases.
 You might have tens or hundreds of nodes in a cluster with data sharded over them.
 A good starting point for peer-to-peer replication is a replication factor of 3, so each shard is present on three nodes.
 Should a node fail, the shards on that node will be rebuilt on the other nodes.
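As a rough sketch of what a replication factor of 3 means for shard placement, the JavaScript below puts each shard on three consecutive nodes. The placement rule and the names `placeShards` and `shardsOnNode` are assumptions for illustration; real column-family databases use their own placement strategies. The sketch assumes the node count is at least the replication factor.

```javascript
// Place each shard on `replicationFactor` consecutive nodes (wrapping around).
function placeShards(shardCount, nodeCount, replicationFactor) {
  const placement = new Map();
  for (let shard = 0; shard < shardCount; shard++) {
    const nodes = [];
    for (let i = 0; i < replicationFactor; i++) {
      nodes.push((shard + i) % nodeCount);
    }
    placement.set(shard, nodes);
  }
  return placement;
}

// List the shards that lose a copy when one node fails.
function shardsOnNode(placement, node) {
  return [...placement.entries()]
    .filter(([, nodes]) => nodes.includes(node))
    .map(([shard]) => shard);
}

const placement = placeShards(6, 5, 3);
console.log(placement.get(4));           // [ 4, 0, 1 ]
console.log(shardsOnNode(placement, 0)); // [ 0, 3, 4, 5 ]
```

If node 0 fails, shards 0, 3, 4, and 5 each still have two live copies, and those are the shards that would be rebuilt elsewhere.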
Types of NoSQL Databases
 The most widely used types of NoSQL databases are
 Key-value pair databases
 Document databases
 Column family store databases
 Graph databases
 Every category has its unique attributes and limitations. Users should
select the database based on their product needs.



Key Value Pair Based
 Data is stored in key/value pairs.
 It is designed to handle large amounts of data and heavy load.
 Key-value pair storage databases store data as a hash table where each key is unique, and the value can be JSON, a BLOB (Binary Large Object), a string, etc.
 It is one of the most basic types of NoSQL database.
 This kind of NoSQL database is used like collections, dictionaries, associative arrays, etc.
 Key-value stores help the developer to store schema-less data. They work well for shopping cart contents.
 Redis, Dynamo, and Riak are some examples of key-value store databases.
 Of these, Dynamo and Riak follow the design described in Amazon's Dynamo paper.


Key-Value Pair Databases
 Key-value databases are modeled on a simple, two-part data structure
consisting of an identifier and a data value.
 The important principle to remember about keys is that they must be
unique.



Key-value store Representatives
[Figure: logos of representative key-value stores, including MemcachedDB and Project Voldemort; captions mark one product as not open-source and another as an open-source version]


Column-based
 Column-oriented databases work on columns and are based on the BigTable paper by Google.
 Every column is treated separately.
 Values of a single column are stored contiguously.
 They deliver high performance on aggregation queries like SUM, COUNT, AVG, MIN, etc., as the data is readily available in a column.
 Column-based NoSQL databases are widely used to manage data warehouses, business intelligence, CRM, and library card catalogs.
 HBase, Cassandra, and Hypertable are examples of column-based databases.




Column-Family Stores Representatives

Google’s BigTable


Document-Oriented
 A document-oriented NoSQL DB stores and retrieves data as a key-value pair, but the value part is stored as a document.
 The document is stored in JSON or XML format. The value is understood by the DB and can be queried.
 The document type is mostly used for CMS systems, blogging platforms, real-time analytics, and e-commerce applications.
 It should not be used for complex transactions that require multiple operations or queries against varying aggregate structures.
 Amazon SimpleDB, CouchDB, MongoDB, Riak, and Lotus Notes are popular document-oriented DBMSs.
Document Databases
 Documents
 Instead of storing each attribute of an entity with a separate key,
document databases store multiple attributes in a single document.
 One of the most important characteristics of document databases is you
do not have to define a fixed schema before you add data to the
database.



Document Databases Representatives

Lotus Notes
Storage Facility



Graph-Based
 A graph database stores entities as well as the relations among those entities.
 An entity is stored as a node and a relationship as an edge. An edge gives a relationship between nodes.
 Every node and edge has a unique identifier.
 Compared to a relational database, where tables are loosely connected, a graph database is multi-relational in nature.
 Traversing relationships is fast, as they are already captured in the DB and there is no need to calculate them.
 Graph databases are mostly used for social networks, logistics, and spatial data.
 Neo4J, Infinite Graph, OrientDB, and FlockDB are some popular graph-based databases.
Graph Databases Representatives

FlockDB



Document Database



Unit 4: NoSQL Database Part 1
 Document Databases with MongoDB
 What Is a Document
 Differences Between Document and Relational Databases
 Managing Multiple Documents in Collections
 Basic Operations on Document Databases
 • Inserting • Deleting • Updating • Retrieving
 Document Database with MongoDB



Document Databases: A
Comprehensive Guide
Document databases are becoming
increasingly popular for modern
applications due to their flexibility,
scalability, and ease of use. They are
particularly well-suited for handling
unstructured or semi-structured data,
which is common in web applications,
mobile apps, and other data-intensive
systems.

by deepali Jadhav

Document Databases
 Document databases, also called document-oriented databases, use a key-value approach to store data but with important differences from key-value databases. A document database stores values as documents.
 Documents are semi-structured entities, typically in a standard format
such as JavaScript Object Notation (JSON) or Extensible Markup
Language (XML).
 When the term document is used in this context, it refers to data
structures that are stored as strings or binary representations of strings.





Document Databases
 Simply adding a document to the database creates the underlying data
structures needed to support the document.
 The lack of a fixed schema gives developers more flexibility with
document databases than they have with relational databases.
 For example, employees can have different attributes than the ones
listed above. Another valid employee document is



Document Databases
 Querying Documents
 Document databases provide application programming interfaces
(APIs) or query languages that enable you to retrieve documents based
on attribute values.
 For example, if you have a database with a collection of employee
documents called “employees,” you could use a statement such as the
following to return the set of all employees with the position Manager:
db.employees.find( { position: "Manager" } )


Differences Between Document and Relational Databases
 A key distinction between document and relational databases is that
document databases do not require a fixed, predefined schema.
 Another important difference is that documents can have embedded
documents and lists of multiple values within a document.



Document Databases vs. Relational Databases
Document Databases
• Data stored in JSON-like documents
• Flexible schema, allowing for dynamic data structures
• Optimized for fast read and write operations
• Well-suited for handling unstructured or semi-structured data
Relational Databases
• Data organized into tables with rows and columns
• Strict schema, enforcing data consistency
• Strong support for transactional integrity and data relationships
• Suitable for structured data with clear relationships
Benefits of Document Databases
Flexibility
Document databases offer a flexible schema, allowing you to
adapt to changing data requirements without rigid constraints.
You can easily add new fields or modify existing ones without
impacting the structure of your data.
Scalability
These databases are designed to handle large volumes of data,
making them ideal for applications with rapidly growing data
sets. They can scale horizontally by adding more servers to the
cluster, ensuring consistent performance even under heavy
load.
Performance
Document databases prioritize read and write operations,
offering fast and efficient data access. This makes them well-
suited for applications that require real-time updates and low
latency, such as e-commerce platforms and social media
applications.

Code Example: Illustrating Document Structures
Let's examine a practical example of a customer collection to understand how document structures can vary within a collection.
Customer Collection Example
{
  "customerId": "12345",
  "firstName": "John",
  "lastName": "Doe",
  "email": "[email protected]",
  "address": {
    "street": "123 Main Street",
    "city": "Anytown",
    "state": "CA",
    "zip": "91234"
  },
  "orders": [
    { "orderId": "A123", "date": "2023-03-15" },
    { "orderId": "B456", "date": "2023-04-20" }
  ]
}
Explanation
• Each document in the collection represents a customer, identified by a unique `customerId`.
• The document structure includes basic information like `firstName`, `lastName`, and `email`.
• An embedded `address` object provides detailed address information.
• The `orders` array stores information about the customer's previous orders, demonstrating a one-to-many relationship.
Basic Operations on Document Databases
 Inserting
 Deleting
 Updating
 Retrieving

db.createCollection("books")



Inserting Documents into a Collection
 The insert method adds documents to a collection. For example, the
following adds a single document describing a book by Kurt Vonnegut
to the books collection:
db.books.insert( {"title": "Mother Night", "author": "Kurt Vonnegut, Jr."} )
 Instead of simply adding a book document with the title and author name, a better option would be to include a unique identifier, as in
db.books.insert( {"book_id": 1298747,
"title": "Mother Night",
"author": "Kurt Vonnegut, Jr."} )


Inserting Documents into a Collection
 In many cases, it is more efficient to perform a bulk insert instead of a series of individual inserts. For example, the following single insert command adds three books to the books collection:
db.books.insert(
[
{ "book_id": 1298747,
"title": "Mother Night",
"author": "Kurt Vonnegut, Jr."},
{ "book_id": 639397,
"title": "Science and the Modern World",
"author": "Alfred North Whitehead"},
{ "book_id": 1456701,
"title": "Foundation and Empire",
"author": "Isaac Asimov"}
]
)
 The [ and ] in the parameter list delimit an array of documents to insert.


Deleting Documents from a Collection
 You can delete documents from a collection using the remove method. The following command deletes all documents in the books collection:
db.books.remove({})
 Note that the collection still exists, but it is empty.
 To delete a single document, you can specify a query document that matches the document you would like to delete.
 To delete the book titled Science and the Modern World, you would issue the following command:
db.books.remove({"book_id": 639397})
 Executing the following remove command will delete all books by Kurt Vonnegut, Jr.:
db.books.remove({"author": "Kurt Vonnegut, Jr."})


Updating Documents in a Collection
 Once a document has been inserted into a collection, it can be modified with the update method. The update method requires two parameters:
 Document query
 Set of keys and values to update
 MongoDB uses the $set operator to specify which keys and values should be updated. For example, the following command adds the quantity key with the value of 10 to the matching document:
db.books.update( {"book_id": 1298747},
{$set: {"quantity": 10}} )
The full document would then be
{"book_id": 1298747,
"title": "Mother Night",
"author": "Kurt Vonnegut, Jr.",
"quantity": 10}


Updating Documents in a Collection
 The update command adds a key if it does not exist and sets the value
as indicated. If the key already exists, the update command changes
the value associated with it.
 Document databases sometimes provide other operators in addition to
set commands.
 For example, MongoDB has an increment operator ($inc), which is
used to increase the value of a key by the specified amount.
 The following command would change the quantity of Mother Night
from 10 to 15:
db.books.update( {"book_id": 1298747},
{$inc: {"quantity": 5}} )


Retrieving Documents from a Collection
 The find method is used to retrieve documents from a collection.
 The find method takes an optional query document that specifies
which documents to return.
 The following command matches all documents in a collection:
db.books.find()
 This is useful if you want to perform an operation on all documents in a
collection.
 The following returns all books by Kurt Vonnegut, Jr.:
db.books.find({"author": "Kurt Vonnegut, Jr."})
 These two find examples both return all the keys and values in the documents.
 There are times when it is not necessary to return all key-value pairs.
 In those cases, you can specify an optional second argument that is a list of keys to return, each with a "1" to indicate the key should be returned.
db.books.find({"author": "Kurt Vonnegut, Jr."}, {"title": 1})


Retrieving Documents from a Collection
 To retrieve all books with a quantity greater than or equal to 10 and less than 50, you could use the following command:
db.books.find( {"quantity": {"$gte": 10, "$lt": 50}} )
 The conditionals and Booleans supported in MongoDB include the following:
 $lt - Less than
 $lte - Less than or equal to
 $gt - Greater than
 $gte - Greater than or equal to
 $in - Query for values of a single key
 $or - Query for values in multiple keys
 $not - Negation


What is MongoDB?
 Definition: MongoDB is an open source, document-oriented
database designed with both scalability and developer agility in mind.
 Instead of storing your data in tables and rows as you would with a
relational database, in MongoDB you store JSON-like documents
with dynamic schemas (schema-free, schemaless).



What is MongoDB? (Cont’d)
 Document-Oriented DB
 The unit object is a document instead of a row (tuple) as in relational DBs

> db.user.findOne({age:39})
{
"_id" : ObjectId("5114e0bd42…"),
"first" : "John",
"last" : "Doe",
"age" : 39,
"interests" : [
"Reading",
"Mountain Biking" ],
"favorites": {
"color": "Blue",
"sport": "Soccer"}
}


Is It Fast?
 For semi-structured & complex relationships: Yes



Difference between RDBMS and MongoDB
RDBMS | MongoDB
It is a relational database. | It is a non-relational and document-oriented database.
Not suitable for hierarchical data storage. | Suitable for hierarchical data storage.
It is vertically scalable, i.e., by increasing RAM. | It is horizontally scalable, i.e., we can add more servers.
It has a predefined schema. | It has a dynamic schema.
It is quite vulnerable to SQL injection. | It is not affected by SQL injection.
It centers around ACID properties (Atomicity, Consistency, Isolation, and Durability). | It centers around the CAP theorem (Consistency, Availability, and Partition tolerance).
It is row-based. | It is document-based.
It is often described as slower than MongoDB for simple reads and writes. | It is often described as much faster than an RDBMS, although the difference depends heavily on the workload.
Supports complex joins. | No support for complex joins.
It is column-based. | It is field-based.
It does not provide a JavaScript client for querying. | It provides a JavaScript client for querying.
It supports the SQL query language only. | It uses a JSON-like query language instead of SQL.


JSON
[Figure: one document with its field names and field values labeled]
 A field value can be:
 A scalar (Int, Boolean, String, Date, …)
 A document (embedding or nesting)
 An array of JSON objects


Another Example
Remember that documents are stored in a binary format (BSON)


MongoDB Model
[Figure: one document (e.g., one tuple in RDBMS) inside one collection (e.g., one table in RDBMS)]
• A collection is a group of similar documents
• Within a collection, each document must have a unique Id
Unlike RDBMS: MongoDB does not enforce referential integrity constraints (such as foreign keys)


MongoDB Model
• Field names cannot start with the $ character
• Field names cannot contain the . character
• Max size of a single document is 16MB


Example Document in MongoDB
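• _id is a special field in each document
• Unique within each collection
• _id ➔ Primary Key in RDBMS
• The default _id is a 12-byte ObjectId; you can also set _id yourself
• Layout of the 12 bytes:
• 1st 4 bytes ➔ timestamp
• Next 3 bytes ➔ machine id
• Next 2 bytes ➔ process id
• Last 3 bytes ➔ incremental value
• (Newer MongoDB versions replace the machine id and process id with a 5-byte random value.)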
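The byte layout can be explored without a running database. The JavaScript sketch below treats an ObjectId as its 24-character hexadecimal string and pulls out the leading 4-byte timestamp and the trailing 3-byte counter. The function names and the sample ID are made up for this illustration; in the MongoDB shell you would normally ask the ObjectId itself for its timestamp.

```javascript
// Read the creation time from the first 4 bytes (8 hex digits) of an ObjectId.
function objectIdTimestamp(hex) {
  if (!/^[0-9a-fA-F]{24}$/.test(hex)) {
    throw new Error("An ObjectId is 24 hexadecimal characters (12 bytes)");
  }
  const seconds = parseInt(hex.slice(0, 8), 16); // seconds since the Unix epoch
  return new Date(seconds * 1000);
}

// Read the trailing 3-byte incrementing counter (last 6 hex digits).
function objectIdCounter(hex) {
  return parseInt(hex.slice(18), 16);
}

// Hypothetical ObjectId value, used only for this illustration.
const id = "507f1f77bcf86cd799439011";
console.log(objectIdTimestamp(id).toISOString()); // 2012-10-17T21:13:27.000Z
console.log(objectIdCounter(id));                 // 4427793
```

Because the timestamp comes first, sorting documents by a default _id roughly orders them by creation time.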

Documents manual
Insert Documents —
MongoDB Manual



No Defined Schema
(Schema-free Or Schema-less)



CRUD
 Create
 db.collection.insert( <document> )
 db.collection.save( <document> )
 db.collection.update( <query>, <update>, { upsert:
true } )
 Read
 db.collection.find( <query>, <projection> )
 db.collection.findOne( <query>, <projection> )
 Update
 db.collection.update( <query>, <update>, <options> )
 Delete
 db.collection.remove( <query>, <justOne> )



CRUD Examples



Examples

In RDBMS In MongoDB
Either insert the 1st document

Or create “Users” collection explicitly


Insertion

 The collection “users” is created automatically if it does not exist



Multi-Document Insertion
(Use of Arrays)

All the documents are inserted at once
Deletion
(Remove Operation)
 You can put a condition on any field in the document (even _id)

db.users.remove( {} ) Removes all documents from the users collection


Update

Otherwise, it will update only the 1st matching document

Equivalent to in SQL:



Update (Cont’d)

Two operators


Replace a document

Query Condition

New doc

For the document having item = “BE10”, replace it with the given document



Insert or Replace

The upsert option

If the document having item = “TBD1” is in the DB, it will be replaced. Otherwise, it will be inserted.


Find queries
 SELECT * FROM inventory
db.inventory.find( {} )
 SELECT * FROM inventory WHERE status = "D"
db.inventory.find( { status: "D" } )
 SELECT * FROM inventory WHERE status in ("A", "D")
db.inventory.find( { status: { $in: [ "A", "D" ] } } )
 SELECT * FROM inventory WHERE status = "A" AND qty < 30
db.inventory.find( { status: "A", qty: { $lt: 30 } } )
 SELECT * FROM inventory WHERE status = "A" OR qty < 30
db.inventory.find( { $or: [ { status: "A" }, { qty: { $lt: 30 } } ] } )
 SELECT * FROM inventory WHERE status = "A" AND ( qty < 30 OR item LIKE "p%")
db.inventory.find( { status: "A", $or: [ { qty: { $lt: 30 } }, { item: /^p/ } ] } )


Must Practice It

Install it, practice simple stuff, then move to complex stuff

Install it from here: http://www.mongodb.org

Manual: http://docs.mongodb.org/master/MongoDB-manual.pdf
Dataset: http://docs.mongodb.org/manual/reference/bios-example-collection/
Online Execution: https://docs.mongodb.com/manual/tutorial/insert-documents/
Key Value Database



Key Value Database
 Essential Features of Key-Value Databases
 Key-Value Database Data Modeling Terms
 Key-Value Architecture Terms
 Limitations of Key-Value Databases
 Key-value Database with Riak



Key-Value Pair Databases
 Key-value pair databases are the simplest form of NoSQL databases.
These databases are modeled on two components: keys and values.
 Keys - Keys are identifiers associated with values.
 For example, when you check luggage at an airport, the tag you receive has an identifier associated with your luggage. With your tag, you can find your luggage more efficiently than without it.
 The first customer checking bags on flight 1928 might be assigned ticket 1928.1 for her first bag and 1928.2 for her second bag. The second customer also has two bags, and he is assigned 1928.3 and 1928.4.




Key-Value Pair Databases
 Values - are data stored along with keys.
 Like luggage, values in a key-value database can store many different
things.
 Values can be as simple as a string, such as a name, or a number, such
as the number of items in a customer’s shopping cart.
 You can store more complex values, such as images or binary objects,
too.
 Key-value databases give developers a great deal of flexibility when
storing values.
 For example, strings can vary in length. Values can also vary in type. An
employee database might include photos of employees using keys such
as Emp328.photo.



Differences Between Key-Value and Relational Databases
 Key-value databases are modeled on minimal principles for storing and
retrieving data.
 Unlike in relational databases, there are no tables, so there are no
features associated with tables, such as columns and constraints on
columns.
 There is no need for joins in key-value databases, so there are no
foreign keys.
 Key-value databases do not support a rich query language such as SQL.
 Some key-value databases support buckets, or collections, for creating
separate namespaces within a database.





Essential Features of Key-Value Databases
 Simplicity –
 Key-value databases use a bare-minimum data structure.
 In key-value databases, you work with a simple data model. The
syntax for manipulating data is simple.
 Key-value databases are flexible and forgiving.
 If you make a mistake and assign the wrong type of data, for
example, a real number instead of an integer, the database usually
does not complain.
 This feature is especially useful when the data type changes, or you
need to support two or more data types for the same attribute.
shoppingCart[cart:1298:customerID] = 1982737
shoppingCart[cart:3985:customerID] = 'Johnson, Louise'
 Both numbers and strings can be used as customer identifiers.


Essential Features of Key-Value Databases
 Speed –
 Key-value databases are known for their speed.
 With a simple associative array data structure and design features to
optimize performance, key-value databases can deliver high-
throughput, data-intensive operations.
 One way to keep database operations running fast is to keep data in
memory.
 Reading and writing data to RAM is much faster than writing to a
disk.
 When a program changes the value associated with a key, the key-
value database can update the entry in RAM and then send a
message to the program that the updated value has been saved.
 The program can then continue with other operations. While the
program is doing something else, the key-value database can write
the recently updated value to disk.
 Similarly, read operations can be faster if data is stored in memory.



Essential Features of Key-Value Databases
 Speed –
 When the key-value database uses all the memory allocated to it, the
database will need to free some of the allocated memory before
storing copies of additional data.
 There are multiple algorithms for this, but a commonly used method
is known as least recently used (LRU).
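A least recently used policy can be sketched in JavaScript with a `Map`, which remembers insertion order. This is a minimal illustration under assumptions: the class name `LruCache` and the sample keys are invented here, and a real key-value database would track memory size rather than a simple entry count.

```javascript
// A small LRU cache built on Map, which iterates keys in insertion order.
class LruCache {
  constructor(capacity) {
    this.capacity = capacity;
    this.entries = new Map();
  }

  get(key) {
    if (!this.entries.has(key)) return undefined;
    const value = this.entries.get(key);
    // Re-insert so this key becomes the most recently used.
    this.entries.delete(key);
    this.entries.set(key, value);
    return value;
  }

  set(key, value) {
    if (this.entries.has(key)) {
      this.entries.delete(key);
    } else if (this.entries.size >= this.capacity) {
      // The first key in iteration order is the least recently used.
      const oldest = this.entries.keys().next().value;
      this.entries.delete(oldest);
    }
    this.entries.set(key, value);
  }
}

const cache = new LruCache(2);
cache.set("cart:1298", 3);
cache.set("cart:3985", 1);
cache.get("cart:1298");    // touch cart:1298
cache.set("cart:7777", 5); // evicts cart:3985
console.log([...cache.entries.keys()]); // [ 'cart:1298', 'cart:7777' ]
```

The entry that was read most recently survives, while the one that has gone longest without access is freed to make room.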



Essential Features of Key-Value Databases
 Scalability-
 Scalability is the capability to add or remove servers from a
cluster of servers as needed to accommodate the load on
the system.
 Key-value databases take different approaches to scaling
read and write operations.
 Two options:
 Master-slave replication
 Masterless replication

Key-Value Database Data Modeling Terms
 Key
 Value
 Namespace
 Partition
 Partition Key
 Schemaless



Key-Value Database Data Modeling Terms
 Key - A key is a reference to a value.
 It is analogous to an address.
 A key can take on different forms depending on the key-value
database used.
 At a minimum, a key is specified as a string of characters, such as
"Cust9876" or "Patient:A384J:Allergies".
 Some key-value databases, such as Redis (www.redis.io), support more complex data structures. In Redis, keys themselves are binary-safe strings, while the values stored under them can be structured.
 The supported data types in Redis version 2.8.13 include
 • Strings • Lists • Sets • Sorted sets • Hashes • Bit arrays
 Keys can also play an important role in implementing scalable
architectures. Keys are not only used to reference values, but
they are also used to organize data across multiple servers.



Key-Value Database Data Modeling Terms
 Value - A value is an object, typically a set of bytes, that has been
associated with a key.
 Values can be integers, floating-point numbers, strings of characters,
binary large objects (BLOBs), semistructured constructs such as JSON
objects, images, audio, etc.
 Most key-value databases will have a limit on the size of a value.
 Redis, for example, can have a string value up to 512MB in length.
 FoundationDB (foundationdb.com), a key-value database known for its
support of ACID transactions, limits the size of values to 100,000 bytes.
 Riak (www.basho.com) supports full-text indexing of values, so you
can use an API to find keys and values using search queries.



Key-Value Database Data Modeling Terms

 Namespace - A namespace is a collection of key-value pairs.
 You can think of a namespace as a set, a collection, a list of key-value
pairs without duplicates, or a bucket for holding key-value pairs.
 A namespace could be an entire key-value database.
 The essential characteristic of a namespace is that it is a collection of
key-value pairs that has no duplicate keys.



Key-Value Database Data Modeling Terms

 Namespaces implicitly define an additional prefix for keys.
 The customer management team could create a namespace called
custMgmt, and the order management team could create a namespace
called ordMgmt.
 They would then store all keys and values in their respective
namespaces.
 A key that both teams use, and that would otherwise collide,
effectively becomes two unique keys: custMgmt:Prod:12986:name and
ordMgmt:Prod:12986:name
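
The prefixing idea can be sketched as below. NamespacedStore is a hypothetical helper, not the API of any particular database, and the two product names are made-up values for illustration.

```python
class NamespacedStore:
    """Toy store that prefixes every key with its namespace."""

    def __init__(self):
        self._data = {}

    def put(self, namespace, key, value):
        self._data[f"{namespace}:{key}"] = value

    def get(self, namespace, key):
        return self._data.get(f"{namespace}:{key}")


store = NamespacedStore()
# Both teams use the same key, but in different namespaces.
store.put("custMgmt", "Prod:12986:name", "Dishwasher")
store.put("ordMgmt", "Prod:12986:name", "Dishwasher, white")

cust_value = store.get("custMgmt", "Prod:12986:name")
ord_value = store.get("ordMgmt", "Prod:12986:name")
```

Neither write overwrites the other, because the full keys differ once the namespace prefix is applied.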



Key-Value Database Data Modeling Terms
 Partition - A partitioned cluster is a group of servers
in which servers, or instances of key-value database
software running on servers, are each assigned a
subset of the database in order to spread the workload.
 EX: All keys starting with the letters A through L are
handled by Server 1 and all keys starting with M
through Z are managed by Server 2.
 But most of the keys may start with the letter C, as in
cust (customer), cmpg (campaign), comp
(component), and so on, whereas very few keys start
with letters from the latter half of the alphabet, for
example, warh (warehouse). This unbalances the
work done by each server in the cluster.
 Partition schemes should be chosen to distribute the
workload as evenly as possible across the cluster.
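
The imbalance in the A-L / M-Z example can be shown with a few sample keys. The keys below are invented for illustration, and the function assumes every key starts with a letter.

```python
from collections import Counter


def partition_by_first_letter(key):
    # A through L go to Server 1, M through Z go to Server 2.
    return "Server 1" if key[0].upper() <= "L" else "Server 2"


keys = ["cust:1001", "cust:1002", "cmpg:17", "comp:88", "comp:89", "warh:3"]
load = Counter(partition_by_first_letter(k) for k in keys)
```

Here Server 1 receives five of the six keys and Server 2 receives one, so Server 1 does most of the work.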



Key-Value Database Data Modeling Terms

 Partition Key - A partition key is a key used to determine which
partition should hold a data value.
 Any key in a key-value database is used as a partition key; good
partition keys are ones that distribute workloads evenly.
 Ex: Hash functions map an input string to a fixed-sized string that is
usually unique to the input string; the hash of a key can then be used
to choose a partition.
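
A minimal sketch of hash-based partitioning, assuming SHA-256 as the hash function and four partitions. Both choices, and the function name partition_for, are assumptions for the example; each database picks its own hash and partition count.

```python
import hashlib


def partition_for(key, partition_count):
    # Hash the key, then reduce the hash to a partition number.
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % partition_count


# Keys that all share the prefix "cust:" still spread across partitions.
keys = [f"cust:{n}" for n in range(1000)]
counts = [0] * 4
for k in keys:
    counts[partition_for(k, 4)] += 1
```

The same key always maps to the same partition, while keys with a common prefix are no longer sent to a single server.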



Key-Value Database Data Modeling Terms
 Schemaless - Schemaless is a term that describes the logical model of
a database.
 In the case of key-value databases, you are not required to define all the
keys and types of values you will use prior to adding them to the
database.
 EX: cust:8983:fullName = ‘Jane Anderson’
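
A small sketch of what schemaless means in practice, using a plain Python dictionary as a stand-in for a key-value database. The loyaltyPoints and address keys and their values are hypothetical additions to the example above.

```python
store = {}

# No table definition or column types are declared before writing.
store["cust:8983:fullName"] = "Jane Anderson"
store["cust:8983:loyaltyPoints"] = 1250
store["cust:8983:address"] = {"city": "Portland", "state": "OR"}

# Each value keeps whatever type the application gave it.
value_types = {key: type(value).__name__ for key, value in store.items()}
```

The application, not the database, is responsible for knowing what kind of value each key holds.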



Key-Value Architecture Terms
 The architecture of a key-value database is a set of characteristics about
the servers, networking components, and related software that allows
multiple servers to coordinate their work.
 Clusters
 Rings
 Replication



Key-Value Architecture Terms - Cluster
 Cluster - Clusters are sets of connected computers that coordinate
their operations.
 Clusters may be loosely or tightly coupled.
 Loosely coupled clusters consist of fairly independent servers that
complete many functions on their own with minimal coordination with
other servers in the cluster.
 Tightly coupled clusters tend to have high levels of communication
between servers. This is needed to support more coordinated
operations, or calculations, on the cluster.
 Key-value clusters tend to be loosely coupled.

Key-Value Architecture Terms - Cluster
 Servers, also known as nodes, in a loosely coupled cluster share information
about the range of data the server is responsible for and routinely send
messages to each other to indicate they are still functioning.
 When a node fails, the other nodes in the cluster can respond by taking over
the work of that node.
 Redis supports master-slave clusters. The master node is responsible for
accepting read and write operations and for replicating copies of data
to slave nodes that respond to read requests.
 If a master node fails, the remaining nodes in the cluster will elect a new
master node.
 If a slave node fails, the other nodes in the cluster can continue to respond to
read requests.
 Riak supports Masterless clusters where all nodes carry out operations to
support read and write operations.
 If one of those nodes fails, other nodes will take on the read and write
responsibilities of the failed node.
 Each node in a masterless cluster is responsible for managing some set of
partitions. One way to organize partitions is in a ring structure.



Key-Value Architecture Terms - Ring
 Ring - A ring is a logical structure for organizing partitions.
 A ring is a circular pattern in which each server or instance of key-value
database software running on a server is linked to two adjacent servers or
instances.
 Each server or instance is responsible for managing a range of data based on a
partition key.
 A ring architecture helps to simplify potentially complex operations.
 For example, whenever a piece of data is written to a server, it is also written to
the two servers linked to the original server.
 This enables high availability of a key-value database.
 For example, if Server 4 fails, both Server 3 and Server 5 could respond to read
requests for the data on Server 4. Servers 3 and 5 could also accept write
operations destined for Server 4. When Server 4 is back online, Servers 3 and 5
can update Server 4 with the writes that occurred while it was down.
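
The ring behaviour described above can be sketched as follows, assuming eight servers and list index 3 standing for Server 4. The Ring class is a toy model: it shows writes going to a server and its two neighbours and reads falling back to a neighbour, but it does not model the catch-up step when the failed server returns.

```python
class Ring:
    def __init__(self, server_count):
        self.server_count = server_count
        self.data = [{} for _ in range(server_count)]  # index 0 is Server 1
        self.down = set()

    def neighbors(self, index):
        # The two adjacent servers; the ends of the list wrap around.
        return [(index - 1) % self.server_count, (index + 1) % self.server_count]

    def write(self, index, key, value):
        # Write to the target server and both of its neighbours, if they are up.
        for i in [index] + self.neighbors(index):
            if i not in self.down:
                self.data[i][key] = value

    def read(self, index, key):
        # Try the target server first, then fall back to its neighbours.
        for i in [index] + self.neighbors(index):
            if i not in self.down and key in self.data[i]:
                return self.data[i][key]
        return None


ring = Ring(server_count=8)
server4 = 3
ring.write(server4, "Cust9876", "Jane Anderson")
ring.down.add(server4)  # Server 4 fails
value_while_down = ring.read(server4, "Cust9876")
```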

Key-Value Architecture Terms - Replication
 Replication - Replication is the process of saving multiple copies of
data in your cluster.
 This provides for high availability.
 One parameter you will want to consider is the number of replicas to
maintain.
 The more replicas you have, the less likely you will lose data; however,
you might have lower performance with a large number of replicas.
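
The trade-off can be made concrete with a back-of-the-envelope calculation. It assumes that each replica is lost independently with the same probability, which is a simplification (real failures are often correlated), and the 1% figure is an arbitrary example value.

```python
def chance_all_replicas_lost(p_single_loss, replica_count):
    # Under the independence assumption, data is lost only if every replica is lost.
    return p_single_loss ** replica_count


def copies_written_per_update(replica_count):
    # Every update has to be written once per replica.
    return replica_count


loss_with_one_copy = chance_all_replicas_lost(0.01, 1)
loss_with_three_copies = chance_all_replicas_lost(0.01, 3)
write_cost_with_three_copies = copies_written_per_update(3)
```

Going from one replica to three makes losing every copy far less likely under these assumptions, but each update now costs three writes instead of one.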



Limitations of Key-Value Databases
 Look Up Values by Key Only
 Key-Value Databases Do Not Support Range Queries
 No Standard Query Language Comparable to SQL for
Relational Databases
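
The first two limitations can be illustrated with a dictionary standing in for a key-value database. The customer names and the customer_id helper are invented for the example. Lookup by key is direct, while finding a key by its value or selecting a range of keys means the application has to scan every key itself.

```python
store = {
    "cust:1001:fullName": "Jane Anderson",
    "cust:1002:fullName": "Ravi Kumar",
    "cust:1003:fullName": "Mei Chen",
}

# Supported: look up a value by its exact key.
by_key = store["cust:1002:fullName"]

# Not supported directly: find keys by value. The application scans everything.
keys_for_value = [k for k, v in store.items() if v == "Mei Chen"]


def customer_id(key):
    return int(key.split(":")[1])


# Not supported directly: a range query. Again, a full scan in application code.
in_range = sorted(k for k in store if 1001 <= customer_id(k) <= 1002)
```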



RIAK Database



What is RIAK?
 Key Value store
 Distributed and horizontally scalable
 Fault Tolerant
 Highly available
 Built for web
 Based on Amazon Dynamo



What is RIAK?
 Riak is a highly resilient NoSQL database.
 It ensures your most critical data is always available and that your Big
Data applications can scale.
 Riak KV can be operationalized at lower costs than both relational and
other NoSQL databases, especially at scale.
 It runs on commodity hardware (inexpensive, widely available, and
basically interchangeable with other hardware of its type).
 Riak KV stores data as a combination of keys and values, and is a
fundamentally content-agnostic database.
 You can use it to store anything you want – JSON, XML, HTML,
documents, binaries, images, and more.
 Keys are simply binary values used to uniquely identify a value.



CORE CONCEPTS in Riak
• Node - An instance of Riak
• Cluster - A collection of connected Riak nodes
• Bucket - Logical grouping of objects. Shared configuration
• Key - An identifier for a record/object
• Value - Opaque binary representation of data stored with key
• Metadata - Additional data linked to record, not part of value
• Riak Object - Bucket, Key, Value, and Metadata. Unit of replication.

CORE CONCEPTS in Riak
• Partition - Logical division of storage
• Vnode - Process handling requests and managing a partition
• Ownership Handoff - Transfer of data on cluster change
• Hinted Handoff - Transfer of data on node/network failure
• Quorum - Set of nodes required to participate in a transaction

THE RIAK RING
 The basis of Riak KV’s masterless architecture, replication, and fault
tolerance is the Ring.
 This Ring is a managed collection of partitions that share a common
hash space.
 The hash space is called a Ring because the last value in the hash space
is thought of as being adjacent to the first value in the space.
 Replicas of data are stored in the “next N partitions” of the hash space.
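
The "next N partitions" idea can be sketched as below. This is an illustration, not Riak's actual code: it assumes a 160-bit SHA-1 hash space, 64 equal partitions, N = 3, and a simple "bucket/key" string encoding, and the function name preference_list is chosen for the example.

```python
import hashlib

RING_SIZE = 2 ** 160  # number of positions in a 160-bit hash space


def preference_list(bucket, key, partition_count=64, n_val=3):
    # Hash the bucket and key to a position on the ring.
    digest = hashlib.sha1(f"{bucket}/{key}".encode("utf-8")).digest()
    position = int.from_bytes(digest, "big")
    # Find the partition that covers that position.
    partition_width = RING_SIZE // partition_count
    first = position // partition_width
    # Replicas go to N consecutive partitions, wrapping from the last to the first.
    return [(first + i) % partition_count for i in range(n_val)]


replica_partitions = preference_list("custMgmt", "Prod:12986:name")
```

The modulo step is what makes the hash space a ring: a replica list that starts at partition 63 continues at partition 0.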



NODES AND VNODES
 Each node in a Riak KV cluster manages one or many virtual nodes, or
vnodes.
 Each vnode is a separate process which is assigned a partition of the
Ring, and is responsible for a number of operations in a Riak cluster,
from the storing of objects, to handling of incoming read/write
requests from other vnodes, to interpreting causal context metadata for
objects.
 If your cluster has 64 partitions and you are running three nodes, two
of your nodes will have 21 vnodes, while the third node holds 22
vnodes.
 The concept of vnodes is important as we look at data replication.
 No single vnode is responsible for more than one replica of an object.
 Each object belongs to a primary vnode, and is then replicated to
neighboring vnodes located on separate nodes in the cluster.
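
The 64-partition, three-node example can be checked with a simple round-robin assignment. Round-robin is an assumption made for this sketch, and the node names are hypothetical; it is not a description of how Riak actually claims partitions.

```python
from collections import Counter


def assign_partitions(partition_count, node_names):
    # Hand out partitions to nodes in turn: 0 -> first node, 1 -> second, and so on.
    return {p: node_names[p % len(node_names)] for p in range(partition_count)}


owners = assign_partitions(64, ["node_a", "node_b", "node_c"])
vnodes_per_node = Counter(owners.values())
```

With 64 partitions and three nodes, one node ends up with 22 vnodes and the other two with 21 each. Note that this naive scheme places the last and first partitions (63 and 0) on the same node, so a production placement scheme needs more care to keep neighbouring replicas on separate nodes.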



CAP Theorem - AP
 Riak KV is a tunable AP system.
 By default, Riak KV replicas are “eventually consistent,” meaning that
while data is always available, not all replicas may have the most recent
update at the exact same time, causing brief periods—generally on the
order of milliseconds—of inconsistency while all state changes are
synchronized.
 Riak KV is designed to deliver maximum data availability, so as long as
your client can reach one Riak KV server, it can write data.
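
One way to picture "tunable" is the general quorum-overlap rule sketched below. The parameter names n_val, r, and w follow Riak's replication settings (number of replicas, replicas that must answer a read, replicas that must acknowledge a write), but the function quorums_overlap is a hypothetical helper, and the rule describes the simple case without node failures rather than Riak's internals.

```python
def quorums_overlap(n_val, r, w):
    # If r + w > n_val, every read quorum shares at least one replica
    # with every write quorum, so a read can see the latest write.
    return r + w > n_val


favor_availability = quorums_overlap(n_val=3, r=1, w=1)
favor_consistency = quorums_overlap(n_val=3, r=2, w=2)
```

With r = 1 and w = 1 a request succeeds as long as one replica responds, which favors availability but allows stale reads. Raising r and w trades some availability for more consistent reads.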



References
 Basho Technologies, Inc. Riak Documentation:
http://docs.basho.com/riak/latest/
 FoundationDB. Key-Value Store 2.0 Documentation:
https://foundationdb.com/key-value-store/documentation/index.html
 Oracle Corporation. “Oracle NoSQL Database, 12c Release 1”:
http://docs.oracle.com/cd/NOSQL/html/index.html
 Redis Documentation:
http://redis.io/documentation



End of Unit 4
