Module 2.3 Document Databases in NoSQL Systems

Document Databases in NoSQL
- stores data in JSON, BSON, or XML documents. Follows Tree Structure
- move documents under one document, any particular elements can be indexed to run
queries faster.
- close to the data objects which are used in many applications which means very less
translations are required to use data in applications.
- JSON is a native language that is often used to store and query data too. So in the
document data model, each document has a key-value pair.
- works well with use cases such as catalogs, user profiles, and content management --
enable flexible indexing, powerful ad hoc queries, and analytics over collections of
documents.

Eg. Application
Catalogs
For example, in an e-commerce application,
products have different numbers of attributes.
Managing thousands of attributes in relational databases is inefficient, and the
reading performance is affected.
Using a document database, each product’s attributes can be described in a
single document for easy management and faster reading speed. Changing
the attributes of one product won’t affect others.

Features:
● Document Type Model: data is stored in documents rather than tables or
graphs, so it becomes easy to map things in many programming languages.
● Flexible Schema: not all documents in a collection need to have the same
fields.
● Distributed and Resilient: Document data models are very much dispersed
which is the reason behind horizontal scaling and distribution of data.
● Manageable Query Language: CRUD (Create Read Update Destroy)
operations on the data model.

Examples of Document Data Models :
● Amazon DocumentDB
● MongoDB (Document databases, such as CouchDB and MongoDB store entire documents in the form of
JSON objects. You can think of these objects as nested key-value pairs.)
● Cosmos DB
● ArangoDB
● Couchbase Server
● CouchDB

Advantages:
● Schema-less: no restrictions in the format and the structure of data storage.
● Faster creation of document and maintenance: simple to create a
document and apart from this maintenance requires is almost nothing.
● Open formats: simple build process that uses XML, JSON, and its other
forms.
● Built-in versioning: It has built-in versioning which means as the documents
grow in size there might be a chance they can grow in complexity. Versioning
decreases conflicts.

Disadvantages:
● Weak Atomicity: It lacks in supporting multi-document ACID transactions. A
change in the document data model involving two collections will require us to
run two separate queries i.e. one for each collection. This is where it breaks
atomicity requirements.
● Consistency Check Limitations: One can search the collections and
documents that are not connected to an author collection but doing this might
create a problem in the performance of database performance.
● Security: Nowadays many web applications lack security which in turn results
in the leakage of sensitive data. So it becomes a point of concern, one must
pay attention to web app vulnerabilities.

Applications of Document Data Model :
● Content Management: These data models are very much used in creating various
video streaming platforms, blogs, and similar services Because each is stored as a
single document and the database here is much easier to maintain as the service
evolves over time.
● Book Database: These are very much useful in making book databases because
as we know this data model lets us nest.
● Catalog: When it comes to storing and reading catalog files these data models are
very much used because it has a fast reading ability if incase Catalogs have
thousands of attributes stored.
● Analytics Platform: These data models are very much used in the Analytics
Platform.

Mongo DB sample operations.
● First N / Last N Rows
● Equal to / Not Equal to
● Equal / Not Equal to One of Them — IN
● Multiple Conditions with AND / OR
● Regular Expression — Text Matching — Contains, Starts / Ends With
● Is NA (Null) or Not NA
● Filter with Numeric Columns
● Filter with Date/Time Columns
● Filter with Array Columns
● Filter with ‘Array of Documents’ Columns

Why Graph database?
Graph Database : the relationships between data are prioritized.
data is used where one data is connected directly or indirectly.
join operations are not required in this database which reduces the cost.
The relationships and properties are stored as first-class entities in Graph Database.
Allow organizations to connect the data with external sources as well.
if the organization wants to find a particular data that is connected with another data in another
table, so first join operation is performed between the tables, and then search for the data is done
row by row.
Graph database solves this big problem. They store the relationships and properties along with
the data. So if the organization needs to search for a particular data, then with the help of
relationships and properties the nodes can be found without joining or without traversing row by
row. Thus the searching of nodes is not dependent on the amount of data.

● Nodes: represent the objects or instances. They are equivalent to a row in
database. The node basically acts as a vertex in a graph. The nodes are
grouped by applying a label to each member.
● Relationships: They are basically the edges in the graph. They have a
specific direction, type and form patterns of the data. They basically establish
relationship between nodes.
● Properties: They are the information associated with the nodes.
Eg. Neo4j, Oracle NoSQL DB, Graph base

Types of Graph Databases:
● Property Graphs: These graphs are used for querying and analyzing data by
modelling the relationships among the data. It comprises of vertices that has
information about the particular subject and edges that denote the relationship.
The vertices and edges have additional attributes called properties.
● RDF Graphs: It stands for Resource Description Framework. It focuses more
on data integration. They are used to represent complex data with well defined
semantics. It is represented by three elements: two vertices, an edge that reflect
the subject, predicate and object of a sentence. Every vertex and edge is
represented by URI(Uniform Resource Identifier).

When to Use Graph Database?
● Graph databases should be used for heavily interconnected data.
● It should be used when amount of data is larger and relationships are present.
● It can be used to represent the cohesive picture of the data.

How Graph and Graph Databases Work?
Graph databases provide graph models They allow users to
perform traversal queries since data is connected. Graph
algorithms are also applied to find patterns, paths and other
relationships this enabling more analysis of the data. The
algorithms help to explore the neighboring nodes, clustering
of vertices analyze relationships and patterns. Countless joins
are not required in this kind of database.

Example of Graph Database:
● Recommendation engines in E commerce use graph databases to provide customers with
accurate recommendations, updates about new products thus increasing sales and satisfying the
customer’s desires.
● Social media companies use graph databases to find the “friends of friends” or products that the
user’s friends like and send suggestions accordingly to user.
● To detect fraud Graph databases play a major role. Users can create graph from the transactions
between entities and store other important information. Once created, running a simple query will
help to identify the fraud.
Advantages of Graph Database:
● Potential advantage of Graph Database is establishing the relationships with external sources as
well
● No joins are required since relationships is already specified.
● Query is dependent on concrete relationships and not on the amount of data.
● It is flexible and agile.
● it is easy to manage the data in terms of graph.

Neo4j : data model
● Person nodes, which have the following properties: name and born.
● Movie nodes, which have the following properties: title, released, and tagline.

● The data model also contains five different relationship types between
the Person and Movie nodes: ACTED_IN, DIRECTED, PRODUCED, WROTE
, and REVIEWED. Two of the relationship types have properties:
● The ACTED_IN relationship type, which has the roles property.
● The REVIEWED relationship type, which has a summary property and
a rating property.

Sample query statements in Cypher query language
CREATE (TheMatrix:Movie {title:'The Matrix', released:1999, tagline:'Welcome to the Real World'})
CREATE (Keanu:Person {name:'Keanu Reeves', born:1964})
CREATE (Carrie:Person {name:'Carrie-Anne Moss', born:1967})
CREATE (Laurence:Person {name:'Laurence Fishburne', born:1961})
CREATE (Hugo:Person {name:'Hugo Weaving', born:1960})
CREATE (LillyW:Person {name:'Lilly Wachowski', born:1967})
CREATE (LanaW:Person {name:'Lana Wachowski', born:1965})
CREATE (JoelS:Person {name:'Joel Silver', born:1952})
CREATE
(Keanu)-[:ACTED_IN {roles:['Neo']}]->(TheMatrix),
(Carrie)-[:ACTED_IN {roles:['Trinity']}]->(TheMatrix),
(Laurence)-[:ACTED_IN {roles:['Morpheus']}]->(TheMatrix),
(Hugo)-[:ACTED_IN {roles:['Agent Smith']}]->(TheMatrix),
(LillyW)-[:DIRECTED]->(TheMatrix),
(LanaW)-[:DIRECTED]->(TheMatrix),
(JoelS)-[:PRODUCED]->(TheMatrix)

● Finding nodes
find the nodes with Person label and the name Keanu Reeves, and return
the name and born properties of the found nodes
MATCH (keanu:Person {name:'Keanu Reeves'})
RETURN keanu.name AS name, keanu.born AS born
Table 1. Result
name born
"Keanu Reeves" 1964

MATCH (tom:Person
{name:'Tom Hanks'})-[r]-
>(m:Movie)
RETURN type(r) AS type,
m.title AS movie

To be a good candidate for a general class of big data
problems, NoSQL solutions should
 Be efficient with input and output and scale linearly with growing data size.
 Be operationally efficient. Organizations can’t afford to hire many people to run
the servers.
 Require that reports and analyses be performed by nonprogrammers using
simple tools—not every business can afford a full-time Java programmer to write
on-demand queries.
 Meet the challenges of distributed computing, including consideration of latency
between systems and eventual node failures.
 Meet both the needs of overnight batch processing economy-of-scale and time
critical event processing

● For graph traversals to be fast, the entire graph should be in main memory.
This is why graph stores work most efficiently when you have enough RAM to
hold the graph. If you can’t keep your graph in RAM, graph stores will try to
swap the data to disk, which will decrease graph query performance by a
factor of 1,000.
● The only way to combat the problem is to move to a shared-memory
architecture, where multiple threads all access a large RAM structure without
the graph data moving outside of the shared RAM

Analyzing big data with a shared-nothing architecture
There are three ways that resources can be shared between computer systems:
1. shared RAM,
2. shared disk, and
3. shared-nothing

Choosing distribution models: master-slave versus peer-to-peer

● The disadvantage of peer-to peer networks is that there’s an increased
complexity and communication overhead that must occur for all nodes to be
kept up to date with the cluster status.
● Using the right distribution model will depend on your business requirements:
if high availability is a concern, a peer-to-peer network might be the best
solution. If you can manage your big data using batch jobs that run in off
hours, then the simpler master-slave model might be best.

Variations of NoSQL Architectural Patterns
● Database architecture : distributed (manages single database distributed in multiple servers
located at various sites) or federated (manages independent and heterogeneous databases at
multiple sites).
● In Internet of Things (IoT) architecture a virtual sensor has to integrate multiple data streams from
real sensors into a single data stream.
Results : stored temporarily or stored permanently
Uses data and sources-centric IoT middleware.
● Scalable architectural patterns can be used to form new scalable architectures. For example, a
combination of load balancer and shared-nothing architecture; distributed Hash Table and Content
Addressable network (Cassandra); Publish/Subscribe (EventJava); etc.

Variations of NoSQL Architectural Patterns
● The variations in architecture are based on system requirements like agility,
availability (any-where, anytime), intelligence, scalability, collaboration and
low latency.
● Agility is given as a service using virtualization or cloud computing;
● availability is the service given by internet and mobility;
● intelligence is given by machine learning and predictive analytics;
● scalability (flexibility of using commodity machines) is given by Big Data
Technologies/cloud platforms;
● collaboration is given by (enterprise-wide) social network application; and
● low latency (event driven) is provided by in- memory databases.

Analyzing Big Data with a Shared Nothing Architecture
● distributed computing architecture : resource sharing possible or share
nothing.
● shared RAM, shared disk, and shared-nothing.
● In shared RAM, many CPUs access a single shared RAM over a high-speed bus.
This system is ideal for large computation and also for graph stores. For graph
traversals to be fast, the entire graph should be in main memory.
● The shared disk system, processors have independent RAM but shares disk space
using a storage area network (SAN). Big data uses commodity machines which
shares nothing (shares no resources).
● (key−value store and document store) are cache-friendly.
● Row stores and graph stores are not cache-friendly

Choosing Distribution Models
● There are two styles of distributing data: Sharding and replication.
● Riak database shards the data and also replicates it.
● Sharding: horizontal partitioning
● Replication: Master−slave replication, Peer-to-peer replication.
● The right distribution model is chosen based on the business requirement. For
example, for batch processing jobs, master-slave is chosen and for high
availability jobs, peer-to-peer distribution model is chosen.
● Hadoop’s initial version has master−slave architecture
● Later versions removed SPOF. HBase has a master−slave model, while
Cassandra has a peer-to-peer model.

When to use which store?
● Key – Value : constant stream of small R-W
● Document DB : natural data modelling, Programmer friendly , CRUD.
● Columnar : Massive write load, High availability, Map Reduce.
● Graph : Graph algorithms and relations

Module 2.3 Document Databases in NoSQL Systems

Recommended

More Related Content

Similar to Module 2.3 Document Databases in NoSQL Systems (20)

More from NiramayKolalle (6)

Recently uploaded (20)

Module 2.3 Document Databases in NoSQL Systems