
MODULE- 2: The MongoDB Data Model, Using

MongoDB Shell, MongoDB Architecture

Compiled By: Asst. Spruha S More


Vidyalankar School of
Information Technology [email protected]
Wadala (E), Mumbai
www.vsit.edu.in
Certificate
This is to certify that the e-book titled “The MongoDB Data Model, Using
MongoDB Shell, MongoDB Architecture” comprises all elementary learning
tools for a better understanding of the relevant concepts. This e-book is
comprehensively compiled as per the predefined eight parameters and
guidelines.

Signature Date: 23-05-2019


Ms. Spruha More
Assistant Professor
Department of Information technology
DISCLAIMER: The information contained in this e-book is compiled and distributed
for educational purposes only. This e-book has been designed to help learners
understand relevant concepts with a more dynamic interface. The compiler of this
e-book and Vidyalankar School of Information Technology give full and due credit to
the authors of the contents, developers and all websites from wherever information
has been sourced. We acknowledge our gratitude towards the websites YouTube,
Wikipedia, and Google search engine. No commercial benefits are being drawn from
this project.
Unit II The MongoDB Data Model, Using MongoDB Shell,
MongoDB Architecture

Contents
The MongoDB Data Model: The Data Model, JSON and BSON, The
Identifier (_id), Capped Collection, Polymorphic Schemas, Object-
Oriented Programming, Schema Evolution

Using MongoDB Shell: Basic Querying, Create and Insert, Explicitly


Creating Collections, Inserting Documents Using Loop, Inserting by
Explicitly Specifying _id, Update, Delete, Read, Using Indexes,
Stepping Beyond the Basics, Using Conditional Operators, Regular
Expressions, MapReduce, aggregate(), Designing an Application’s
Data Model, Relational Data Modeling and Normalization, MongoDB
Document Data Model Approach

MongoDB Architecture: Core Processes, mongod , mongo, mongos,


MongoDB Tools, Standalone Deployment, Replication, Master/Slave
Replication, Replica Set, Implementing Advanced Clustering with
Replica Sets, Sharding, Sharding Components, Data Distribution
Process, Data Balancing Process, Operations, Implementing Sharding,
Controlling Collection Distribution (Tag-Based Sharding), Points to
Remember When Importing Data in a Sharded Environment, Monitoring for
Sharding, Monitoring the Config Servers, Production Cluster Architecture,
Scenario 1, Scenario 2, Scenario 3, Scenario 4

Recommended Books
• Practical MongoDB, Shakuntala Gupta Edward and Navin Sabharwal, Apress
• Beginning jQuery, Jack Franklin and Russ Ferguson, Apress, Second Edition
• Next Generation Databases, Guy Harrison, Apress
• Beginning JSON, Ben Smith, Apress

Pre-requisites: Data Visualization
Chapter 4: The MongoDB Data Model

4.1 The Data Model


A MongoDB deployment can have many databases. Each database is a set of collections.
Collections are similar to the concept of tables in SQL; however, they are schemaless. Each
collection can have multiple documents. Think of a document as a row in SQL. Figure 4-1
depicts the MongoDB database model.

Figure 4-1. MongoDB database model

Example of a Region collection:

{ "R_ID" : "REG001", "Name" : "United States" }

{ "R_ID" :1234, "Name" : "New York" , "Country" : "United States" }

In this code, you have two documents in the Region collection. Although both
documents are part of a single collection, they have different structures: the second
document has an additional field of information, Country. In fact, if you look
at the “R_ID” field, it stores a string value in the first document whereas it’s a
number in the second document.

Thus, a collection’s documents can have entirely different schemas. It is up to the
application to decide whether to store documents with different structures together
in a single collection or to use multiple collections.

4.1.1 JSON and BSON

MongoDB is a document-based database. It uses Binary JSON for storing its data.
JSON stands for JavaScript Object Notation. It’s a standard used for data interchange
in today’s modern Web (along with XML). The format is human and machine
readable. It is not only a great way to exchange data but also a nice way to store
data. All the basic data types (such as strings, numbers, Boolean values, and arrays)
are supported by JSON.

The following code shows what a JSON document looks like:

{
"_id" : 1,
"name" : { "first" : "John", "last" : "Doe" },
"publications" : [
{
"title" : "First Book",

"year" : 1989,
"publisher" : "publisher1"
},
{ "title" : "Second Book",
"year" : 1999,

"publisher" : "publisher2"
}
]
}

4.1.1.1 Binary JSON (BSON)

MongoDB stores the JSON document in a binary-encoded format, termed BSON.
The BSON data model is an extended form of the JSON data model.
MongoDB’s implementation of a BSON document is fast, highly traversable, and
lightweight. It supports embedding of arrays and objects within other arrays, and
also enables MongoDB to reach inside the objects to build indexes and match
objects against queried expressions, both on top-level and nested BSON keys.
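As an illustration of querying and indexing on nested BSON keys, here is a minimal shell sketch. The authors collection name and the dot-notation queries below are illustrative additions, assuming the JSON document shown above has been inserted:

> db.authors.insert({
      "_id" : 1,
      "name" : { "first" : "John", "last" : "Doe" },
      "publications" : [ { "title" : "First Book", "year" : 1989, "publisher" : "publisher1" } ]
  })
> db.authors.find({ "name.first" : "John" })                    // match on a nested key
> db.authors.find({ "publications.year" : { "$gt" : 1985 } })   // reach inside the embedded array
> db.authors.ensureIndex({ "publications.year" : 1 })           // index on a nested key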

4.1.2 The Identifier (_id)

MongoDB stores data in documents. Documents are made up of key-value pairs.


Although a document can be compared to a row in RDBMS, unlike a row, documents
have a flexible schema. A key, which is nothing but a label, can be roughly compared
to the column name in RDBMS. A key is used for querying data from the documents.
Hence, like an RDBMS primary key (used to uniquely identify each row), you need to
have a key that uniquely identifies each document within a collection. This is referred
to as _id in MongoDB.

If you have not explicitly specified any value for the _id key, a unique value will be
automatically generated and assigned to it by MongoDB. This key value is immutable
and can be of any data type except arrays.
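For example (a minimal shell sketch; the ObjectId value shown is illustrative), an insert without _id gets one generated automatically, while an explicit _id is stored as given:

> db.users.insert({ "Name" : "Implicit id user" })
> db.users.findOne({ "Name" : "Implicit id user" })
{ "_id" : ObjectId("..."), "Name" : "Implicit id user" }
> db.users.insert({ "_id" : 101, "Name" : "Explicit id user" })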

4.1.3 Capped Collection

MongoDB has a concept of capping the collection. This means it stores the
documents in the collection in the inserted order. As the collection reaches its limit,
documents are removed from the collection in FIFO (first in, first out) order, which
means that the oldest (least recently inserted) documents are removed first.

This is good for use cases where the order of insertion needs to be maintained
automatically and deletion of records after a fixed size is required. One such use
case is log files that get automatically truncated after reaching a certain size.
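A capped collection is created explicitly with createCollection; the collection name and size limits below are illustrative:

> db.createCollection("applog", { capped: true, size: 1048576, max: 5000 })
{ "ok" : 1 }

Here size is the maximum size in bytes and max is an optional maximum number of documents; once either limit is reached, the oldest documents are overwritten.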

4.2 Polymorphic Schemas

A polymorphic schema is a schema where a collection has documents of different


types or schemas. A good example of this schema is a collection named Users. Some
user documents might have an extra fax number or email address, while others
might have only phone numbers, yet all these documents coexist within the same
Users collection. This schema is generally referred to as a polymorphic schema.
4.2.1 Object-Oriented Programming

Object-oriented programming enables you to have classes share data and


behaviours using inheritance. It also lets you define functions in the parent class that
can be overridden in the child class and thus will function differently in a different
context. In other words, you can use the same function name to manipulate the child
as well as the parent class, although under the hood the implementations might be
different. This feature is referred to as polymorphism.

Suppose you have an application that lets the user upload and share different
content types such as HTML pages, documents, images, videos, etc. Although many
of the fields are common across all of the above-mentioned content types (such as
Name, ID, Author, Upload Date, and Time), not all fields are identical. For example, in
the case of images, you have a binary field that holds the image content, whereas an
HTML page has a large text field to hold the HTML content. In this scenario, the
MongoDB polymorphic schema can be used wherein all of the content node types
are stored in the same collection, such as LoadContent, and each document has
relevant fields only.

// "Document collections" - "HTML Page" document

id: 1,
title: "Hello",
type: "HTMLpage",
text: "<html>Hi..Welcome to my world</html>"
}
...
// Document collection also has a "Picture" document
{
id: 3,
title: "Family Photo",
type: "JPEG",
sizeInMB: 10,........
}

This schema not only enables you to store related data with different structures
together in the same collection, it also simplifies the querying. The same collection can
be used to perform queries on common fields, such as fetching all content uploaded
on a particular date and time, as well as queries on specific fields, such as finding
images with a size greater than X MB.

Thus, object-oriented programming is one of the use cases where having a


polymorphic schema makes sense.

4.2.2 Schema Evolution

When you are working with databases, one of the most important considerations
that you need to account for is schema evolution (i.e. the impact of schema changes
on the running application). The design should be done in such a way as to have
minimal or no impact on the application, meaning no or minimal downtime and no or
very minimal code changes.

Typically, schema evolution happens by executing a migration script that upgrades


the database schema from the old version to the new one. If the database is not in
production, the script can be simple drop and recreation of the database. However, if
the database is in a production environment and contains live data, the migration
script will be complex because the data will need to be preserved. The script should
take this into consideration. Although MongoDB offers an Update option that can be
used to update all the documents’ structure within a collection if there’s a new
addition of a field, imagine the impact of doing this if you have thousands of
documents in the collection. It would be very slow and would have a negative impact
on the underlying application’s performance. One of the ways of doing this is to
include the new structure in the new documents being added to the collection and
then gradually migrate the collection in the background while the application is still
running. This is one of the many use cases where having a polymorphic schema is
advantageous.
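The following is a minimal sketch of such a gradual (lazy) migration; the schema_version and Company field names are hypothetical and not part of the text above:

// New documents are written with the new structure immediately
> db.users.insert({ "Name" : "New User", "Company" : "TestComp", "schema_version" : 2 })
// Old documents are migrated in small batches in the background while the application runs
> db.users.find({ "schema_version" : { "$exists" : false } }).limit(100).forEach(function(doc) {
      db.users.update({ "_id" : doc._id },
                      { "$set" : { "Company" : "Unknown", "schema_version" : 2 } });
  });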
Chapter 6 Using MongoDB Shell

“mongo shell comes with the standard distribution of MongoDB. It offers a JavaScript
environment with complete access to the language and the standard functions. It
provides a full interface for the MongoDB database.”

https://www.youtube.com/watch?v=EyC_Bi9kAtM

6.1 Basic Querying

6.1.1 Create and Insert

Now we know how the databases and collections are created. As explained earlier,
the documents in MongoDB are in the JSON format.

First, by issuing the db command you will confirm that the context is the mydbpoc
database.

> db

mydbpoc

>

Next, create two documents, user1 and user2. The first document follows one
structure (separate FName and LName fields) whereas the second follows another
(a single Name field):

> user1 = {FName: "Test", LName: "User", Age:30, Gender: "M", Country: "US"}
{
"FName" : "Test",
"LName" : "User",
"Age" : 30,
"Gender" : "M",
"Country" : "US"
}
> user2 = {Name: "Test User", Age:45, Gender: "F", Country: "US"}
{ "Name" : "Test User", "Age" : 45, "Gender" : "F", "Country" : "US" }
>
You will next add both these documents (user1 and user2) to the users collection with
the following operations:

> db.users.insert(user1)
> db.users.insert(user2)

The above operations will not only insert the two documents into the users collection
but will also create the collection as well as the database. The same can be verified
using the show collections and show dbs commands.

show dbs will display the list of databases.

> show dbs


admin 0.078GB
local 0.078GB
mydb 0.078GB
mydbproc 0.078GB

And show collections will display the list of collections in the current database.

> show collections


system.indexes
users
>
Executing the command db.users.find() will display the documents in the users
collection.

> db.users.find()
{ "_id" : ObjectId("5450c048199484c9a4d26b0a"), "FName" : "Test", "LName" : "User",
"Age" : 30, "Gender": "M", "Country" : "US" }
{ "_id" : ObjectId("5450c05d199484c9a4d26b0b"), "Name" : "Test", User", "Age" : 45,
"Gender" : "F", "Country" : "US"
6.1.2 Explicitly Creating Collections

A user can also explicitly create a collection before executing the insert statement:

db.createCollection("users")

6.1.3 Inserting Documents Using Loop

Documents can also be added to the collection using a for loop. The following code
inserts users using a for loop:

> for(var i=1; i<=20; i++) db.users.insert({"Name" : "Test User" + i, "Age": 10+i,
"Gender" : "F", "Country" : "India"})
>
In order to verify that the insert is successful, run the find command on the collection.
> db.users.find()
{ "_id" : ObjectId("52f48cf474f8fdcfcae84f79"), "FName" : "Test", "LName" : "User",
"Age" : 30, "Gender" : "M", "Country" : "US" }
{ "_id" : ObjectId("52f48cfb74f8fdcfcae84f7a"), "Name" : "Test User", "Age" : 45
, "Gender" : "F", "Country" : "US" }
................
{ "_id" : ObjectId("52f48eeb74f8fdcfcae84f8c"), "Name" : "Test User18", "Age" :
28, "Gender" : "F", "Country" : "India" }
Type "it" for more

In your case, if you type “it” and press Enter, the following will appear:

> it
{ "_id" : ObjectId("52f48eeb74f8fdcfcae84f8d"), "Name" : "Test User19", "Age" :
29, "Gender" : "F", "Country" : "India" }
{ "_id" : ObjectId("52f48eeb74f8fdcfcae84f8e"), "Name" : "Test User20", "Age" :
30, "Gender" : "F", "Country" : "India" }
>
Since only two documents were remaining, it displays the remaining two documents
6.1.4 Inserting by Explicitly Specifying _id

In the previous examples the _id field was not specified, so it was implicitly added. In
the following example, you will see how to explicitly specify the _id field when inserting
documents into a collection. While explicitly specifying the _id field, you have to keep
in mind the uniqueness of the field; otherwise the insert will fail.

The following command explicitly specifies the _id field:

> db.users.insert({"_id":10, "Name": "explicit id"})


The insert operation creates the following document in the users collection:
{ "_id" : 10, "Name" : "explicit id" }
This can be confirmed by issuing the following command:
>db.users.find()

6.1.5 Update

The update() command is used to update the documents in a collection. The
update() method updates a single document by default. If you need to update all
documents that match the selection criteria, you can do so by setting the multi
option to true.

The $set operator will be used for updating the records.

The following command updates the country to UK for female users:

> db.users.update({"Gender":"F"}, {$set:{"Country":"UK"}})
To check whether the update has happened, issue a find command to check all the female
users.
> db.users.find({"Gender":"F"})
{ "_id" : ObjectId("52f48cfb74f8fdcfcae84f7a"), "Name" : "Test User", "Age" : 45
, "Gender" : "F", "Country" : "UK" }
{ "_id" : ObjectId("52f48eeb74f8fdcfcae84f7b"), "Name" : "Test User1", "Age" : 11,
"Gender" : "F", "Country" : "India" }
{ "_id" : ObjectId("52f48eeb74f8fdcfcae84f7c"), "Name" : "Test User2", "Age" : 12,
"Gender" : "F", "Country" : "India" }
...................
Type "it" for more
>
Since only the first matching document was updated, reissue the update command with the multi option so that all matching documents are updated:
>db.users.update({"Gender":"F"},{$set:{"Country":"UK"}},{multi:true})
>
Adding the field

When working in a real-world application, you may come across a schema evolution
where you might end up adding or removing fields from the documents. Let’s see
how to perform these alterations in the MongoDB database.

The update() operation works at the document level, which helps in updating
either a single document or a set of documents within a collection.

Next, let’s look at how to add new fields to the documents. In order to add fields to
the documents, use the update() command with the $set operator and the multi option.

If the field name used with $set does not exist, the field will be added
to the documents.
The following command will add the field company to all the documents:

> db.users.update({},{$set:{"Company":"TestComp"}},{multi:true})
>
Issuing the find command against the users collection shows the new field added
to all documents.

> db.users.find()
{ "Age" : 30, "Company" : "TestComp", "Country" : "US", "FName" : "Test", "Gender" : "M",
"LName" : "User", "_id" : ObjectId("52f48cf474f8fdcfcae84f79") }
{ "Age" : 45, "Company" : "TestComp", "Country" : "UK", "Gender" : "F", "Name" : "Test
User", "_id" : ObjectId("52f48cfb74f8fdcfcae84f7a") }
{ "Age" : 11, "Company" : "TestComp", "Country" : "UK", "Gender" : "F", ....................
Type "it" for more
>

Removing the field

The following command will remove the field Company from all the documents:

> db.users.update({},{$unset:{"Company":""}},{multi:true})

>

6.1.6 Delete

To delete documents in a collection, use the remove() method. If you specify a
selection criterion, only the documents meeting the criteria will be deleted. If no
criterion is specified, all of the documents will be deleted.

The following command will delete the documents where Gender is 'M':

> db.users.remove({"Gender":"M"})
>
The same can be verified by issuing the find() command on Users :
> db.users.find({"Gender":"M"})
>
No documents are returned.
The following command will delete all documents:
> db.users.remove({})
> db.users.find()
Dropping the collection

Finally, if you want to drop the collection, the following command will drop the
collection:

> db.users.drop()

true
>

In order to validate whether the collection is dropped or not, issue the show
collections command.

> show collections

system.indexes

>

6.1.2 Read

In this part of the chapter, you will look at various examples illustrating the querying
functionality available as part of MongoDB that enables you to read the stored data
from the database. In order to start with basic querying, first re-create the users
collection and insert data using the following insert commands:

> user1 = {FName: "Test", LName: "User", Age:30, Gender: "M", Country: "US"}
{
"FName" : "Test",
"LName" : "User",
"Age" : 30,
"Gender" : "M",
"Country" : "US"
}
user2 = {Name: "Test User", Age:45, Gender: "F", Country: "US"}
{ "Name" : "Test User", "Age" : 45, "Gender" : "F", "Country" : "US" }
> db.users.insert(user1)
> db.users.insert(user2)
> for(var i=1; i<=20; i++) db.users.insert({"Name" : "Test User" + i, "Age": 10+i,
"Gender" : "F", "Country" : "India"})

6.1.2.1 Query Documents

A rich query system is provided by MongoDB. Query documents can be
passed as a parameter to the find() method to filter documents within a
collection. A query document is specified within open “{” and closed “}” curly
braces. A query document is matched against all of the documents in the
collection before returning the result set. Using the find() command without
any query document or an empty query document such as find({}) returns all
the documents within the collection. A query document can contain selectors
and projectors.

A selector is like a where condition in SQL or a filter that is used to filter out
the results. A projector is like the select condition or the selection list that is
used to display the data fields.
6.1.2.2 Selector

We will now see how to use the selector. The following command will return
all the female users:

> db.users.find({"Gender":"F"})
{ "_id" : ObjectId("52f4a826958073ea07e15071"), "Name" : "Test User", "Age" : 45,
"Gender" : "F", "Country" : "US" }
.............
{ "_id" : ObjectId("52f4a83f958073ea07e15084"), "Name" : "Test User19", "Age" :29,
"Gender" : "F", "Country" : "India" }
Type "it" for more
>

6.1.2.3. Projector

We have seen how to use selectors to filter out documents within the
collection. In the above example, the find() command returns all fields of the
documents matching the selector.
Let’s add a projector to the query document where, in addition to the selector,
you will also mention specific details or fields that need to be displayed.
Suppose you want to display the name and age of all female users.
In this case, along with the selector, a projector is also used.
Execute the following command to return the desired result set:
> db.users.find({"Gender":"F"}, {"Name":1,"Age":1})
{ "_id" : ObjectId("52f4a826958073ea07e15071"), "Name" : "Test User", "Age" :
45 }
..........
Type "it" for more
>

6.1.2.4 sort( )

In MongoDB, the sort order is specified as follows: 1 for ascending and -1 for
descending sort. If in the above example you want to sort the records by
ascending order of Age, you execute the following command:

>db.users.find({"Gender":"F"}, {"Name":1,"Age":1}).sort({"Age":1})
{ "_id" : ObjectId("52f4a83f958073ea07e15072"), "Name" : "Test User1", "Age" : 11 }
{ "_id" : ObjectId("52f4a83f958073ea07e15073"), "Name" : "Test User2", "Age" : 12 }
{ "_id" : ObjectId("52f4a83f958073ea07e15074"), "Name" : "Test User3", "Age" : 13 }
..............
{ "_id" : ObjectId("52f4a83f958073ea07e15085"), "Name" : "Test User20", "Age" :30 }
Type "it" for more
If you want to display the records in descending order by name and ascending
order by age , you execute the following command:

>db.users.find({"Gender":"F"},{"Name":1,"Age":1}).sort({"Name":-1,"Age":1})

{ "_id" : ObjectId("52f4a83f958073ea07e1507a"), "Name" : "Test User9", "Age" :


19 }

............

{ "_id" : ObjectId("52f4a83f958073ea07e15072"), "Name" : "Test User1", "Age" :


11 }

6.1.2.5 limit( )

You will now look at how you can limit the records in your result set. For example, in
huge collections with thousands of documents, if you want to return only five
matching documents, the limit command is used, which enables you to do exactly
that. Say you want to query the female users who live in either India or the US, but
limit the result set to only two users. The following command needs to be executed:

>db.users.find({"Gender":"F",$or:[{"Country":"India"},{"Country":"US"}]}).limit(2)

{ "_id" : ObjectId("52f4a826958073ea07e15071"), "Name" : "Test User", "Age" : 45,

"Gender" : "F", "Country" : "US" }

{ "_id" : ObjectId("52f4a83f958073ea07e15072"), "Name" : "Test User1", "Age" : 11,

"Gender" : "F", "Country" : "India" }

6.1.2.6 skip( )

If the requirement is to skip the first two records and return the third and fourth user,
the skip command is used. The following command needs to be executed:

>db.users.find({"Gender":"F",$or:[{"Country":"India"}, {"Country":"US"}]}).limit(2).skip(2)

{ "_id" : ObjectId("52f4a83f958073ea07e15073"), "Name" : "Test User2", "Age" : 12,

"Gender" : "F", "Country" : "India" }

{ "_id" : ObjectId("52f4a83f958073ea07e15074"), "Name" : "Test User3", "Age" : 13,

"Gender" : "F", "Country" : "India" }

>
6.1.2.7 findOne( )

Similar to find() is the findOne() command. The findOne() method can take the
same parameters as find(), but rather than returning a cursor, it returns a
single document. Say you want to return one female user who stays in either
India or the US. This can be achieved using the following command:

> db.users.findOne({"Gender":"F"}, {"Name":1,"Age":1})

"_id" : ObjectId("52f4a826958073ea07e15071"),

"Name" : "Test User",

"Age" : 45

>

Similarly, if you want to return the first record irrespective of any selector, you can use
findOne() and it will return the first document in the collection.

> db.users.findOne()
{
    "_id" : ObjectId("52f4a823958073ea07e15070"),
    "FName" : "Test",
    "LName" : "User",
    "Age" : 30,
    "Gender" : "M",
    "Country" : "US"
}

6.1.2.8 Using Cursor

When the find() method is used, MongoDB returns the results of the query as a
cursor object. In order to display the result, the mongo shell iterates over the
returned cursor. MongoDB enables users to work with the cursor object of
the find method. In the next example, you will see how to store the cursor
object in a variable and manipulate it using a while loop. Say you want to
return all the users in the US. In order to do so, you create a variable, assign
the output of find() to the variable (which is a cursor), and then use a while
loop to iterate and print the output.

The code snippet is as follows:

> var c = db.users.find({"Country":"US"})

> while(c.hasNext()) printjson(c.next())

"_id" : ObjectId("52f4a823958073ea07e15070"),

"FName" : "Test",

"LName" : "User",

"Age" : 30,

"Gender" : "M",

"Country" : "US"

6.1.2.9 explain( )

The explain() function can be used to see what steps the MongoDB database
is running while executing a query. Starting from version 3.0, the output
format of the function and the parameter that is passed to the function have
changed. It takes an optional verbosity parameter, which determines what
the explain output should look like. The verbosity modes are
allPlansExecution, executionStats, and queryPlanner; if nothing is specified,
it defaults to queryPlanner.

The following code covers the steps executed when filtering on the Name
field:

> db.users.find({"Name":"Test User"}).explain("allPlansExecution")

"queryPlanner" : {

"plannerVersion" : 1,

"namespace" : "mydbproc.users",

"indexFilterSet" : false,

"parsedQuery" : {
"$and" : [ ]

},

"winningPlan" : {

"stage" : "COLLSCAN",

"filter" : {

"$and" : [ ]

},

"direction" : "forward"

},

"rejectedPlans" : [ ]

"executionStats" : {

"executionSuccess" : true,

"nReturned" : 20,

"executionTimeMillis" : 0,

"totalKeysExamined" : 0,

"totalDocsExamined" : 20,

"executionStages" : {

"stage" : "COLLSCAN",

"filter" : {

"$and" : [ ]

},

"nReturned" : 20,

"executionTimeMillisEstimate" : 0,

"works" : 22,

"advanced" : 20,

"needTime" : 1,
"needFetch" : 0,

"saveState" : 0,

"restoreState" : 0,

"isEOF" : 1,

"invalidates" : 0,

"direction" : "forward",

"docsExamined" : 20

},

"allPlansExecution" : [ ]

},

"serverInfo" : {

"host" : " ANOC9",

"port" : 27017,

"version" : "3.0.4",

"gitVersion" : "534b5a3f9d10f00cd27737fbcd951032248b5952"

},

"ok" : 1

6.1.3 Using Indexes

Indexes are used to provide high performance read operations for queries that are
used frequently. By default, whenever a collection is created and documents are
added to it, an index is created on the _id field of the document. In this section, you
will look at how different types of indexes can be created. Let’s begin by inserting 1
million documents using a for loop into a new collection called testindx.

>for(i=0;i<1000000;i++){db.testindx.insert({"Name":"user"+i,"Age":Math.floor(Math.

random()*120)})}

Next, issue the find() command to fetch the document whose Name is user101. Run
the explain() command to check what steps MongoDB is executing in order to return
the result set.
> db.testindx.find({"Name":"user101"}).explain("allPlansExecution")

"queryPlanner" : {

"plannerVersion" : 1,

"namespace" : "mydbproc.testindx",

"indexFilterSet" : false,

"parsedQuery" : {

"Name" : {

"$eq" : "user101"

},

"winningPlan" : {

"stage" : "COLLSCAN",

"filter" : {

"Name" : {

"$eq" : "user101"

},

"direction" : "forward"

},

"rejectedPlans" : [ ]

},

"executionStats" : {

"executionSuccess" : true,

"nReturned" : 1,

"executionTimeMillis" : 645,

"totalKeysExamined" : 0,
"totalDocsExamined" : 1000000,

"executionStages" : {

"stage" : "COLLSCAN",

"filter" : {

"Name" : {

"$eq" : "user101"

},

"nReturned" : 1,

"executionTimeMillisEstimate" : 20,

"works" : 1000002,

"advanced" : 1,

"needTime" : 1000000,

"needFetch" : 0,

"saveState" : 7812,

"restoreState" : 7812,

"isEOF" : 1,

"invalidates" : 0,

"direction" : "forward",

"docsExamined" : 1000000

},

"allPlansExecution" : [ ]

},

"serverInfo" : {

"host" : " ANOC9",

"port" : 27017,

"version" : "3.0.4",
"gitVersion" : "534b5a3f9d10f00cd27737fbcd951032248b5952"

},

"ok" : 1

6.1.3.1 Single Key Index


Let’s create an index on the Name field of the document. Use ensureIndex() to
create the index.
> db.testindx.ensureIndex({"Name":1})
The index creation will take a few minutes depending on the server and the
collection size. Let’s run the same query that you ran earlier with explain() to
check what steps the database executes after index creation. Check the
nReturned, totalKeysExamined/totalDocsExamined, and executionTimeMillis fields in the output.
> db.testindx.find({"Name":"user101"}).explain("allPlansExecution")
{
"queryPlanner" : {
"plannerVersion" : 1,
"namespace" : "mydbproc.testindx",
"indexFilterSet" : false,
"parsedQuery" : {
"Name" : {
"$eq" : "user101"
}
},
"winningPlan" : {
"stage" : "FETCH",
"inputStage" : {
"stage" : "IXSCAN",
"keyPattern" : {
"Name" : 1
},
"indexName" : "Name_1",
"isMultiKey" : false,
"direction" : "forward",

"indexBounds" : {
"Name" : [
"[\"user101\", \"user101\"]"
]
}
}
},
"rejectedPlans" : [ ]
},
"executionStats" : {
"executionSuccess" : true,
"nReturned" : 1,
"executionTimeMillis" : 0,
"totalKeysExamined" : 1,
"totalDocsExamined" : 1,
"executionStages" : {
"stage" : "FETCH",
"nReturned" : 1,
"executionTimeMillisEstimate" : 0,
"works" : 2,
"advanced" : 1,
"needTime" : 0,
"needFetch" : 0,
"saveState" : 0,
"restoreState" : 0,
"isEOF" : 1,
"invalidates" : 0,
"docsExamined" : 1,
"alreadyHasObj" : 0,
"inputStage" : {
"stage" : "IXSCAN",
"nReturned" : 1,
"executionTimeMillisEstimate" : 0,
"works" : 2,
"advanced" : 1,
"needTime" : 0,
"needFetch" : 0,
"saveState" : 0,
"restoreState" : 0,
"isEOF" : 1,
"invalidates" : 0,
"keyPattern" : {
"Name" : 1
},
"indexName" : "Name_1",
"isMultiKey" : false,
"direction" : "forward",
"indexBounds" : {
"Name" : [
"[\"user101\", \"user101\"]"
]
},

"keysExamined" : 1,
"dupsTested" : 0,
"dupsDropped" : 0,
"seenInvalidated" : 0,
"matchTested" : 0
}
},
"allPlansExecution" : [ ]
},
"serverInfo" : {
"host" : "ANOC9",
"port" : 27017,
"version" : "3.0.4",
"gitVersion" : "534b5a3f9d10f00cd27737fbcd951032248b5952"
},
"ok" : 1
}
>

6.1.3.2 Compound Index

When creating an index, you should keep in mind that the index covers most of your
queries. If you sometimes query only the Name field and at times you query both the
Name and the Age field, creating a compound index on the Name and Age fields will
be more beneficial than an index that is created on either of the fields because the
compound index will cover both queries.

The following command creates a compound index on fields Name and Age of the
collection testindx .

> db.testindx.ensureIndex({"Name":1, "Age": 1})

Compound indexes help MongoDB execute queries with multiple clauses more
efficiently. When creating a compound index, it is also very important to keep in
mind that the fields that will be used for exact matches (e.g. Name : "S1" ) come first,
followed by fields that are used in ranges (e.g. Age : {"$gt":20} ).

Hence the above index will be beneficial for the following query:

>db.testindx.find({"Name": "user5","Age":{"$gt":25}}).explain("allPlansExecution")

"queryPlanner" : {

"plannerVersion" : 1,

"namespace" : "mydbproc.testindx",

"indexFilterSet" : false,

"parsedQuery" : {

"$and" : [

"Name" : {

"$eq" : "user5"

},

"Age" : {

"$gt" : 25

},

"winningPlan" : {

"stage" : "KEEP_MUTATIONS",

"inputStage" : {

"stage" : "FETCH",

"filter" : {
"Age" : {

"$gt" : 25

},

............................

"indexBounds" : {

"Name" : [

"[\"user5\", \"user5\"

},

"rejectedPlans" : [

"stage" : "FETCH",

......................................................

"indexName" : "Name_1_Age_1",

"isMultiKey" : false,

"direction" : "forward",

.....................................................

"executionStats" : {

"executionSuccess" : true,

"nReturned" : 1,

"executionTimeMillis" : 0,

"totalKeysExamined" : 1,

"totalDocsExamined" : 1,

.....................................................

"inputStage" : {

"stage" : "FETCH",

"filter" : {
"Age" : {

"$gt" : 25

},

"nReturned" : 1,

"executionTimeMillisEstimate" : 0,

"works" : 2,

"advanced" : 1,

"allPlansExecution" : [

"nReturned" : 1,

"executionTimeMillisEstimate" : 0,

"totalKeysExamined" : 1,

"totalDocsExamined" : 1,

"executionStages" : {

.............................................................

"serverInfo" : {

"host" : " ANOC9",

"port" : 27017,

"version" : "3.0.4",

"gitVersion" : "534b5a3f9d10f00cd27737fbcd951032248b5952"

},

"ok" : 1

>

6.1.3.3 Support for sort Operations

In MongoDB, a sort operation that uses an indexed field to sort documents provides
the greatest performance. As in other databases, indexes in MongoDB have an order;
if an index is used to access documents, it returns results in the same order as the
index. A compound index needs to be created when sorting on multiple fields. In a
compound index, the output can be in the sorted order of either an index prefix or
the full index. An index prefix is a subset of the compound index that contains one
or more fields from the start of the index. For example, {x: 1} and {x: 1, y: 1} are
index prefixes of the compound index { x: 1, y: 1, z: 1 }.

The sort operation can be on any of the index prefixes, like {x: 1} or {x: 1, y: 1}.
A compound index can only help with sorting if the sort keys form a prefix of
the index.

For example, a compound index on Age , Name , and Class, like

> db.testindx.ensureIndex({"Age": 1, "Name": 1, "Class": 1})

will be useful for the following queries:

> db.testindx.find().sort({"Age":1})

> db.testindx.find().sort({"Age":1,"Name":1})

> db.testindx.find().sort({"Age":1,"Name":1, "Class":1})

The above index won’t be of much help in the following query:

> db.testindx.find().sort({"Gender":1, "Age":1, "Name": 1})

You can diagnose how MongoDB is processing a query by using the explain()
command.
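For example (a brief sketch, assuming the compound index on Age, Name, and Class created above), you would typically expect the first query below to use the index for sorting, while the second falls back to a collection scan with an in-memory SORT stage:

> db.testindx.find().sort({"Age":1}).explain("queryPlanner")
> db.testindx.find().sort({"Gender":1, "Age":1}).explain("queryPlanner")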

6.1.3.4 Unique Index

Creating an index on a field doesn’t ensure uniqueness, so if an index is created
on the Name field, then two or more documents can have the same names.
However, if uniqueness is one of the constraints that needs to be enforced, the
unique property needs to be set to true when creating the index.

First, let’s drop the existing indexes.

>db.testindx.dropIndexes()

The following command will create a unique index on the Name field of the
testindx collection:

> db.testindx.ensureIndex({"Name":1},{"unique":true})
Now if you try to insert duplicate names in the collection as shown below,
MongoDB returns an error

and does not allow insertion of duplicate records:

> db.testindx.insert({"Name":"uniquename"})

> db.testindx.insert({"Name":"uniquename"})

"E11000 duplicate key error index: mydbpoc.testindx.$Name_1 dup key: { :


"uniquename" }"

If you check the collection, you’ll see that only the first uniquename was stored.

> db.testindx.find({"Name":"uniquename"})

{ "_id" : ObjectId("52f4b3c3958073ea07f092ca"), "Name" : "uniquename" }

>

Uniqueness can be enabled for compound indexes also, which means that
although individual fields

can have duplicate values, the combination will always be unique.

For example, if you have a unique index on {"name":1, "age":1} ,

> db.testindx.ensureIndex({"Name":1, "Age":1},{"unique":true})

>

then the following inserts will be permissible:

> db.testindx.insert({"Name":"usercit"})

> db.testindx.insert({"Name":"usercit", "Age":30})

However, if you execute

> db.testindx.insert({"Name":"usercit", "Age":30})

it’ll throw an error like

E11000 duplicate key error index: mydbpoc.testindx.$Name_1_Age_1 dup key: { : "usercit", : 30.0 }


6.1.3.5 system.indexes

Whenever you create a database, by default a system.indexes collection is


created. All of the information about a database’s indexes is stored in the
system.indexes collection. This is a reserved collection, so you cannot modify
its documents or remove documents from it. You can manipulate it only
through ensureIndex and the dropIndexes database commands. Whenever an
index is created, its meta information can be seen in system.indexes . The
following command can be used to fetch all the index information about the
mentioned collection:

db.collectionName.getIndexes()

For example, the following command will return all indexes created on the
testindx collection:

> db.testindx.getIndexes()

6.1.3.6 dropIndex

The dropIndex command is used to remove the index.

The following command will remove the Name field index from the testindx
collection:

> db.testindx.dropIndex({"Name":1})

{ "nIndexesWas" : 3, "ok" : 1 }

>

6.1.3.7 ReIndex

When you have performed a number of insertions and deletions on the
collection, you may have to rebuild the indexes so that they can be used
optimally. The reIndex command is used to rebuild the indexes. The following
command rebuilds all the indexes of a collection: it first drops the indexes,
including the default index on the _id field, and then rebuilds them.
db.collectionname.reIndex()
db.collectionname.reIndex()

The following command rebuilds the indexes of the testindx collection:

> db.testindx.reIndex()
{
    "nIndexesWas" : 2,
    "msg" : "indexes dropped for collection",
    "nIndexes" : 2,
    ..............
    "ok" : 1
}
>

6.1.3.8 How Indexing Works

MongoDB stores indexes in a BTree structure, so range queries are
automatically supported. If multiple selection criteria are used in a query,
MongoDB tries to find the best single index to select a candidate set. After
that, it sequentially iterates through the set to evaluate the other criteria.

When the query is executed for the first time, MongoDB creates multiple
execution plans, one for each index that is available for the query. It lets the
plans execute in turns, within a certain number of ticks, until the plan that
executes the fastest finishes. The result is then returned to the system, which
remembers the index that was used by the fastest execution plan.

For subsequent queries, the remembered index will be used until a certain
number of updates has happened within the collection. After the update limit
is crossed, the system will again follow the process to find out the best index
that is applicable at that time. The reevaluation of the query plans will happen
when any of the following events occurs:

• The collection receives 1,000 write operations.

• An index is added or dropped.

• A restart of the mongod process happens.

• A reindexing for rebuilding the index happens.
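If you want to override the optimizer's remembered choice and force a particular index, you can pass it to hint(). A brief sketch, assuming the compound index {"Name":1, "Age":1} created earlier still exists:

> db.testindx.find({"Name":"user101", "Age":{"$gt":25}}).hint({"Name":1, "Age":1}).explain("queryPlanner")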

6.2 Stepping Beyond the Basics

This section covers advanced querying using conditional operators and regular
expressions in the selector part. Each of these operators and regular
expressions provides you with more control over the queries you write and
consequently over the information you can fetch from the MongoDB database.

6.2.1 Using Conditional Operators


Conditional operators enable you to have more control over the data you are
trying to extract from the database.
In this section, you will be focusing on the following operators: $lt , $lte , $gt ,
$gte , $in , $nin , and $not .
The following example assumes a collection named Students that contains the
following types of documents:
{
_id: ObjectId(),
Name: "Full Name",
Age: 30,
Gender: "M",
Class: "C1",
Score: 95
}

You will first create the collection and insert few sample documents.

>db.students.insert({Name:"S1",Age:25,Gender:"M",Class:"C1",Score:95})
>db.students.insert({Name:"S2",Age:18,Gender:"M",Class:"C1",Score:85})
>db.students.insert({Name:"S3",Age:18,Gender:"F",Class:"C1",Score:85})
>db.students.insert({Name:"S4",Age:18,Gender:"F",Class:"C1",Score:75})
>db.students.insert({Name:"S5",Age:18,Gender:"F",Class:"C2",Score:75})
>db.students.insert({Name:"S6",Age:21,Gender:"M",Class:"C2",Score:100})
>db.students.insert({Name:"S7",Age:21,Gender:"M",Class:"C2",Score:100})
>db.students.insert({Name:"S8",Age:25,Gender:"F",Class:"C2",Score:100})
>db.students.insert({Name:"S9",Age:25,Gender:"F",Class:"C2",Score:90})
>db.students.insert({Name:"S10",Age:28,Gender:"F",Class:"C3",Score:90})
> db.students.find()
{ "_id" : ObjectId("52f874faa13cd6a65998734d"), "Name" : "S1", "Age" : 25,
"Gender" : "M",
"Class" : "C1", "Score" : 95 }
.......................
{ "_id" : ObjectId("52f8758da13cd6a659987356"), "Name" : "S10", "Age" : 28,
"Gender" : "F",
"Class" : "C3", "Score" : 90 }

6.2.1.1 $lt and $lte


Let’s start with the $lt and $lte operator. They stand for “less than” and “less
than or equal to ” respectively. If you want to find all students who are younger
than 25 (Age < 25), you can execute the following find
with a selector:
> db.students.find({"Age":{"$lt":25}})
{ "_id" : ObjectId("52f8750ca13cd6a65998734e"), "Name" : "S2", "Age" : 18,
"Gender" : "M",
"Class" : "C1", "Score" : 85 }
.............................
{ "_id" : ObjectId("52f87556a13cd6a659987353"), "Name" : "S7", "Age" : 21,
"Gender" : "M",
"Class" : "C2", "Score" : 100 }
>

If you want to find all students whose age is less than or equal to 25 (Age <= 25),
execute the following:
> db.students.find({"Age":{"$lte":25}})
{ "_id" : ObjectId("52f874faa13cd6a65998734d"), "Name" : "S1", "Age" : 25,
"Gender" : "M",
"Class" : "C1", "Score" : 95 }
....................
{ "_id" : ObjectId("52f87578a13cd6a659987355"), "Name" : "S9", "Age" : 25,
"Gender" : "F",
"Class" : "C2", "Score" : 90 }
>

6.2.1.2 $gt and $gte


The $gt and $gte operators stand for “greater than” and “greater than or
equal to,” respectively.
Let’s find out all of the students with Age > 25. This can be achieved by
executing the following command:
> db.students.find({"Age":{"$gt":25}})
{ "_id" : ObjectId("52f8758da13cd6a659987356"), "Name" : "S10", "Age" : 28,
"Gender" : "F",
"Class" : "C3", "Score" : 90 }
>
If you change the above example to return students with Age >= 25 , then the
command is
> db.students.find({"Age":{"$gte":25}})
{ "_id" : ObjectId("52f874faa13cd6a65998734d"), "Name" : "S1", "Age" : 25,
"Gender" : "M",
"Class" : "C1", "Score" : 95 }
......................................
{ "_id" : ObjectId("52f8758da13cd6a659987356"), "Name" : "S10", "Age" : 28,
"Gender" : "F",
"Class" : "C3", "Score" : 90 }
>
6.2.1.3 $in and $nin
Let’s find all students who belong to either class C1 or C2 . The command for
the same is
> db.students.find({"Class":{"$in":["C1","C2"]}})
{ "_id" : ObjectId("52f874faa13cd6a65998734d"), "Name" : "S1", "Age" : 25,
"Gender" : "M",
"Class" : "C1", "Score" : 95 }
................................
{ "_id" : ObjectId("52f87578a13cd6a659987355"), "Name" : "S9", "Age" : 25,
"Gender" : "F",
"Class" : "C2", "Score" : 90 }
>
The inverse of this can be returned by using $nin .
Let’s next find students who don’t belong to class C1 or C2 . The command is
> db.students.find({"Class":{"$nin":["C1","C2"]}})
{ "_id" : ObjectId("52f8758da13cd6a659987356"), "Name" : "S10", "Age" : 28,
"Gender" : "F",
"Class" : "C3", "Score" : 90 }
>
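The section also listed the $not operator, which negates another operator expression. A small sketch against the same students collection:

> db.students.find({"Age": {"$not": {"$gt": 21}}})

This returns the students whose Age is not greater than 21 (and, in general, any documents that have no Age field at all).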

6.2.2 Regular Expressions


In this section, you will look at how to use regular expressions. Regular
expressions are useful in scenarios where you want to find, say, all students whose
names start with a particular string. In order to understand this, let’s add a few
more students with different names.
> db.students.insert({Name:"Student1", Age:30, Gender:"M", Class: "Biology",
Score:90})
> db.students.insert({Name:"Student2", Age:30, Gender:"M", Class: "Chemistry",
Score:90})
> db.students.insert({Name:"Test1", Age:30, Gender:"M", Class: "Chemistry",
Score:90})
> db.students.insert({Name:"Test2", Age:30, Gender:"M", Class: "Chemistry",
Score:90})
> db.students.insert({Name:"Test3", Age:30, Gender:"M", Class: "Chemistry",
Score:90})
>
Say you want to find all students with names starting with “St” or “Te” and whose
class begins with “Che”. The same can be filtered using regular expressions, like so:
> db.students.find({"Name":/(St|Te)*/i, "Class":/(Che)/i})
{ "_id" : ObjectId("52f89ecae451bb7a56e59086"), "Name" : "Student2", "Age" : 30,
"Gender" : "M", "Class" : "Chemistry", "Score" : 90 }
.........................
{ "_id" : ObjectId("52f89f06e451bb7a56e59089"), "Name" : "Test3", "Age" : 30,
"Gender" : "M", "Class" : "Chemistry", "Score" : 90 }
>
In order to understand how the regular expression works, let’s take the query
"Name":/(St|Te)*/i.
The i flag indicates that the regex is case insensitive.
(St|Te) matches the strings “St” or “Te”; the * quantifier means zero or more
occurrences, so as written the pattern actually matches any name. To restrict the
match to names that really begin with “St” or “Te”, anchor the pattern to the start of
the string, as in /^(St|Te)/i.
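The same filter can also be written with the $regex operator; anchoring with ^ makes the "starts with" intent explicit. A sketch (it produces the same result for this data set):

> db.students.find({"Name": {"$regex": "^(St|Te)", "$options": "i"},
                    "Class": {"$regex": "Che", "$options": "i"}})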

6.2.3 MapReduce
The MapReduce framework enables division of the task, which in this case is data
aggregation across a cluster of computers in order to reduce the time it takes to
aggregate the data set. It consists of two parts: Map and Reduce. Here’s a more
specific description: MapReduce is a framework that is used to process problems that
are highly distributable across enormous datasets and are run using multiple nodes.
If all the nodes have the same hardware, these nodes are collectively referred to as a
cluster; otherwise, it’s referred to as a grid. This processing can occur on structured data
(data stored in a database) and unstructured data (data stored in a file system).
• “Map”: In this step, the node that is acting as the master takes the input parameter
and divides the big problem into multiple small sub-problems. These sub-problems
are then distributed across the worker nodes. The worker nodes might further divide
the problem into sub-problems. This leads to a multi-level tree structure. The worker
nodes will then work on the sub-problems within them and return the answer back
to the master node.
• “Reduce”: In this step, all the sub-problems’ answers are available with the master
node, which then combines them and produces the final output, which is
the answer to the big problem you were trying to solve.
In order to understand how it works, let’s consider a small example where you will
find out the number of male and female students in your collection.
This involves the following steps: first you create the map and reduce functions and
then you call the mapReduce function and pass the necessary arguments.
Let’s start by defining the map function:
> var map = function(){emit(this.Gender,1);};
>
The map function takes each document as input and, based on the Gender field,
emits key-value pairs of the form {"F", 1} or {"M", 1}.
Next, you create the reduce function:
> var reduce = function(key, value){return Array.sum(value);};
This groups the values emitted by the map function by the key field, which in your
example is Gender, and returns the sum of the values, each of which was emitted as 1.
The output of the reduce function defined above is a gender-wise count.
Finally, you put them together using the mapReduce function, like so:
> db.students.mapReduce(map, reduce, {out: "mapreducecount1"})
{
    "result" : "mapreducecount1",
    "timeMillis" : 29,
    "counts" : {
        "input" : 15,
        "emit" : 15,
        "reduce" : 2,
        "output" : 2
    },
    "ok" : 1
}

This applies the map and reduce functions you defined to the students collection.
The final result is stored in a new collection called mapreducecount1.
In order to verify it, run the find() command on the mapreducecount1 collection, as
shown:
> db.mapreducecount1.find()
{ "_id" : "F", "value" : 6 }
{ "_id" : "M", "value" : 9 }
>
Here’s one more example to explain the workings of MapReduce. Let’s use
MapReduce to find out a class-wise average score. As you saw in the above example,
you first create the map function, then the reduce function, and finally you combine
them and store the output in a collection in your database. The code snippet is
> var map_1 = function(){emit(this.Class,this.Score);};
> var reduce_1 = function(key, value){return Array.avg(value);};
>db.students.mapReduce(map_1,reduce_1, {out:"MR_ClassAvg_1"})
{
"result" : "MR_ClassAvg_1",
"timeMillis" : 4,
"counts" : {
"input" : 15, "emit" : 15,
"reduce" : 3 , "output" : 5
},
"ok" : 1,
}
> db.MR_ClassAvg_1.find()
{ "_id" : "Biology", "value" : 90 }
{ "_id" : "C1", "value" : 85 }
{ "_id" : "C2", "value" : 93 }
{ "_id" : "C3", "value" : 90 }
{ "_id" : "Chemistry", "value" : 90 }
>
The first step is to define the map function, which loops through the collection
documents and emits output of the form {"Class": Score}, for example {"C1": 95}. The
second step groups by class and computes the average of the scores for that class.
The third step combines the results; it defines the collection to which the map and
reduce functions need to be applied and finally it defines where to store the output,
which in this case is a new collection called MR_ClassAvg_1. In the last step, you use
find in order to check the resulting output.

6.2.4 aggregate()
The previous section introduced the MapReduce function. In this section, you will get
a glimpse of the aggregation framework of MongoDB. The aggregation framework
enables you to find out aggregate values without using the MapReduce
function. Performance-wise, the aggregation framework is faster than the
MapReduce function. Keep in mind that MapReduce is meant for a batch
approach and not for real-time analysis.
Next, let’s reproduce the two outputs above using the aggregate function. First, find
the count of male and female students. This can be achieved by executing
the following command:
> db.students.aggregate({$group:{_id:"$Gender", totalStudent: {$sum: 1}}})
{ "_id" : "F", "totalStudent" : 6 }
{ "_id" : "M", "totalStudent" : 9 }
>
Similarly, in order to find out the class-wise average score, the following command
can be executed:
> db.students.aggregate({$group:{_id:"$Class", AvgScore: {$avg: "$Score"}}})
{ "_id" : "Biology", "AvgScore" : 90 }
{ "_id" : "C3", "AvgScore" : 90 }
{ "_id" : "Chemistry", "AvgScore" : 90 }
{ "_id" : "C2", "AvgScore" : 93 }
{ "_id" : "C1", "AvgScore" : 85 }

6.3 Designing an Application’s Data Model

The MongoDB database provides two options for designing a data model: you can
either embed related objects within one another, or reference them from one
another using IDs. In this section, you will explore these options. In order to
understand them, you will design a blogging application and demonstrate the usage
of the two options.

6.3.1 Relational Data Modeling and Normalization


Before jumping into MongoDB’s approach, let’s take a little detour into how you
would model this in a relational database such as SQL. In relational databases, the
data modelling typically progresses by defining the tables and gradually removing
data redundancy to achieve a normal form.

6.3.1.1 What Is a Normal Form?


In relational databases, normalization typically begins by creating tables as per the
application requirement and then gradually removing redundancy to achieve the
highest normal form, which is also termed the third normal form or 3NF. In order to
understand this better, let’s put the blogging application data in tabular form.

This data is actually in the first normal form. You will have lots of redundancy
because you can have multiple comments against the posts and multiple tags can be
associated with the post. The problem with redundancy, of course, is that it
introduces the possibility of inconsistency, where various copies of the same data
may have different values. To remove this redundancy, you need to further normalize
the data by splitting it into multiple tables. As part of this step, you must identify a
key column that uniquely identifies each row in the table so that you can create links
between the tables. The above scenario, when modelled using 3NF, will look like the
RDBMS diagram shown in Figure 6-3.
6.3.1.2 The Problem with Normal Forms
As mentioned, the nice thing about normalization is that it allows for easy updating
without any redundancy(i.e. it helps keep the data consistent). Updating a user name
means updating the name in the Users table. However, a problem arises when you
try to get the data back out. For instance, to find all tags and comments associated
with posts by a specific user, the relational database programmer uses a JOIN. By
using a JOIN, the database returns all data as per the application screen design, but
the real problem is what operation the database performs to get that result set.
Generally, any RDBMS reads from a disk and does a seek, which takes well over 99%
of the time spent reading a row. When it comes to disk access, random seeks are the
enemy. The reason why this is so important in this context is because JOINs typically
require random seeks. The JOIN operation is one of the most expensive operations
within a relational database. Additionally, if you end up needing to scale your
database to multiple servers, you introduce the problem of generating a distributed
join, a complex and generally slow operation.
6.3.2 MongoDB Document Data Model Approach
In MongoDB, data is stored in documents. Fortunately for us as application
designers, this opens some new possibilities in schema design. Unfortunately for us,
it also complicates our schema design process. Now when faced with a schema
design problem there’s no longer a fixed path of normalized database design, as
there is with relational databases. In MongoDB, the schema design depends on the
problem you are trying to solve. If you must model the above using the MongoDB
document model, you might store the blog data in a document as follows:
{
"_id" : ObjectId("509d27069cc1ae293b36928d"),
"title" : "Sample title",
"body" : "Sample text.",
"tags" : [
"Tag1",
"Tag2",
"Tag3",
"Tag4"
],
"created_date" : ISODate("2015-07-06T12:41:39.110Z"),
"author" : "Author 1",
"category_id" : ObjectId("509d29709cc1ae293b369295"),
"comments" : [
{
"subject" : "Sample comment",
"body" : "Comment Body",
"author " : "author 2",
"created_date":ISODate("2015-07-06T13:34:23.929Z")
}
]
}
As you can see, you have embedded the comments and tags within a single
document only.
Alternatively, you could “normalize” the model a bit by referencing the comments
and tags by the id field:
// Authors document:
{
"_id": ObjectId("509d280e9cc1ae293b36928e "),
"name": "Author 1",}
// Tags document:
{
"_id": ObjectId("509d35349cc1ae293b369299"),
"TagName": "Tag1",.....}
// Comments document:
{
"_id": ObjectId("509d359a9cc1ae293b3692a0"),
"Author": ObjectId("508d27069cc1ae293b36928d"),
.......
"created_date" : ISODate("2015-07-06T13:34:59.336Z")
}
//Category Document
{
"_id": ObjectId("509d29709cc1ae293b369295"),
"Category": "Catgeory1"......
}
//Posts Document
{
"_id" : ObjectId("509d27069cc1ae293b36928d"),
"title" : "Sample title","body" : "Sample text.",
"tags" : [ ObjectId("509d35349cc1ae293b369299"),
ObjectId("509d35349cc1ae293b36929c")
],
"created_date" : ISODate("2015-07-06T13:41:39.110Z"),
"author_id" : ObjectId("509d280e9cc1ae293b36928e"),
"category_id" : ObjectId("509d29709cc1ae293b369295"),
"comments" : [
ObjectId("509d359a9cc1ae293b3692a0"),
]}

6.3.2.2 Embedding
Embedding can be useful when you want to fetch some set of data and display it on
the screen, such as a page that displays comments associated with the blog; in this
case the comments can be embedded in the Blogs document. The benefit of this
approach is that since MongoDB stores the documents contiguously on disk, all the
related data can be fetched in a single seek.
Apart from this, since JOINs are not supported and you used referencing in this case,
the application
might do something like the following to fetch the comments data associated with
the blog.
1. Fetch the associated comments _id from the blogs document.
2. Fetch the comments document based on the comments_id found in the first step.
If you take this approach, which is referencing, not only does the database have to
do multiple seeks to
find your data, but additional latency is introduced into the lookup since it now takes
two round trips to the database to retrieve your data.
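A minimal sketch of those two round trips, assuming the posts and comments collections shown above (the _id value is the one from the sample Posts document):

// Step 1: fetch the referenced comment ids from the post
> var post = db.posts.findOne({"_id": ObjectId("509d27069cc1ae293b36928d")}, {"comments": 1})
// Step 2: fetch the referenced comment documents in a second round trip
> db.comments.find({"_id": {"$in": post.comments}})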
If the application frequently accesses the comments data along with the blogs, then
almost certainly embedding the comments within the blog documents will have a
positive impact on the performance.
Another concern that weighs in favor of embedding is the desire for atomicity and
isolation in writing data. MongoDB is designed without multi-document
transactions. In MongoDB, atomicity is provided only at the single document
level, so data that needs to be updated together atomically needs to be placed
together in a single document. When you update data in your database, you must
ensure that your update either succeeds or fails entirely, never having a “partial
success,” and that no other database reader ever sees an incomplete write
operation.

6.3.2.3 Referencing

We have seen that embedding is the approach that will provide the best
performance in many cases; it also provides data consistency guarantees. However, in
some cases, a more normalized model works better in MongoDB.
One reason for having multiple collections and adding references is the increased
flexibility it gives when querying the data. Let’s understand this with the blogging
example mentioned above. You saw how to use embedded schema, which will work
very well when displaying all the data together on a single page (i.e. the page that
displays the blog post followed by all of the associated comments). Now suppose
you have a requirement to search for the comments posted by a particular user. The
query (using this embedded schema) would be as follows:

db.posts.find({'comments.author': 'author2'},{'comments': 1})


The result of this query, then, would be documents of the following form:
{
"_id" : ObjectId("509d27069cc1ae293b36928d"),
"comments" : [ {
"subject" : "Sample Comment 1",
"body" : "Comment1 Body.",
"author" : "author2",
"created_date" : ISODate("2015-07-06T13:34:23.929Z")}...]
}
{
"_id" : ObjectId("509d27069cc1ae293b36928d"),
"comments" : [
{
"subject" : "Sample Comment 2",
"body" : "Comments Body.",
"author" : "author2",
"created_date" : ISODate("2015-07-06T13:34:23.929Z")
}...]}
6.3.3 Decisions of Data Modelling
This involves deciding how to structure the documents so that the data is modeled
effectively. An important point to decide is whether you need to embed the data or
use references to the data (i.e. whether to use embedding or referencing).
The decision depends on the number of comments expected per post and on how frequently the read vs. write operations will be performed.

6.3.3. 1 Operational Considerations


In addition to the way the elements interact with each other (i.e. whether to store the
documents in an embedded manner or use references), a number of other
operational factors are important when designing a data model for the application.
These factors are covered in the following sections.
6.3.3.2 Data Lifecycle Management
This feature needs to be used if your application has datasets that need to be
persisted in the database only for a limited time period.
This is implemented by using the Time to Live (TTL) feature of the collection. The TTL
feature of the collection ensures that the documents are expired after a period of
time. Additionally, if the application requirement is to work with only the recently
inserted documents, using capped collections will help optimize the performance.
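As an illustration, a TTL index can be created from the mongo shell as shown below; the sessions collection and the 24-hour expiry are hypothetical values chosen only for the sketch:
// Documents are removed roughly 86400 seconds (24 hours) after their created_date
db.sessions.createIndex({"created_date" : 1}, {expireAfterSeconds : 86400})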

6.3.3.3 Indexes
Indexes can be created to support commonly used queries to increase the
performance. By default, an index is created by MongoDB on the id field.
The following are a few points to consider when creating indexes:
• At least 8KB of data space is required by each index.
• For write operations, an index addition has some negative performance impact.
Hence for collections with heavy writes, indexes might be expensive because for
each insert, the keys must be added to all the indexes.
• Indexes are beneficial for collections with heavy read operations such as where the
proportion of read-to-write operations is high. The un-indexed read operations are
not affected by an index.
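For example, to support the earlier query that searches comments by author, an index could be created as follows (the collection and field names are taken from the blog example above):
db.posts.createIndex({"comments.author" : 1})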

6.3.3.4 Sharding
One of the important factors when designing the application model is whether to
partition the data or not.This is implemented using sharding in MongoDB.
Sharding is also referred to as partitioning of data. In MongoDB, a collection is partitioned with its documents distributed across a cluster of machines, which are referred to as shards. This can have a significant impact on the performance.

6.3.3.5 A Large Number of Collections


The design considerations for having multiple collections vs. storing data in a single
collection are the following:
• There is no performance penalty in choosing multiple collections for storing data.
• Having distinct collections for different types of data can have performance
improvements in high-throughput batch processing applications.
When you are designing models that have a large number of collections, you need to
take into consideration the following behaviors:
• A certain minimum overhead of few kilobytes is associated with each collection.
• At least 8KB of data space is required by each index, including the id index.

6.3.3.6 Growth of the Document


A few updates, such as pushing an element to an array, adding new fields, etc., can lead to an increase in the document size, which can lead to the movement of the document from one slot to another so that it fits. This process of
document relocation is both resource and time consuming. Although MongoDB
provides padding to minimize the relocation occurrences, you may need to handle
the document growth manually.
Chapter 7 MongoDB Architecture
7.1 Core Processes
The core components in the MongoDB package are
• mongod , which is the core database process
• mongos , which is the controller and query router for sharded clusters
• mongo , which is the interactive MongoDB shell

https://www.youtube.com/watch?v=qc0jRBwa0WU

7.1.1 mongod
The primary daemon in a MongoDB system is known as mongod. This daemon
handles all the data requests, manages the data format, and performs operations for
background management. When a mongod is run without any arguments, it uses the default data directory, which is C:\data\db or /data/db, and the default port 27017, where it listens for socket connections.
It’s important to ensure that the data directory exists, and you have write permissions
to the directory before the mongod process is started.
If the directory doesn’t exist or you don’t have write permissions on the directory, the
start of this process will fail. If the default port 27017 is not available, the server will
fail to start. mongod also has a HTTP server which listens on a port 1000 higher than
the default port, so if you started the mongod with the default port 27017, in this
case the HTTP server will be on port 28017 and will be accessible using the URL
http://localhost:28017 . This basic HTTP server provides administrative information
about the database.
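As a sketch, a mongod can be started with an explicit data directory and port as follows (the path shown is only an example; adjust it for your system):
C:\practicalmongodb\bin>mongod --dbpath C:\data\db --port 27017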
7.1.2 mongo
mongo provides an interactive JavaScript interface for the developer to test queries
and operations directly on the database and for the system administrators to
manage the database. This is all done via the command line. When the mongo shell
is started, it will connect to the default database called test . This database
connection value is assigned to the global variable db. As a developer or administrator you need to change the database from test to your database after the first connection is made. You can do this with the use <databasename> command.
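For example, after starting the shell you might switch to a hypothetical database named myblogdb as follows:
> use myblogdb
switched to db myblogdb
> db
myblogdb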

7.1.3 mongos
mongos is used in MongoDB sharding. It acts as a routing service that processes
queries from the application layer and determines where in the sharded cluster the
requested data is located. We will discuss mongos in more detail in the sharding
section. Right now you can think of mongos as the process that routes the queries to
the correct server holding the data.

7.1.4 MongoDB Tools


Apart from the core services, there are various tools that are available as part of the
MongoDB installation:
• mongodump : This utility is used as part of an effective backup strategy. It creates a
binary export of the database contents.
• mongorestore : The binary database dump created by the mongodump utility is
imported to a new or an existing database using the mongorestore utility.
• bsondump : This utility converts the BSON files into human-readable formats
such as JSON and CSV. For example, this utility can be used to read the output file
generated by mongodump.
• mongoimport , mongoexport : mongoimport provides a method for taking data in JSON, CSV, or TSV formats and importing it into a mongod instance. mongoexport provides a method to export data from a mongod instance into JSON, CSV, or TSV formats.
• mongostat , mongotop , mongosniff : These utilities provide diagnostic information
related to the current operation of a mongod instance.
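As an illustration, typical invocations of some of these tools might look like the following; the database, collection, and path names are only examples:
C:\practicalmongodb\bin>mongodump --db testdb --out C:\backup\dump
C:\practicalmongodb\bin>mongorestore --db testdb C:\backup\dump\testdb
C:\practicalmongodb\bin>mongoexport --db testdb --collection posts --out posts.json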

7.1.5 Standalone Deployment
Standalone deployment is used for development purpose; it doesn’t ensure any
redundancy of data and it doesn’t ensure recovery in case of failures. So it’s not
recommended for use in a production environment. Standalone deployment has the following components: a single mongod and a client connecting to the mongod, as shown in Figure 7-1.
Figure 7-1. Standalone deployment

7.2 Replication
In a standalone deployment, if the mongod is not available, you risk losing all the
data, which is not acceptable in a production environment. Replication is used to
offer safety against such kind of data loss. Replication provides for data redundancy
by replicating data on different nodes, thereby providing protection of data in case
of node failure. Replication provides high availability in a MongoDB deployment.
Replication also simplifies certain administrative tasks where the routine tasks such as
backups can be offloaded to the replica copies, freeing the main copy to handle the
important application requests. In some scenarios, it can also help in scaling the
reads by enabling the client to read from the different copies of data.

https://www.youtube.com/watch?v=UYlHGGluJx8
Master/Slave Replication
In MongoDB, the traditional master/slave replication is available but it is
recommended only for more than 50 node replications. In this type of replication,
there is one master and a number of slaves that replicate the data from the master.
The only advantage with this type of replication is that there’s no restriction on the
number of slaves within a cluster. However, thousands of slaves will overburden the
master node, so in practical scenarios it’s better to have fewer than a dozen slaves. In
addition, this type of replication doesn’t automate failover and provides less
redundancy.
In a basic master/slave setup , you have two types of mongod instances: one instance
is in the master mode and the remaining are in the slave mode, as shown in Figure 7-
2 . Since the slaves are replicating from the master, all slaves need to be aware of the
master’s address.

Figure 7-2. Master/slave replication

The master node maintains a capped collection (oplog) that stores an ordered history
of logical writes to the database.
The slaves replicate the data using this oplog collection. Since the oplog is a capped
collection, if the slave’s state is far behind the master’s state, the slave may become
out of sync. In that scenario, the replication will stop and manual intervention will be
needed to re-establish the replication.
There are two main reasons behind a slave becoming out of sync:
• The slave shuts down or stops and restarts later. During this time, the oplog may
have deleted the log of operations required to be applied on the slave.
• The slave is slow in executing the updates that are available from the master.

Replica Set
The replica set is a sophisticated form of the traditional master-slave replication and
is a recommended method in MongoDB deployments.

Replica sets are basically a type of master-slave replication but they provide
automatic failover. A replica set has one master, which is termed as primary, and
multiple slaves, which are termed as secondary in the replica set context; however,
unlike master-slave replication, there’s no one node that is fixed to be primary in the
replica set.

If a master goes down in a replica set, one of the slave nodes is automatically promoted to the master. The clients start connecting to the new master, and both
data and application will remain available. In a replica set, this failover happens in an
automated fashion. We will explain the details of how this process
happens later.
The primary node is selected through an election mechanism. If the primary goes down, a new primary is chosen through the same election process.
Figure 7-3 shows how a two-member replica set failover happens. Let’s discuss the
various steps that happen for a two-member replica set in failover

Figure 7-3. Two-member replica set failover

1. The primary goes down, and the secondary is promoted as primary.


2. The original primary comes up, it acts as slave, and becomes the secondary node.
The points to be noted are
• A replica set is a mongod’s cluster, which replicates among one another and
ensures automatic failover.
• In the replica set, one mongod will be the primary member and the others will be
secondary members.
• The primary member is elected by the members of the replica set. All writes are
directed to the primary member whereas the secondary members replicate from
the primary asynchronously using oplog.
• The secondary’s data sets reflect the primary data sets, enabling them to be
promoted to primary in case of unavailability of the current primary.

Replica set replication has a limitation on the number of members. Prior to version
3.0, the limit was 12, but this has been changed to 50 in version 3.0. So now a replica set can have a maximum of 50 members, and at any given point of time in a 50-member replica set, only 7 can participate in a vote.

7.2.1 Primary and Secondary Members


Before you move ahead and look at how the replica set functions, let’s look at the
type of members that a replica set can have. There are two types of members:
primary members and secondary members .

• Primary member: A replica set can have only one primary, which is elected by the voting nodes in the replica set. Any node with an associated priority of 1 can be elected as primary. The client redirects all the write operations to the primary member, and these writes are then later replicated to the secondary members.

• Secondary member: A normal secondary member holds a copy of the data. The secondary member can vote and can also be a candidate for being promoted to primary in case of failover of the current primary.

In addition to this, a replica set can have other types of secondary members.

Types of Secondary Members


Priority 0 members are secondary members that maintain the primary’s data copy
but can never become a primary in case of a failover. Apart from that, they function
as a normal secondary node, and they can participate in voting and can accept read
requests. The Priority 0 members are created by setting the priority to 0. Such types
of members are specifically useful for the following reasons:
1. They can serve as a cold standby.
2. In replica sets with varied hardware or geographic distribution, this configuration
ensures that only the qualified members get elected as primary.
3. In a replica set that spans multiple data centers across network partitioning, this
configuration can help ensure that the main data center has the eligible primary.
This is used to ensure that the failover is quick.
Hidden members are 0-priority members that are hidden from the client applications. Like the 0-priority members, this member also maintains a copy of the primary’s data, cannot become the primary, and can participate in the voting, but unlike 0-priority members, it can’t serve any read requests or receive any traffic beyond what replication requires. A node can be set as a hidden member by setting the hidden property to true. In a replica set, these members can be dedicated for reporting needs or backups.
Delayed members are secondary members that replicate data with a delay from the
primary’s oplog. This helps to recover from human errors, such as accidentally
dropped databases or errors that were caused by unsuccessful application upgrades.
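A minimal sketch of configuring these member types from the mongo shell is shown below; the member indexes and the one-hour delay are assumptions chosen only for illustration:
cfg = rs.conf()
cfg.members[2].priority = 0              // priority 0 member
cfg.members[3].priority = 0
cfg.members[3].hidden = true             // hidden member
cfg.members[4].priority = 0
cfg.members[4].hidden = true
cfg.members[4].slaveDelay = 3600         // delayed member, replication delayed by one hour
rs.reconfig(cfg)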

7.2.3 Elections
In order to get elected, a server needs not just a majority of the votes cast but a majority of the total votes in the set. If there are X servers, with each server having 1 vote, then a server can become primary only when it has at least [(X/2) + 1] votes. For example, in a five-member replica set, a member needs at least three votes to become primary.
If a server gets the required number of votes or more, then it will become primary.

The primary that went down still remains part of the set; when it is up, it will act as a
secondary server until the time it gets a majority of votes again.

The complication with this type of voting system is that you cannot have just two
nodes acting as master and slave. In this scenario, you will have total of two votes,
and to become a master, a node will need the majority of votes, which will be both of
the votes in this case. If one of the servers goes down, the other server
will end up having one vote out of two, and it will never be promoted as master, so it
will remain a slave.

In case of network partitioning, the master will lose the majority of votes since it will
have only its own one vote and it’ll be demoted to slave and the node that is acting
as slave will also remain a slave in the absence of the majority of the votes. You will
end up having two slaves until both servers reach each other again.

A replica set has number of ways to avoid such situations. The simplest way is to use
an arbiter to help resolve such conflicts. It’s very lightweight and is just a voter, so it
can run on either of the servers itself.

Let’s now see how the above scenario will change with the use of an arbiter. Let’s first
consider the network partitioning scenario. If you have a master, a slave, and an
arbiter, each has one vote, totalling three votes. If a network partition occurs with the
master and arbiter in one data center and the slave in another data center, the
master will remain master since it will still have the majority of votes
7.2.3.1 Example - Working of Election Process in More Details

Let’s assume you have a replica set with the following three members: A1, B1, and
C1. Each member exchanges a heartbeat request with the other members every few
seconds. The members respond with their current situation information to such
requests. A1 sends out heartbeat request to B1 and C1. B1 and C1 respond with their
current situation information, such as the state they are in (primary or secondary),
their current clock time, their eligibility to be promoted as primary, and so on. A1
receives all this information and updates its “map” of the set, which maintains information such as the members’ changed states, members that have gone down or
come up, and the round trip time.
While updating the A1’s map changes, it will check a few things depending on its
state:
• If A1 is primary and one of the members has gone down, then it will ensure that it’s
still able to reach the majority of the set. If it’s not able to do so, it will demote itself
to secondary state.
Demotions: There’s a problem when A1 undergoes a demotion. By default in
MongoDB writes are fire-and-forget (i.e. the client issues the writes but doesn’t
wait for a response). If an application is doing the default writes when the
primary is stepping down, it will never realize that the writes are actually not
happening and might end up losing data. Hence it’s recommended to use safe
writes. In this scenario, when the primary is stepping down, it closes all its client
connections, which will result in socket errors to the clients. The client libraries
then need to recheck who the new primary is and will be saved from losing their
write operations data.

• If A1 is a secondary and if the map has not changed, it will occasionally check
whether it should elect itself.

The first task A1 will do is run a sanity check where it will check the answers to a few questions such as: Does A1 think it’s already primary? Does another member think it’s primary? Is A1 not eligible for election? If it can’t answer any of the basic questions,
A1 will continue idling as is; otherwise, it will proceed with the election process:
• A1 sends a message to the other members of the set, which in this case are B1
and C1, saying “I am planning to become a primary. Please suggest”
• When B1 and C1 receive the message, they will check the view around them.
They will run through a big list of sanity checks, such as, Is there any other
node that can be primary? Does A1 have the most recent data or is there any
other node that has the most recent data? If all the checks seem ok, they send
a “go-ahead” message; however, if any of the checks fail, a “stop election”
message is sent.
• If any of the members send a “stop election” reply, the election is cancelled and
A1 remains a secondary member.
• If the “go-ahead” is received from all, A1 goes to the election process
final phase.

7.2.4 Data Replication Process


The members of a replica set replicate data continuously. Every member, including
the primary member, maintains an oplog. An oplog is a capped collection where the
members maintain a record of all the operations that are performed on the data set.
The secondary members copy the primary member’s oplog and apply all the
operations in an asynchronous manner.

7.2.4.1 Oplog
Oplog stands for the operation log . An oplog is a capped collection where all the
operations that modify the data are recorded.
The oplog is maintained in a special database, namely local in the collection
oplog.$main . Every operation is maintained as a document, where each document
corresponds to one operation that is performed on the master server. The document
contains various keys, including the following (an illustrative oplog entry is shown after the list):
• ts : This stores the timestamp when the operations are performed. It’s an internal
type and is composed of a 4-byte timestamp and a 4-byte incrementing counter.
• op : This stores information about the type of operation performed. The value is stored as a 1-byte code (e.g. it will store an “i” for an insert operation).
• ns : This key stores the collection namespace on which the operation was
performed.
• o : This key specifies the operation that is performed. In case of an insert, this will
store the document to insert.
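For illustration, a single insert operation might appear in the oplog roughly as follows; the values shown are hypothetical:
{
"ts" : Timestamp(1436189483, 1),
"op" : "i",
"ns" : "testdb.posts",
"o" : { "_id" : ObjectId("509d27069cc1ae293b36928d"), "title" : "Sample title" }
}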

Only operations that change the data are maintained in the oplog because it’s a
mechanism for ensuring that the secondary node data is in sync with the primary
node data.
The operations that are stored in the oplog are transformed so that they remain
idempotent, which means that even if it’s applied multiple times on the secondary,
the secondary node data will remain consistent. Since the oplog is a capped
collection, with every new addition of an operation, the oldest operations are
automatically moved out. This is done to ensure that it does not grow beyond a pre-
set bound, which is the oplog size.

7.2.4.2 Initial Sync and Replication


Initial sync is done when the member is in either of the following two cases:
1. The node has started for the first time (i.e. it’s a new node and has no data).
2. The node has become stale, where the primary has overwritten the oplog and the
node has not replicated the data. In this case, the data will be removed.
In both cases, the initial sync involves the following steps:
1. First, all databases are cloned.
2. Using oplog of the source node, the changes are applied to its dataset.
3. Finally, the indexes are built on all the collections.
Post the initial sync, the replica set members continuously replicate the changes in
order to be up-to-date. Most of the synchronization happens from the primary, but
chained replication can be enabled where the sync happens from a secondary only
(i.e. the sync targets are changed based on the ping time and state of other
member’s replication).
7.2.4.3 Syncing – Normal Operation
In normal operations, the secondary chooses a member from where it will sync its
data, and then the operations are pulled from the chosen source’s oplog collection
(local.oplog.rs ).
Once the operation (op) is fetched, the secondary does the following:
1. It first applies the op to its data copy.
2. Then it writes the op to its local oplog.
3. Once the op is written to the oplog, it requests the next op.
Suppose it crashes between step 1 and step 2, and then it comes back again. In this
scenario, it’ll assume the operation has not been performed and will re-apply it.

7.2.4.4 Starting Up
When a node is started, it checks its local collection to find out the
lastOpTimeWritten . This is the time of the latest op that was applied on the
secondary.
The following shell helper can be used to find the latest op in the shell:
> rs.debug.getLastOpWritten()
The output returns a field named ts , which depicts the last op time.
If a member starts up and finds the ts entry, it starts by choosing a target to sync
from and it will start syncing as in a normal operation. However, if no entry is found,
the node will begin the initial sync process.

7.2.4.5 Whom to Sync From?


In this section, you will look at how the source to sync from is chosen. As of version 2.0, servers automatically sync from the “nearest” node based on the average ping time.
When you bring up a new node, it sends heartbeats to all nodes and monitors the
response time. Based on the data received, it then decides the member to sync from
using the following algorithm:

for each healthy member Loop:


if state is Primary
add the member to possible sync target set
if member’s lastOpTimeWritten is greater than the local lastOpTimeWritten
add the member to possible sync target set
Set sync_from = MIN (PING TIME to members of sync target set)

Note: A “healthy member” can be thought of as a “normal” primary or secondary


member.
In version 2.0, the slave’s delayed nodes were debatably included in “healthy” nodes.
Starting from version 2.2, delayed nodes and hidden nodes are excluded from the
“healthy” nodes.
Running the following command will show the server that is chosen as the source for
syncing:
db.adminCommand({replSetGetStatus:1})

The output field of syncingTo is present only on secondary nodes and provides
information on the node from which it is syncing.

7.2.4.6 Making Writes Work with Chaining Slaves


You have seen that the above algorithm for choosing a source to sync from implies
that slave chaining is semi-automatic. When a server is started, it’ll most probably
choose a server within the same data center to sync from, thus reducing the WAN
traffic.
In this section, you will see how w (the write concern option) works with slave chaining. If N1 is syncing from N2, which is in turn syncing from N3, how will N3 know the point up to which N1 is synced?
When N1 starts its sync from N2, a special “handshake” message is sent, which
intimates to N2 that N1 will be syncing from its oplog. Since N2 is not primary, it will
forward the message to the node it is syncing from (i.e. it opens a connection to N3
pretending to be N1). By the end of the above step, N2 has two connections that are
opened with N3: one connection for itself and the other for N1.

Whenever an op request is made by N1 to N2, the op is sent by N2 from its oplog


and a dummy request is forwarded on the link of N1 to N3, as shown in Figure 7-4 .
Figure 7-4. Writes via chaining slaves

7.2.5 Failover
In this section, you will look at how primary and secondary member failovers are
handled in replica sets. All members of a replica set are connected to each other. As
shown in Figure 7-5 , they exchange a heartbeat message amongst each other.

Figure 7-5. Heartbeat message exchange

Hence a node that misses heartbeats is considered to have crashed.


7.2.5.1 If the Node Is a Secondary Node
If the node is a secondary node , it will be removed from the membership of the
replica set. In the future, when it recovers, it can re-join. Once it re-joins, it needs to
update the latest changes.
1. If the down period is small, it connects to the primary and catches up with the
latest updates.
2. However, if the down period is lengthy, the secondary server will need to resync
with primary where it deletes all its data and does an initial sync as if it’s a new
server.

If the Node Is the Primary Node


If the node is a primary node, in this scenario if the majority of the members of the
original replica sets are able to connect to each other, a new primary will be elected
by these nodes, which is in accordance with the automatic failover capability of the
replica set.
The election process will be initiated by any node that cannot reach the primary.
The new primary is elected by majority of the replica set nodes. Arbiters can be used
to break ties in scenarios such as when network partitioning splits the participating
nodes into two halves and the majority
cannot be reached.
The node with the highest priority will be the new primary. If you have more than
one node with same priority, the data freshness can be used for breaking ties.
The primary node uses a heartbeat to track how many nodes are visible to it. If the
number of visible nodes falls below the majority, the primary automatically falls back
to the secondary state. This scenario prevents the primary from functioning when it’s
separated by a network partition.

7.2.5.2 Rollbacks
In the scenario of a primary node change, the data on the new primary is assumed to be the latest data in the system. When the former primary rejoins, any operations that were applied on it but not replicated to the new primary will be rolled back.

Then it will be synced with the new primary.

The rollback operation reverts all the write operations that were not replicated across
the replica set.

This is done in order to maintain database consistency across the replica set.
When connecting to the new primary, all nodes go through a resync process to
ensure the rollback is accomplished. The nodes look through the operation that is
not there on the new primary, and then they query the new primary to return an
updated copy of the documents that were affected by the operations.
The nodes are in the process of resyncing and are said to be recovering; until the
process is complete, they will not be eligible for primary election.
This happens very rarely, and if it happens, it is often due to network partition with
replication lag where the secondaries cannot keep up with the operation’s
throughput on the former primary.

It needs to be noted that if the write operations replicate to other members before
the primary steps down, and those members are accessible to majority of the nodes
of the replica set, the rollback does not occur.
The rollback data is written to a BSON file with filenames such as <database>.<collection>.<timestamp>.bson in the database’s dbpath directory.
The administrator can decide to either ignore or apply the rollback data. Applying
the rollback data can only begin when all the nodes are in sync with the new primary
and have rolled back to a consistent state.
The content of the rollback files can be read using bsondump, and the data then needs to be manually applied to the new primary using mongorestore.
There is no method to handle rollback situations automatically for MongoDB.

Therefore manual intervention is required to apply rollback data. While applying the
rollback, it’s vital to ensure that these are replicated to either all or at least some of
the members in the set so that in case of any failover rollbacks can be avoided.

7.2.5.3 Consistency
The replica set members keep replicating data among each other by reading the oplog.
How is the consistency of data maintained? In this section, you will look at how MongoDB ensures that you always access consistent data.
In MongoDB, although the reads can be routed to the secondaries, the writes are
always routed to the primary, eradicating the scenario where two nodes are
simultaneously trying to update the same data set.
The data set on the primary node is always consistent.
If the read requests are routed to the primary node, it will always see the up-to-date
changes, which means the read operations are always consistent with the last write
operations. However, if the application has changed the read preference to read from
secondaries, there might be a probability of user not seeing the latest changes or
seeing previous states. This is because the writes are replicated asynchronously on
the secondaries.

7.2.5.4 Possible Replication Deployment


The architecture you chose to deploy a replica set affects its capability and capacity.
In this section, you will look at few strategies that you need to be aware of while
deciding on the architecture. We will also be discussing the deployment architecture .
1. Odd number of members : This should be done in order to ensure that there is
no tie when electing a primary. If the number of nodes is even, then an arbiter
can be used to ensure that the total nodes participating in election is odd, as
shown in Figure 7-6 .

Figure 7-6. Members replica set with primary, secondary, and arbiter

2. Replica set fault tolerance is the count of members, which can go down but
still the replica set has enough members to elect a primary in case of any failure.
Table 7-1 indicates the relationship between the member count in the replica set
and its fault tolerance. Fault tolerance should be considered when deciding on
the number of members.

3. If the application has specific dedicated requirements , such as for reporting


or backups, then delayed or hidden members can be considered as part of the
replica set, as shown in Figure 7-7 .
Figure 7-7. Members replica set with primary, secondary, and hidden members

4. If the application is read-heavy, the read can be distributed across secondaries.


As the requirement increases, more nodes can be added to increase the data
duplication; this can have a positive impact on the read throughput.

5. The members should be distributed geographically in order to cater to main data


center failure. As shown in Figure 7-8, the members that are kept at a geographically
different location other than the main data center can have priority set as 0, so that
they cannot be elected as primary and can act as a standby only.
Figure 7-8. Members replica set with primary, secondary, and a priority 0
member distributed across the data center

7.2.5.5 Scaling Reads

Although the primary purpose of the secondaries is to ensure data availability in case
of downtime of the primary node, there are other valid use cases for secondaries.
They can be used dedicatedly to perform backup operations or data processing jobs
or to scale out reads. One of the ways to scale reads is to issue the read queries
against the secondary nodes; by doing so the workload on the master is reduced.
One important point that you need to consider when using secondaries for scaling
read operations is that in MongoDB the replication is asynchronous, which means if
any write or update operation is performed on the master’s data, the secondary data
will be momentarily out-of-date. If the application in question is read-heavy and is
accessed over a network and does not need up-to-date data, the secondaries
can be used to scale out the read in order to provide a good read throughput.
Although by default the read requests are routed to the primary node, the requests
can be distributed over secondary nodes by specifying the read preferences . Figure
7-9 depicts the default read preference.
Figure 7-9. Default read preference
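For example, the read preference can be changed from the mongo shell as sketched below, after which queries issued over that connection may be served by a secondary (the collection and filter are taken from the blog example and are only illustrative):
db.getMongo().setReadPref("secondaryPreferred")
db.posts.find({"author" : "Author 1"})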

7.2.6 Application Write Concerns


When the client application interacts with MongoDB, it is generally not aware
whether the database is on standalone deployment or is deployed as a replica set.
However, when dealing with replica sets, the client should be aware of write concern
and read concern .
Since a replica set duplicates the data and stores it across multiple nodes, these two
concerns give a client application the flexibility to enforce data consistency across
nodes while performing read or write operations.
Using a write concern enables the application to get a success or failure response
from MongoDB.
When used in a replica set deployment of MongoDB, the write concern sends a
confirmation from the server to the application that the write has succeeded on the
primary node. However, this can be configured so that the write concern returns
success only when the write is replicated to all the nodes maintaining the data.
In a practical scenario, this isn’t feasible because it will reduce the write performance.
Ideally the client can ensure, using a write concern, that the data is replicated to one
more node in addition to the primary, so that the data is not lost even if the primary
steps down.
The write concern returns an object that indicates either error or no error.
The w option ensures that the write has been replicated to the specified number of
members. Either a number or a majority can be specified as the value of the w
option.
If a number is specified, the write replicates to that many number of nodes before
returning success. If a majority is specified, the write is replicated to a majority of
members before returning the result.
Figure 7-12 shows how a write happens with w: 2.

Figure 7-12. writeConcern

7.2.6.1 How Writes Happen with Write Concern


In order to ensure that the written data is present on say at least two members, issue
the following command :
>db.testprod.insert({i:"test", q: 50, t: "B"}, {writeConcern: {w:2}})

In order to understand how this command will be executed, say you have two
members, one named primary and the other named secondary, and it is syncing its
data from the primary. But how will the primary know the point at which the
secondary is synced? Since the primary’s oplog is queried by the secondary for op
results to be applied, if the secondary requests an op written at say t time, it implies
to the primary that the secondary has replicated all ops written before t .
The following are the steps that a write concern takes.
1. The write operation is directed to the primary.
2. The operation is written to the oplog of primary with ts depicting the time of
operation.
3. A w: 2 is issued, so the write operation needs to be written to one more server
before it’s marked successful.
4. The secondary queries the primary’s oplog for the op, and it applies the op.
5. Next, the secondary sends a request to the primary requesting for ops with ts
greater than t.
6. At this point, the primary sends an update that the operation until t has been
applied by the secondary as it’s requesting for ops with {ts: {$gt: t}} .
7. The writeConcern finds that a write has occurred on both the primary and
secondary, satisfying the w: 2 criteria, and the command returns success.

7.2.7 Implementing Advanced Clustering with Replica Sets


Having learned the architecture and inner workings of replica sets, you will now focus
on administration and usage of replica sets. You will be focusing on the following:
1. Setting up a replica set.
2. Removing a server.
3. Adding a server.
4. Adding an arbiter.
5. Inspecting the status.
6. Forcing a new election of a primary.
7. Using the web interface to inspect the status of the replica set.

The following examples assume a replica set named testset that has the
configuration shown in Table 7-2 .

7.2.7.1 Setting Up a Replica Set


In order to get the replica set up and running , you need to make all the active
members up and running.
The first step is to start the first active member. Open a terminal window and create
the data directory :
C:\>mkdir C:\db1\active1\data
C:\>
Connect to the mongod:
c:\practicalmongodb\bin>mongod --dbpath C:\db1\active1\data --port 27021 --replSet testset/ANOC9:27021 --rest

7.2.7.2 Removing a Server


We will remove the secondary active member from the set. Let’s connect to the
secondary member mongo instance. Open a new command prompt, like so:
C:\>cd c:\practicalmongodb\bin
c:\practicalmongodb\bin>mongo ANOC9 --port 27022
MongoDB shell version: 3.0.4
connecting to: 127.0.0.1:27022/ANOC9
testset:SECONDARY>

7.2.7.3 Adding a Server


You will next add a new active member to the replica set. As with other members,
you begin by opening a new command prompt and creating the data directory first:
C:\>mkdir C:\db1\active3\data
C:\>
Next, you start the mongod using the following command:
c:\practicalmongodb\bin>mongod --dbpath C:\db1\active3\data --port 27024 --
replSet testset/
ANOC9:27021 --rest
..........
You have the new mongod running, so now you need to add this to the replica set.
For this you connect to the primary’s mongo console:
C:\>c:\practicalmongodb\bin\mongo.exe --port 27021
MongoDB shell version: 3.0.4
connecting to: 127.0.0.1:27021/test
testset:PRIMARY>
Next, you switch to admin db:
testset:PRIMARY> use admin
switched to db admin
testset:PRIMARY>

Finally, the following command needs to be issued to add the new mongod to the
replica set:
testset:PRIMARY> rs.add("ANOC9:27024")
{ "ok" : 1 }

7.2.7.4 Adding an Arbiter to a Replica Set


You will now add an arbiter member to the set. As with the other members, you begin by
creating the data directory for the MongoDB instance:
C:\>mkdir c:\db1\arbiter\data
C:\>
You next start the mongod using the following command:
c:\practicalmongodb\bin>mongod --dbpath c:\db1\arbiter\data --port 30000 --
replSet testset/
ANOC9:27021 --rest
2015-07-13T22:05:10.205-0700 I CONTROL [initandlisten] MongoDB starting :
pid=3700
port=30000 dbpath=c:\db1\arbiter\data 64-bit host=ANOC9
..........................................................
Connect to the primary’s mongo console, switch to the admin db, and add the newly
created mongod as
an arbiter to the replica set:
C:\>c:\practicalmongodb\bin\mongo.exe --port 27021
MongoDB shell version: 3.0.4
connecting to: 127.0.0.1:27021/test
testset:PRIMARY> use admin
switched to db admin
testset:PRIMARY> rs.addArb("ANOC9:30000")
{ "ok" : 1 }
testset:PRIMARY>
Whether the step is successful or not can be verified using rs.status().

7.2.7.5 Inspecting the Status Using rs.status()


We have been referring to rs.status() throughout the examples above to check the
replica set status. In this section, you will learn what this command is all about.
It enables you to check the status of the member whose console you are connected to and also to view its role within the replica set.
The following command is issued from the primary’s mongo console:
testset:PRIMARY> rs.status()
{
"set" : "testset",
"date" : ISODate("2015-07-13T22:15:46.222Z")
"myState" : 1,
"members" : [
{
"_id" : 0,
...........................
"ok" : 1
testset:PRIMARY>

The myState field’s value indicates the status of the member and it can have the
values shown in Table 7-3
7.2.7.6 Forcing a New Election
The current primary server can be forced to step down using the rs.stepDown ()
command. This force starts the election for a new primary.
This command is useful in the following scenarios :
1. When you are simulating the impact of a primary failure, forcing the cluster to fail
over. This lets you test how your application responds in such a scenario.
2. When the primary server needs to be offline. This is done either for a maintenance activity, for upgrading, or to investigate the server.
3. When a diagnostic process need to be run against the data structures.

7.2.7.7 Inspecting Status of the Replica Set Using a Web Interface


A web-based console is maintained by MongoDB for viewing the system status. In
your example, the console can be accessed via http://localhost:28021 .
By default the web interface port number is set to X+1000 where X is the mongod
instance port number. In this chapter’s example, since the primary instance is on
27021, the web interface is on port 28021

7.3 Sharding
A page fault happens when data which is not there in memory is accessed by
MongoDB. If there’s free memory available, the OS will directly load the requested
page into memory; however, in the absence of free memory, the page in memory is
written to the disk and then the requested page is loaded in the memory, slowing
down the process. A few operations can accidentally purge a large portion of the working set from memory, leading to an adverse effect on the performance. One example is a query scanning through all documents of a database whose size exceeds the server memory. This leads to the documents being loaded into memory and the working set being moved out to disk.
Ensuring you have defined the appropriate index coverage for your queries during
the schema design phase of the project will minimize the risk of this happening. The
MongoDB explain operation can be used to provide information on your query plan
and the indexes used.
MongoDB’s serverStatus command returns a workingSet document that provides an
estimate of the instance’s working set size. The Operations team can track how many
pages the instance accessed over a given period of time and the elapsed time
between the working set’s oldest and newest document. Tracking all these metrics,
it’s possible to detect when the working set will be hitting the current memory limit,
so proactive actions can be taken to ensure the system is scaled well enough to
handle that.

Figure 7-15. Sharded collection across three shards

https://www.youtube.com/watch?v=FK3cKKccM5E
7.3.1 Sharding Components
You will next look at the components that enable sharding in MongoDB. Sharding is
enabled in MongoDB via sharded clusters.
The following are the components of a sharded cluster:
• Shards
• mongos
• Config servers

The shard is the component where the actual data is stored. For the sharded cluster,
it holds a subset of data and can either be a mongod or a replica set. All shard’s data
combined together forms the complete dataset for the sharded cluster.
Sharding is enabled per collection basis, so there might be collections that are not
sharded. In every sharded cluster there’s a primary shard where all the unsharded
collections are placed in addition to the sharded collection data. When deploying a
sharded cluster, by default the first shard becomes the primary shard although it’s
configurable. See Figure 7-16 .

Figure 7-16. Primary shard

7.3.2 Data Distribution Process


The data is distributed among the shards for the collections where sharding is
enabled. In MongoDB, the data is sharded or distributed at the collection level. The
collection is partitioned
by the shard key.

7.3.2.1 Shard Key


Any indexed single/compound field that exists within all documents of the collection
can be a shard key. You specify that this is the field basis which the documents of the
collection need to be distributed.
Internally, MongoDB divides the documents based on the value of the field into
chunks and distributes them across the shards.
There are two ways MongoDB enables distribution of the data: range-based
partitioning and hashbased partitioning.

7.3.2.1.1 Range-Based Partitioning


In range-based partitioning , the shard key values are divided into ranges. Say you
consider a timestamp field as the shard key. In this way of partitioning, the values are
considered as a straight line starting from a Min value to Max value where Min is the
starting period (say, 01/01/1970) and Max is the end period (say, 12/31/9999). Every
document in the collection will have timestamp value within this range only, and it
will represent some point on the line.

Based on the number of shards available, the line will be divided into ranges, and
documents will be distributed based on them.
In this scheme of partitioning, shown in Figure 7-17 , the documents where the
values of the shard key are nearby are likely to fall on the same shard. This can
significantly improve the performance of the range queries.

Figure 7-17. Range-based partitioning

7.3.2.1.2 Hash-Based Partitioning


In hash-based partitioning , the data is distributed on the basis of the hash value of
the shard field. If selected, this will lead to a more random distribution compared to
range-based partitioning.
It’s unlikely that the documents with close shard key will be part of the same chunk.
For example, for ranges based on the hash of the id field, there will be a straight line
of hash values, which will again be partitioned on basis of the number of shards. On
the basis of the hash values, the documents will lie in either
of the shards. See Figure 7-18 .

Figure 7-18. Hash-based partitioning
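As a sketch, the two partitioning schemes map to the following shardCollection calls; the database, collection, and field names below are assumptions, and sharding must already be enabled on the database:
// Range-based partitioning on a timestamp field
sh.shardCollection("testdb.posts", { "created_date" : 1 })
// Hash-based partitioning on the _id field
sh.shardCollection("testdb.users", { "_id" : "hashed" })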

7.3.2.1.3 Chunks

The data is moved between the shards in form of chunks. The shard key range is
further partitioned into subranges, which are also termed as chunks. See Figure 7-19 .
Figure 7-19. Chunks

For a sharded cluster, 64MB is the default chunk size. In most situations, this is an apt size for chunk splitting and migration.
Let’s discuss the execution of sharding and chunks with an example. Say you have a
blog posts collection which is sharded on the field date . This implies that the
collection will be split up on the basis of the date field values. Let’s assume further
that you have three shards. In this scenario the data might be distributed across
shards as follows:
Shard #1: Beginning of time up to July 2009
Shard #2: August 2009 to December 2009
Shard #3: January 2010 to through the end of time

7.3.3 Data Balancing Process


The addition of new data or modification of existing data, or the addition or removal
of servers, can lead to an imbalance in the data distribution, which means either one shard is overloaded with more chunks while the other shards have fewer chunks, or the size of a chunk grows significantly greater than that of the other chunks.
MongoDB ensures balance with the following background processes:
• Chunk splitting
• Balancer

7.3.3.1 Chunk Splitting


Chunk splitting is one of the processes that ensures the chunks are of the specified
size. As you have seen, a shard key is chosen and it is used to identify how the
documents will be distributed across the shards. The documents are further grouped
into chunks of 64MB (default and is configurable) and are stored in the
shards based on the range it is hosting.
Figure 7-20. Chunk splitting

7.3.3.2 Balancer
Balancer is the background process that is used to ensure that all of the shards are
equally loaded or are in a balanced state. This process manages chunk migrations.
Splitting of the chunk can cause imbalance. The addition or removal of documents
can also lead to a cluster imbalance. In a cluster imbalance, balancer is used, which is
the process of distributing data evenly.
When you have a shard with more chunks as compared to other shards, then the
chunks balancing is done automatically by MongoDB across the shards . This process
is transparent to the application and to you.
Any of the mongos within the cluster can initiate the balancer process. They do so by
acquiring a lock on the config database of the config server, as balancer involves
migration of chunks from one shard to another, which can lead to a change in the
metadata, which will lead to change in the config server database.
The balancer process can have huge impact on the database performance, so it can
either
1. Be configured to start the migration only when the migration threshold has
reached. The migration threshold is the difference in the number of maximum
and minimum chunks on the shards. Threshold is shown in Table 7-4 .

2. Or it can be scheduled to run in a time period that will not impact the production
traffic.

The balancer migrates one chunk at a time (see Figure 7-21 ) and follows these steps:
1. The moveChunk command is sent to the source shard.
2. An internal moveChunk command is started on the source where it creates the
copy of the documents within the chunk and queues it. In the meantime, any
operations for that chunk are routed to the source by the mongos because the
config database is not yet changed and the source will be responsible for serving
any read/write request on that chunk.
3. The destination shard starts receiving the copy of the data from the source.
4. Once all of the documents in the chunks have been received by the destination
shard, the synchronization process is initiated to ensure that all changes that
have happened to the data while migration are updated at the destination shard.
5. Once the synchronization is completed, the next step is to update the metadata
with the chunk’s new location in the config database. This activity is done by
the destination shard that connects to the config database and carries out the
necessary updates.
6. Post successful completion of all the above, the document copy that is
maintained at the source shard is deleted.

Figure 7-21. Chunk migration

7.3.4 Operations
The read and write operations are performed on the sharded cluster. As mentioned,
the config servers maintain the cluster metadata. This data is stored in the config
database. This data of the config database is used by the mongos to service the
application read and write requests.
The data is cached by the mongos instances, which is then used for routing write and
read operations to the shards. This way the config servers are not overburdened.
The mongos will only read from the config servers in the following scenarios:
• The mongos has started for first time or
• An existing mongos has restarted or
• After chunk migration when the mongos needs to update its cached metadata with
the new cluster metadata.

7.3.5 Implementing Sharding

You will be focusing on the following:


1. Setting up a sharded cluster.
2. Creating a database and collection, and enable sharding on the collection.
3. Using the import command to load data in the sharded collection.
4. Distributing data amongst the shards.
5. Adding and removing shards from the cluster and checking how data is
distributed automatically.

7.3.5.1 Setting the Shard Cluster


In order to set up the cluster , the first step is to set up the configuration server. Enter
the following code in a new terminal window to create the data directory for the
config server and start the mongod:
C:\> mkdir C:\db1\config\data
C:\>CD C:\practicalmongodb\bin
C:\practicalmongodb\bin>mongod --port 27022 --dbpath C:\db1\config\data --configsvr

7.3.5.2 Creating a Database and Shard Collection


In order to continue further with the example, you will create a database named
testdb and a collection named testcollection , which you will be sharding on the key
testkey . Connect to the mongos console and issue the following command to get
the database:
mongos> testdb=db.getSisterDB("testdb")
testdb
Next, enabling sharding at database level for testdb :
mongos> db.runCommand({enableSharding: "testdb"})
{ "ok" : 1 }
mongos>
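Next, the collection itself can be sharded on the key testkey. A sketch of the command issued from the mongos console is shown below; for an empty collection, the required index on the shard key is created automatically:
mongos> sh.shardCollection("testdb.testcollection", { "testkey" : 1 })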

7.3.5.3 Adding a New Shard


You have a sharded cluster set up and you also have sharded a collection and looked
at how the data is distributed amongst the shards. Next, you’ll add a new shard to
the cluster so that the load is spread out a little more. You will be repeating the steps
mentioned above. Begin by creating a data directory for the new shard in
a new terminal window:
c:\>mkdir c:\db1\shard2\data
Next, start the mongod at port 27025:
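A sketch of this command, following the pattern used for the earlier shards, is shown below, along with the step of adding the new shard to the cluster from the mongos console (the host name and path are assumptions based on the surrounding examples):
c:\practicalmongodb\bin>mongod --port 27025 --dbpath c:\db1\shard2\data
mongos> sh.addShard("localhost:27025")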

7.3.5.4 Removing a Shard


In the following example, you will see how to remove a shard server. For this
example, you will be removing the server you added in the above example.
In order to initiate the process, you need to log on to the mongos console, switch to
the admin db, and execute the following command to remove the shard from the
shard cluster:
C:\>cd c:\practicalmongodb\bin
c:\practicalmongodb\bin>mongo localhost:27021
MongoDB shell version: 3.0.4
connecting to: localhost:27021/test
mongos> use admin
switched to db admin
mongos> db.runCommand({removeShard: "localhost:27025"})
{
"msg" : "draining started successfully",
"state" : "started",
"shard" : "shard0002",
"ok" : 1
}
mongos>

7.3.5.5 Listing the Sharded Cluster Status


The printShardingStatus() command gives lots of insight into the sharding system
internals.
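It can be invoked from the mongos console in either of the following equivalent ways:
mongos> db.printShardingStatus()
mongos> sh.status()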

7.3.6 Controlling Collection Distribution (Tag-Based Sharding)


Tagging gives operators control over which collections go to which shard.
In order to understand tag-based sharding, let’s set up a sharded cluster. You will be
using the shard cluster created above. For this example, you need three shards, so
you will add Shard2 again to the cluster.
7.3.6.1 Prerequisite
You will start the cluster first. Just to reiterate, follow these steps.
1. Start the config server. Enter the following command in a new terminal window
(if it’s not already running):
C:\> mkdir C:\db1\config\data
C:\>cd c:\practicalmongodb\bin
C:\practicalmongodb\bin>mongod --port 27022 --dbpath C:\db1\config\data --configsvr

2. Start the mongos. Enter the following command in a new terminal window (if it’s
not already running):
C:\>cd c:\practicalmongodb\bin
c:\practicalmongodb\bin>mongos --configdb localhost:27022 --port 27021

3. You will start the shard servers next.

7.3.6.2 Tagging
By the end of the above steps you have your sharded cluster with a config server,
three shards, and a mongos up and running. Next, connect to the mongos at 30999
port and configdb at 27022 in a new terminal window:
C:\ >cd c:\practicalmongodb\bin
c:\practicalmongodb\bin>mongos --port 30999 --configdb localhost:27022

The steps are as follows (a sketch of the corresponding mongos shell commands follows the list):


1. Connect to the mongos console.
2. View the running databases connected to the mongos instance running at port
30999.
3. Get a reference to the database movies.
4. Enable sharding of the database movies.
5. Shard the collection movies.drama by shard key originality.
6. Shard the collection movies.action by shard key distribution.
7. Shard the collection movies.comedy by shard key collections.
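A minimal sketch of these steps in the mongos shell (the database and collection names are taken from the list above; the ascending key direction {key: 1} is an assumption):
c:\practicalmongodb\bin>mongo localhost:30999
mongos> show dbs
mongos> db = db.getSisterDB("movies")
mongos> sh.enableSharding("movies")
mongos> sh.shardCollection("movies.drama", {originality: 1})
mongos> sh.shardCollection("movies.action", {distribution: 1})
mongos> sh.shardCollection("movies.comedy", {collections: 1})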

7.3.6.3 Scaling with Tagging


Multiple Tags
You can have multiple tags associated with the shards. Let's add two different tags to
the shards. Say you want to distribute the writes based on the disk type: one shard
has a spinning disk and the other has an SSD (solid state drive), and you want to
redirect 50% of the writes to the shard with the SSD and the remainder to the one with
the spinning disk.
First, tag the shards based on these properties:
mongos> sh.addShardTag("shard0001", "spinning")
mongos> sh.addShardTag("shard0002", "ssd")
mongos>
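Tags only take effect once they are mapped to ranges of the shard key. The following is a sketch using sh.addTagRange(), assuming the testdb.testcollection namespace from earlier and a hypothetical split point of 500 on testkey that roughly halves the writes:
mongos> sh.addTagRange("testdb.testcollection", { testkey: MinKey }, { testkey: 500 }, "spinning")
mongos> sh.addTagRange("testdb.testcollection", { testkey: 500 }, { testkey: MaxKey }, "ssd")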

7.3.7 Points to Remember When Importing Data in a Sharded Environment

7.3.7.1 Pre-Splitting of the Data


Instead of leaving the choice of chunk creation to MongoDB, you can tell MongoDB
how to split the data using the following command (the split command is issued
against the admin database):
db.runCommand( { split : "practicalmongodb.mycollection" , middle : { shardkey :
value } } );
After this, you can also tell MongoDB which chunk goes to which node.
For all of this you will need knowledge of the data you will be importing into the
database. It also depends on the use case you are aiming to solve and how the data is
read by your application. When deciding where to place a chunk, keep
things like data locality in mind.
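For example, to assign a pre-split chunk to a particular node you can use the moveChunk admin command; the shard name shard0001 and the shard key value used here are placeholders, as in the split example above:
mongos> db.adminCommand({ moveChunk: "practicalmongodb.mycollection", find: { shardkey: value }, to: "shard0001" })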
7.3.7.2 Deciding on the Chunk Size
You need to keep the following points in mind when deciding on the chunk size (a sketch for changing the default follows this list):
1. If the size is too small, the data will be distributed evenly, but you will end up with
more frequent migrations, which is an expensive operation at the mongos layer.
2. If the size is large, it will lead to fewer migrations, reducing the expense at the
mongos layer, but you will end up with uneven data distribution.
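The default chunk size is 64MB. If you decide to change it, one way (a sketch; the value 32 is just an example, in megabytes) is to update the settings collection in the config database from the mongos shell:
mongos> use config
mongos> db.settings.save({ _id: "chunksize", value: 32 })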
7.3.7.3 Choosing a Good Shard Key
It is essential to pick a good shard key so that data is distributed well among the nodes of the shard cluster.

7.3.8 Monitoring for Sharding


In addition to the normal monitoring and analysis that is done for other MongoDB
instances, a sharded cluster requires additional monitoring to ensure that all of its
operations function properly and that the data is distributed effectively among the
nodes. In this section, you will see what monitoring you should do for the proper
functioning of the sharded cluster.

7.3.9 Monitoring the Config Servers


The config server, as you know by now, stores the metadata of the sharded cluster.
The mongos caches the data and routes the request to the respective shards. If the
config server goes down but there’s a running mongos instance, there’s no
immediate impact on the shard cluster and it will remain available for a while.
However, you won’t be able to perform operations like chunk migration or restart a
new mongos. In the long run, the unavailability of the config server can severely
impact the availability of the cluster. To ensure that the cluster remains balanced and
available, you should monitor the config servers.

7.3.9.1 Monitoring the Shard Status Balancing and Chunk Distribution


For the most effective sharded cluster deployment, the chunks should be distributed
evenly among the shards. As you know by now, this is done automatically by MongoDB
using a background process. You need to monitor the shard status to ensure that this
process is working effectively, which you can do with the db.printShardingStatus() or
sh.status() command in the mongos shell.

7.3.9.2 Monitoring the Lock Status


In almost all cases the balancer releases its locks automatically after completing its
process, but you need to check the lock status of the database to ensure there's no
long-lasting lock, because this can block future balancing, which will affect the
availability of the cluster. Issue the following from the mongos shell to check the
lock status:
use config
db.locks.find()
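The balancer's lock is the one of most interest. A sketch that narrows the query to it (in this MongoDB release the lock document has _id "balancer"; a state value of 2 indicates the lock is currently taken):
mongos> use config
mongos> db.locks.find({ _id: "balancer" })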
7.4 Production Cluster Architecture

We will look at the production cluster architecture. In order to understand it, let's
consider a very generic use case of a social networking application where the user
can create a circle of friends and can share their comments or pictures across the
group. The user can also comment on or like her friends' comments or pictures. The
users are geographically distributed. The application requires immediate
availability across geographies of all the comments; data should be redundant so
that the users' comments, posts, and pictures are not lost; and it should be highly
available. So the application's production cluster should have the following
components:
1. At least two mongos instances, but you can have more as per need.
2. Three config servers, each on a separate system.
3. Two or more replica sets serving as shards. The replica sets are distributed across
geographies with the read preference set to nearest.

7.4.1 Scenario 1
Mongos becomes unavailable: The application server whose mongos has gone down
will not be able to communicate with the cluster, but this will not lead to any data loss
since mongos does not maintain any data of its own. The mongos can be restarted, and
while restarting it syncs up with the config servers to cache the cluster metadata,
after which the application can start its operations normally.
Figure: Mongos becomes unavailable

7.4.2 Scenario 2
One of the mongods of a replica set becomes unavailable in a shard: Since you used
replica sets to provide high availability, there is no data loss. If a primary node is
down, a new primary is chosen, whereas if it's a secondary node, it is disconnected
and the functioning continues normally.

Figure: One of the mongods of the replica set is unavailable


The only difference is that the duplication of the data is reduced, making the system
a little weaker, so in parallel you should check whether the mongod is recoverable. If it
is, it should be recovered and restarted; if it's non-recoverable, you need to
create a new replacement node and add it to the replica set as soon as possible.

7.4.3 Scenario 3
If one of the shards becomes unavailable: In this scenario, the data on that shard will
be unavailable, but the other shards will still be available, so the application won't stop.
The application can continue with its read/write operations; however, the partial
results must be dealt with within the application. In parallel, you should attempt to
recover the shard as soon as possible.

Figure: Shard unavailable

7.4.4 Scenario 4
Only one config server is available out of three: In this scenario, the cluster metadata
becomes read-only; the cluster will not perform any operations that might lead to
changes in the cluster structure, and thereby to a change of metadata such as
chunk migration or chunk splitting. The config servers should be replaced ASAP,
because if all the config servers become unavailable, the cluster will become
inoperable.
Figure: Only one config server available

Questions:

1. Explain the MongoDB data model.


2. Write a short note on JSON and BSON.
3. Explain the following terms:
i) Capped collection
ii) The identifier (_id)
iii) Polymorphic schemas
4. What is an index? Explain its types.
5. What are the conditional operators available in MongoDB?
6. What are the core components (packages) of MongoDB?
7. Write a short note on replica sets.
8. How are failovers handled in MongoDB?
9. What are the components of sharding?
10. Explain the production cluster architecture with all its scenarios.

Multiple Choice Questions

1) A collection and a document in MongoDB are equivalent to which of the following SQL concepts, respectively?

A - Table and Row


B - Table and Column
C - Column and Row
D - Database and Table

2) MongoDB stores all documents in :


A. tables
B. collections
C. rows
D. All of the mentioned

3) In MongoDB, _________ operations modify the data of a single collection.


A. CRUD
B. GRID
C. READ
D. All of the mentioned

4) After starting the mongo shell, your session will use the ________ database by default.
A. mongo
B. master
C. test
D. primary

5) Which of the following methods returns one document?


A. findOne()
B. findOne1()
C. selectOne()
D. All of the mentioned

6) The mongo shell and the MongoDB drivers use __________ as the default write concern.
A. Unacknowledged
B. Acknowledgement
C. Acknowledged
D. All of the mentioned

7) _____________ can be used for batch processing of data and aggregation operations.
A. Hive
B. MapReduce
C. Oozie
D. None of the mentioned

8) Sometimes the failover process may require a ____________ during operation.


A. savepoint
B. rollback
C. commit
D. None of the mentioned

9) The _________ property for an index causes MongoDB to reject duplicate values for the
indexed field.
A. Hashed
B. Unique
C. Multikey
D. None of the mentioned

10) Normalized data models describe relationships using ___________ between documents.
A. relativeness
B. references
C. evaluation
D. None of the mentioned
