Unit 2
Contents
The MongoDB Data Model: The Data Model, JSON and BSON, The
Identifier (_id), Capped Collection, Polymorphic Schemas, Object-
Oriented Programming, Schema Evolution
Recommended Books
• Practical MongoDB, Shakuntala Gupta Edward and Navin Sabharwal, Apress
• Beginning jQuery, Second Edition, Jack Franklin and Russ Ferguson, Apress
• Next Generation Databases, Guy Harrison, Apress
• Beginning JSON, Ben Smith, Apress
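The discussion below refers to a listing of two documents in a Region collection that is not reproduced here. A sketch of what such documents might look like (collection name and field values are illustrative):
> db.region.insert({"R_ID": "REG001", "Name": "United States"})
> db.region.insert({"R_ID": 1234, "Name": "New Delhi", "Country": "India"})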
In this code, you have two documents in the Region collection. Although both documents are part of a single collection, they have different structures: the second document has an additional field of information, Country. In fact, if you look at the R_ID field, it stores a string value in the first document whereas it's a number in the second document.
Thus, a collection's documents can have entirely different schemas. It falls to the application to decide whether to store the documents together in one collection or to use multiple collections.
MongoDB is a document-based database. It uses Binary JSON for storing its data.
JSON stands for JavaScript Object Notation. It’s a standard used for data interchange
in today’s modern Web (along with XML). The format is human and machine
readable. It is not only a great way to exchange data but also a nice way to store
data. All the basic data types (such as strings, numbers, Boolean values, and arrays)
are supported by JSON.
{
"_id" : 1,
"name" : { "first" : "John", "last" : "Doe" },
"publications" : [
{
"title" : "First Book",
"year" : 1989,
"publisher" : "publisher1"
},
{ "title" : "Second Book",
"year" : 1999,
"publisher" : "publisher2"
}
]
}
If you have not explicitly specified any value for the _id key, a unique value will be automatically generated and assigned to it by MongoDB. This key's value is immutable and can be of any data type except an array.
MongoDB has a concept of capping a collection. A capped collection stores documents in insertion order. As the collection reaches its size limit, documents are removed in FIFO (first in, first out) order, meaning the least recently inserted documents are removed first.
This is good for use cases where the insertion order needs to be maintained automatically and deletion of records after a fixed size is required. One such use case is log files that get automatically truncated after reaching a certain size.
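A capped collection is created explicitly with createCollection; a minimal sketch (the size in bytes and the optional max document count are illustrative):
> db.createCollection("log", {capped: true, size: 1048576, max: 1000})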
Suppose you have an application that lets the user upload and share different
content types such as HTML pages, documents, images, videos, etc. Although many
of the fields are common across all of the above-mentioned content types (such as
Name, ID, Author, Upload Date, and Time), not all fields are identical. For example, in
the case of images, you have a binary field that holds the image content, whereas an
HTML page has a large text field to hold the HTML content. In this scenario, the
MongoDB polymorphic schema can be used wherein all of the content node types
are stored in the same collection, such as LoadContent, and each document has
relevant fields only.
{
id: 1,
title: "Hello",
type: "HTMLpage",
text: "<html>Hi..Welcome to my world</html>"
}
...
// Document collection also has a "Picture" document
{
id: 3,
title: "Family Photo",
type: "JPEG",
sizeInMB: 10,........
}
This schema not only enables you to store related data with different structures together in the same collection, it also simplifies querying. The same collection can
be used to perform queries on common fields such as fetching all content uploaded
on a particular date and time as well as queries on specific fields such as finding
images with a size greater than X MB.
When you are working with databases, one of the most important considerations you need to account for is schema evolution (i.e., the impact that changes to the schema have on the running application). The design should be done in a way that has minimal or no impact on the application, meaning no or minimal downtime and no or very minimal code changes.
“mongo shell comes with the standard distribution of MongoDB. It offers a JavaScript
environment with complete access to the language and the standard functions. It
provides a full interface for the MongoDB database.”
https://www.youtube.com/watch?v=EyC_Bi9kAtM
Now we know how the databases and collections are created. As explained earlier,
the documents in MongoDB are in the JSON format.
First, by issuing the db command you will confirm that the context is the mydbpoc
database.
> db
mydbpoc
>
Creating documents: the first document complies with the first prototype, whereas the second document complies with the second prototype. You will create two documents named user1 and user2.
> user1 = {FName: "Test", LName: "User", Age:30, Gender: "M", Country: "US"}
{
"FName" : "Test",
"LName" : "User",
"Age" : 30,
"Gender" : "M",
"Country" : "US"
}
> user2 = {Name: "Test User", Age:45, Gender: "F", Country: "US"}
{ "Name" : "Test User", "Age" : 45, "Gender" : "F", "Country" : "US" }
>
You will next add both these documents (user1 and user2) to the users collection in
the following order
of operations:
> db.users.insert(user1)
> db.users.insert(user2)
The above operation will not only insert the two documents into the users collection but will also create the collection as well as the database. This can be verified using the show collections and show dbs commands: show dbs displays the databases on the server, and show collections displays the collections in the current database.
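A sketch of this verification (the names listed, and any sizes shown by show dbs, will differ on your system):
> show dbs
local
mydbpoc
> show collections
users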
> db.users.find()
{ "_id" : ObjectId("5450c048199484c9a4d26b0a"), "FName" : "Test", "LName" : "User",
"Age" : 30, "Gender": "M", "Country" : "US" }
{ "_id" : ObjectId("5450c05d199484c9a4d26b0b"), "Name" : "Test", User", "Age" : 45,
"Gender" : "F", "Country" : "US"
6.1.2 Explicitly Creating Collections
The user can also explicitly create a collection before executing the insert statement.
db.createCollection("users")
Documents can also be added to a collection using a for loop. The following code inserts 20 users with a for loop:
> for(var i=1; i<=20; i++) db.users.insert({"Name" : "Test User" + i, "Age": 10+i,
"Gender" : "F", "Country" : "India"})
>
In order to verify that the insert is successful, run the find command on the collection.
> db.users.find()
{ "_id" : ObjectId("52f48cf474f8fdcfcae84f79"), "FName" : "Test", "LName" : "User",
"Age" : 30, "Gender" : "M", "Country" : "US" }
{ "_id" : ObjectId("52f48cfb74f8fdcfcae84f7a"), "Name" : "Test User", "Age" : 45
, "Gender" : "F", "Country" : "US" }
................
{ "_id" : ObjectId("52f48eeb74f8fdcfcae84f8c"), "Name" : "Test User18", "Age" :
28, "Gender" : "F", "Country" : "India" }
Type "it" for more
In your case, if you type “it” and press Enter, the following will appear:
> it
{ "_id" : ObjectId("52f48eeb74f8fdcfcae84f8d"), "Name" : "Test User19", "Age" :
29, "Gender" : "F", "Country" : "India" }
{ "_id" : ObjectId("52f48eeb74f8fdcfcae84f8e"), "Name" : "Test User20", "Age" :
30, "Gender" : "F", "Country" : "India" }
>
Since only two documents were remaining, it displays the remaining two documents.
6.1.4 Inserting by Explicitly Specifying _id
The _id field was not specified in the previous inserts, so it was implicitly added. In the following example, you will see how to explicitly specify the _id field when inserting documents into a collection. While explicitly specifying the _id field, you have to keep in mind the uniqueness of the field; otherwise the insert will fail.
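A sketch of such an insert (the _id value and the other fields are illustrative); repeating the same _id in a second insert would fail with a duplicate key error:
> db.users.insert({"_id": 100, "Name": "Explicit Id User", "Age": 25, "Gender": "M", "Country": "US"})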
6.1.5 Update
The update() command is used to update the documents in a collection. The update() method updates a single document by default. If you need to update all documents that match the selection criteria, you can do so by setting the multi option to true.
When working in a real-world application, you may come across a schema evolution
where you might end up adding or removing fields from the documents. Let’s see
how to perform these alterations in the MongoDB database.
The update() operations can be used at the document level, which helps in updating
either a single document or set of documents within a collection.
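For illustration, a sketch of both forms (the filter and values here are arbitrary):
> db.users.update({"Country":"US"}, {$set:{"Country":"UK"}})
> db.users.update({"Country":"US"}, {$set:{"Country":"UK"}}, {multi:true})
The first form updates only the first matching document; the second updates all matching documents.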
Next, let's look at how to add new fields to the documents. In order to add fields to the documents, use the update() command with the $set operator and the multi option. If the field name used with $set does not exist, the field will be added to the documents. The following command will add the field Company to all the documents:
> db.users.update({},{$set:{"Company":"TestComp"}},{multi:true})
>
Issuing the find() command against the users collection, you will find the new field added to all documents.
> db.users.find()
{ "Age" : 30, "Company" : "TestComp", "Country" : "US", "FName" : "Test", "Gender" : "M",
"LName" : "User", "_id" : ObjectId("52f48cf474f8fdcfcae84f79") }
{ "Age" : 45, "Company" : "TestComp", "Country" : "UK", "Gender" : "F", "Name" : "Test
User", "_id" : ObjectId("52f48cfb74f8fdcfcae84f7a") }
{ "Age" : 11, "Company" : "TestComp", "Country" : "UK", "Gender" : "F", ....................
Type "it" for more
>
The following command will remove the field Company from all the documents:
> db.users.update({},{$unset:{"Company":""}},{multi:true})
>
6.1.6 Delete
The remove() method is used to delete documents from a collection. The following command removes all documents whose Gender is "M":
> db.users.remove({"Gender":"M"})
>
The same can be verified by issuing the find() command on Users :
> db.users.find({"Gender":"M"})
>
No documents are returned.
The following command will delete all documents:
> db.users.remove({})
> db.users.find()
Dropping the collection
Finally, if you want to drop the collection, issue the following command:
> db.users.drop()
true
>
In order to validate whether the collection is dropped or not, issue the show collections command.
> show collections
system.indexes
>
6.1.2 Read
In this part of the chapter, you will look at various examples illustrating the querying functionality available as part of MongoDB, which enables you to read the stored data from the database. In order to start with basic querying, first create the users collection and insert data using the insert command.
> user1 = {FName: "Test", LName: "User", Age:30, Gender: "M", Country: "US"}
{
"FName" : "Test",
"LName" : "User",
"Age" : 30,
"Gender" : "M",
"Country" : "US"
}
> user2 = {Name: "Test User", Age:45, Gender: "F", Country: "US"}
{ "Name" : "Test User", "Age" : 45, "Gender" : "F", "Country" : "US" }
> db.users.insert(user1)
> db.users.insert(user2)
> for(var i=1; i<=20; i++) db.users.insert({"Name" : "Test User" + i, "Age": 10+i,
"Gender" : "F", "Country" : "India"})
A selector is like a where condition in SQL, or a filter, that is used to filter the results. A projector is like the select list in SQL; it specifies which data fields to display.
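For example, a query combining a selector and a projector might look like the following (a sketch; concrete examples follow in the next subsections):
> db.users.find({"Country":"India"}, {"Name":1, "Age":1})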
6.1.2.2 Selector
We will now see how to use the selector. The following command will return
all the female users:
> db.users.find({"Gender":"F"})
{ "_id" : ObjectId("52f4a826958073ea07e15071"), "Name" : "Test User", "Age" : 45,
"Gender" : "F", "Country" : "US" }
.............
{ "_id" : ObjectId("52f4a83f958073ea07e15084"), "Name" : "Test User19", "Age" :29,
"Gender" : "F", "Country" : "India" }
Type "it" for more
>
6.1.2.3. Projector
We have seen how to use selectors to filter out documents within the
collection. In the above example, the find() command returns all fields of the
documents matching the selector.
Let's add a projector to the query document where, in addition to the selector, you also mention the specific fields that need to be displayed. Suppose you want to display the name and age of all female users. In this case, along with the selector, a projector is also used.
Execute the following command to return the desired result set:
> db.users.find({"Gender":"F"}, {"Name":1,"Age":1})
{ "_id" : ObjectId("52f4a826958073ea07e15071"), "Name" : "Test User", "Age" :
45 }
..........
Type "it" for more
>
6.1.2.4 sort( )
In MongoDB, the sort order is specified as follows: 1 for ascending and -1 for descending. If in the above example you want to sort the records by ascending order of Age, you execute the following command:
>db.users.find({"Gender":"F"}, {"Name":1,"Age":1}).sort({"Age":1})
{ "_id" : ObjectId("52f4a83f958073ea07e15072"), "Name" : "Test User1", "Age" : 11 }
{ "_id" : ObjectId("52f4a83f958073ea07e15073"), "Name" : "Test User2", "Age" : 12 }
{ "_id" : ObjectId("52f4a83f958073ea07e15074"), "Name" : "Test User3", "Age" : 13 }
..............
{ "_id" : ObjectId("52f4a83f958073ea07e15085"), "Name" : "Test User20", "Age" :30 }
Type "it" for more
If you want to display the records in descending order by name and ascending
order by age , you execute the following command:
>db.users.find({"Gender":"F"},{"Name":1,"Age":1}).sort({"Name":-1,"Age":1})
............
6.1.2.5 limit( )
You will now look at how you can limit the records in your result set. For example, in huge collections with thousands of documents, if you want to return only five matching documents, the limit command enables you to do exactly that. Returning to your previous query of female users who live in either India or the US, say you want to limit the result set and return only two users. The following command needs to be executed:
>db.users.find({"Gender":"F",$or:[{"Country":"India"},{"Country":"US"}]}).limit(2)
6.1.2.6 skip( )
If the requirement is to skip the first two records and return the third and fourth user,
the skip command is used. The following command needs to be executed:
>db.users.find({"Gender":"F",$or:[{"Country":"India"}, {"Country":"US"}]}).limit(2).skip(2)
>
6.1.2.7 findOne( )
Similar to find() is the findOne() command. The findOne() method can take the same parameters as find(), but rather than returning a cursor, it returns a single document. Say you want to return one female user who stays in either India or the US. This can be achieved using a command of the form sketched below:
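A sketch (the projection on Age is inferred from the output that follows and may differ from the original):
> db.users.findOne({"Gender":"F", $or:[{"Country":"India"},{"Country":"US"}]}, {"Age":1})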
"_id" : ObjectId("52f4a826958073ea07e15071"),
"Age" : 45
>
Similarly, if you want to return the first record irrespective of any selector, you can use
> db.users.findOne()
"_id" : ObjectId("52f4a823958073ea07e15070"),
"FName" : "Test",
"LName" : "User",
"Age" : 30,
"Gender" : "M",
"Country" : "US"}
When the find() method is used, MongoDB returns the results of the query as a cursor object. In order to display the result, the mongo shell iterates over the returned cursor. MongoDB enables users to work with the cursor object returned by the find() method. In the next example, you will see how to store the cursor object in a variable and manipulate it using a while loop. Say you want to return all the users in the US. To do so, you create a variable, assign the output of find() to it (which is a cursor), and then iterate with a while loop, printing each document.
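A sketch of such a loop (the variable name c and the selector are illustrative):
> var c = db.users.find({"Country":"US"})
> while(c.hasNext()) printjson(c.next())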
"_id" : ObjectId("52f4a823958073ea07e15070"),
"FName" : "Test",
"LName" : "User",
"Age" : 30,
"Gender" : "M",
"Country" : "US"
6.1.2.9 explain( )
The explain() function can be used to see what steps the MongoDB database is running while executing a query. Starting from version 3.0, the output format of the function and the parameter that is passed to it have changed. It takes an optional verbosity parameter, which determines what the explain output should look like. The verbosity modes are queryPlanner, executionStats, and allPlansExecution; the default is queryPlanner, which is used if nothing is specified.
The following output covers the steps executed when filtering on the user name field:
"queryPlanner" : {
"plannerVersion" : 1,
"namespace" : "mydbproc.users",
"indexFilterSet" : false,
"parsedQuery" : {
"$and" : [ ]
},
"winningPlan" : {
"stage" : "COLLSCAN",
"filter" : {
"$and" : [ ]
},
"direction" : "forward"
},
"rejectedPlans" : [ ]
"executionStats" : {
"executionSuccess" : true,
"nReturned" : 20,
"executionTimeMillis" : 0,
"totalKeysExamined" : 0,
"totalDocsExamined" : 20,
"executionStages" : {
"stage" : "COLLSCAN",
"filter" : {
"$and" : [ ]
},
"nReturned" : 20,
"executionTimeMillisEstimate" : 0,
"works" : 22,
"advanced" : 20,
"needTime" : 1,
"needFetch" : 0,
"saveState" : 0,
"restoreState" : 0,
"isEOF" : 1,
"invalidates" : 0,
"direction" : "forward",
"docsExamined" : 20
},
"allPlansExecution" : [ ]
},
"serverInfo" : {
"port" : 27017,
"version" : "3.0.4",
"gitVersion" : "534b5a3f9d10f00cd27737fbcd951032248b5952"
},
"ok" : 1
Indexes are used to provide high-performance read operations for queries that are used frequently. By default, whenever a collection is created and documents are added to it, an index is created on the _id field of the document. In this section, you will look at how different types of indexes can be created. Let's begin by inserting 1 million documents using a for loop in a new collection called testindx.
>for(i=0;i<1000000;i++){db.testindx.insert({"Name":"user"+i,"Age":Math.floor(Math.
random()*120)})}
Next, issue the find() command to fetch documents whose Name has the value user101. Run the explain() command to check what steps MongoDB is executing in order to return the result set.
> db.testindx.find({"Name":"user101"}).explain("allPlansExecution")
"queryPlanner" : {
"plannerVersion" : 1,
"namespace" : "mydbproc.testindx",
"indexFilterSet" : false,
"parsedQuery" : {
"Name" : {
"$eq" : "user101"
},
"winningPlan" : {
"stage" : "COLLSCAN",
"filter" : {
"Name" : {
"$eq" : "user101"
},
"direction" : "forward"
},
"rejectedPlans" : [ ]
},
"executionStats" : {
"executionSuccess" : true,
"nReturned" : 1,
"executionTimeMillis" : 645,
"totalKeysExamined" : 0,
"totalDocsExamined" : 1000000,
"executionStages" : {
"stage" : "COLLSCAN",
"filter" : {
"Name" : {
"$eq" : "user101"
},
"nReturned" : 1,
"executionTimeMillisEstimate" : 20,
"works" : 1000002,
"advanced" : 1,
"needTime" : 1000000,
"needFetch" : 0,
"saveState" : 7812,
"restoreState" : 7812,
"isEOF" : 1,
"invalidates" : 0,
"direction" : "forward",
"docsExamined" : 1000000
},
"allPlansExecution" : [ ]
},
"serverInfo" : {
"port" : 27017,
"version" : "3.0.4",
"gitVersion" : "534b5a3f9d10f00cd27737fbcd951032248b5952"
},
"ok" : 1
"indexBounds" : {
"Name" : [
"[\"user101\", \"user101\"]"
]
}
}
},
"rejectedPlans" : [ ]
},
"executionStats" : {
"executionSuccess" : true,
"nReturned" : 1,
"executionTimeMillis" : 0,
"totalKeysExamined" : 1,
"totalDocsExamined" : 1,
"executionStages" : {
"stage" : "FETCH",
"nReturned" : 1,
"executionTimeMillisEstimate" : 0,
"works" : 2,
"advanced" : 1,
"needTime" : 0,
"needFetch" : 0,
"saveState" : 0,
"restoreState" : 0,
"isEOF" : 1,
"invalidates" : 0,
"docsExamined" : 1,
"alreadyHasObj" : 0,
"inputStage" : {
"stage" : "IXSCAN",
"nReturned" : 1,
"executionTimeMillisEstimate" : 0,
"works" : 2,
"advanced" : 1,
"needTime" : 0,
"needFetch" : 0,
"saveState" : 0,
"restoreState" : 0,
"isEOF" : 1,
"invalidates" : 0,
"keyPattern" : {
"Name" : 1
},
"indexName" : "Name_1",
"isMultiKey" : false,
"direction" : "forward",
"indexBounds" : {
"Name" : [
"[\"user101\", \"user101\"]"
]
},
"keysExamined" : 1,
"dupsTested" : 0,
"dupsDropped" : 0,
"seenInvalidated" : 0,
"matchTested" : 0
}
},
"allPlansExecution" : [ ]
},
"serverInfo" : {
"host" : "ANOC9",
"port" : 27017,
"version" : "3.0.4",
"gitVersion" : "534b5a3f9d10f00cd27737fbcd951032248b5952"
},
"ok" : 1
}
>
When creating an index, you should keep in mind that the index should cover most of your queries. If you sometimes query only the Name field and at other times you query both the Name and the Age fields, creating a compound index on the Name and Age fields will be more beneficial than an index on either field alone, because the compound index will cover both queries.
The following command creates a compound index on fields Name and Age of the
collection testindx .
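The command itself is not reproduced in the original text; a sketch consistent with the index name Name_1_Age_1 that appears in the explain output below:
> db.testindx.ensureIndex({"Name":1, "Age":1})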
Compound indexes help MongoDB execute queries with multiple clauses more
efficiently. When creating a compound index, it is also very important to keep in
mind that the fields that will be used for exact matches (e.g. Name : "S1" ) come first,
followed by fields that are used in ranges (e.g. Age : {"$gt":20} ).
Hence the above index will be beneficial for the following query:
>db.testindx.find({"Name": "user5","Age":{"$gt":25}}).explain("allPlansExecution")
"queryPlanner" : {
"plannerVersion" : 1,
"namespace" : "mydbproc.testindx",
"indexFilterSet" : false,
"parsedQuery" : {
"$and" : [
"Name" : {
"$eq" : "user5"
},
"Age" : {
"$gt" : 25
},
"winningPlan" : {
"stage" : "KEEP_MUTATIONS",
"inputStage" : {
"stage" : "FETCH",
"filter" : {
"Age" : {
"$gt" : 25
},
............................
"indexBounds" : {
"Name" : [
"[\"user5\", \"user5\"
},
"rejectedPlans" : [
"stage" : "FETCH",
......................................................
"indexName" : "Name_1_Age_1",
"isMultiKey" : false,
"direction" : "forward",
.....................................................
"executionStats" : {
"executionSuccess" : true,
"nReturned" : 1,
"executionTimeMillis" : 0,
"totalKeysExamined" : 1,
"totalDocsExamined" : 1,
.....................................................
"inputStage" : {
"stage" : "FETCH",
"filter" : {
"Age" : {
"$gt" : 25
},
"nReturned" : 1,
"executionTimeMillisEstimate" : 0,
"works" : 2,
"advanced" : 1,
"allPlansExecution" : [
"nReturned" : 1,
"executionTimeMillisEstimate" : 0,
"totalKeysExamined" : 1,
"totalDocsExamined" : 1,
"executionStages" : {
.............................................................
"serverInfo" : {
"port" : 27017,
"version" : "3.0.4",
"gitVersion" : "534b5a3f9d10f00cd27737fbcd951032248b5952"
},
"ok" : 1
>
In MongoDB, a sort operation that uses an indexed field to sort documents provides the greatest performance. As in other databases, indexes in MongoDB have an order; because of this, if an index is used to access documents, it returns results in the same order as the index. A compound index needs to be created when sorting on multiple fields. In a compound index, the output can be in the sorted order of either an index prefix or the full index. An index prefix is a subset of the compound index that contains one or more fields from the start of the index. For example, { x: 1 } and { x: 1, y: 1 } are index prefixes of the compound index { x: 1, y: 1, z: 1 }. The sort operation can be on any of the index prefix combinations. A compound index can only help with sorting if the sort keys form a prefix of the index.
> db.testindx.find().sort({"Age":1})
> db.testindx.find().sort({"Age":1,"Name":1})
You can diagnose how MongoDB is processing a query by using the explain()
command.
Before looking at unique indexes, first drop the existing indexes on the collection:
> db.testindx.dropIndexes()
The following command will create a unique index on the Name field of the
testindx collection:
> db.testindx.ensureIndex({"Name":1},{"unique":true})
Now if you try to insert duplicate names in the collection as shown below,
MongoDB returns an error
> db.testindx.insert({"Name":"uniquename"})
> db.testindx.insert({"Name":"uniquename"})
If you check the collection, you’ll see that only the first uniquename was stored.
> db.testindx.find({"Name":"uniquename"})
>
Uniqueness can be enabled for compound indexes also, which means that although individual fields can have duplicate values, the combination of values across the indexed fields must be unique.
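For example, a unique compound index could be created as follows (a sketch; the choice of fields is illustrative):
> db.testindx.ensureIndex({"Name":1, "Age":1}, {"unique":true})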
>
> db.testindx.insert({"Name":"usercit"})
The indexes on a collection can be listed using db.collectionName.getIndexes(). For example, the following command will return all indexes created on the testindx collection:
> db.testindx.getIndexes()
6.1.3.6 dropIndex
The following command will remove the Name field index from the testindx
collection:
> db.testindx.dropIndex({"Name":1})
{ "nIndexesWas" : 3, "ok" : 1 }
>
6.1.3.7 ReIndex
The reIndex() command drops all indexes on a collection and rebuilds them:
> db.testindx.reIndex()
"nIndexesWas" : 2,
..............
"ok" : 1
>
When a query is executed for the first time, MongoDB creates multiple execution plans, one for each index that is available for the query. It lets the plans execute in turns within a certain number of ticks, until the plan that executes the fastest finishes. The result is then returned to the system, which remembers the index that was used by the fastest execution plan.
For subsequent queries, the remembered index will be used until a certain number of updates has happened within the collection. After the update limit is crossed, the system will again follow the process to find the best index applicable at that time. The reevaluation of the query plans will also happen when events such as an index being added or dropped, or the mongod process being restarted, occur.
This section covers advanced querying using conditional operators and regular expressions in the selector part. Each of these operators and regular expressions provides you with more control over the queries you write and consequently over the information you can fetch from the MongoDB database. You will first create the collection and insert a few sample documents.
>db.students.insert({Name:"S1",Age:25,Gender:"M",Class:"C1",Score:95})
>db.students.insert({Name:"S2",Age:18,Gender:"M",Class:"C1",Score:85})
>db.students.insert({Name:"S3",Age:18,Gender:"F",Class:"C1",Score:85})
>db.students.insert({Name:"S4",Age:18,Gender:"F",Class:"C1",Score:75})
>db.students.insert({Name:"S5",Age:18,Gender:"F",Class:"C2",Score:75})
>db.students.insert({Name:"S6",Age:21,Gender:"M",Class:"C2",Score:100})
>db.students.insert({Name:"S7",Age:21,Gender:"M",Class:"C2",Score:100})
>db.students.insert({Name:"S8",Age:25,Gender:"F",Class:"C2",Score:100})
>db.students.insert({Name:"S9",Age:25,Gender:"F",Class:"C2",Score:90})
>db.students.insert({Name:"S10",Age:28,Gender:"F",Class:"C3",Score:90})
> db.students.find()
{ "_id" : ObjectId("52f874faa13cd6a65998734d"), "Name" : "S1", "Age" : 25,
"Gender" : "M",
"Class" : "C1", "Score" : 95 }
.......................
{ "_id" : ObjectId("52f8758da13cd6a659987356"), "Name" : "S10", "Age" : 28,
"Gender" : "F",
"Class" : "C3", "Score" : 90 }
If you want to find all students whose age is less than or equal to 25 (Age <= 25), execute the following:
> db.students.find({"Age":{"$lte":25}})
{ "_id" : ObjectId("52f874faa13cd6a65998734d"), "Name" : "S1", "Age" : 25,
"Gender" : "M",
"Class" : "C1", "Score" : 95 }
....................
{ "_id" : ObjectId("52f87578a13cd6a659987355"), "Name" : "S9", "Age" : 25,
"Gender" : "F",
"Class" : "C2", "Score" : 90 }
>
6.2.3 MapReduce
The MapReduce framework enables division of the task, which in this case is data aggregation, across a cluster of computers in order to reduce the time it takes to aggregate the data set. It consists of two parts: Map and Reduce. Here's a more specific description: MapReduce is a framework that is used to process problems that are highly distributable across enormous datasets and are run using multiple nodes. If all the nodes have the same hardware, these nodes are collectively referred to as a cluster; otherwise, it's referred to as a grid. This processing can occur on structured data (data stored in a database) and unstructured data (data stored in a file system).
• “Map”: In this step, the node that is acting as the master takes the input parameter
and divides the big problem into multiple small sub-problems. These sub-problems
are then distributed across the worker nodes. The worker nodes might further divide
the problem into sub-problems. This leads to a multi-level tree structure. The worker
nodes will then work on the sub-problems within them and return the answer back
to the master node.
• “Reduce”: In this step, all the sub-problems' answers are available with the master node, which then combines them and produces the final output, which is the answer to the big problem you were trying to solve.
In order to understand how it works, let’s consider a small example where you will
find out the number of male and female students in your collection.
This involves the following steps: first you create the map and reduce functions and
then you call the mapReduce function and pass the necessary arguments.
Let’s start by defining the map function:
> var map = function(){emit(this.Gender,1);};
>
The map function takes a document as input and, based on the Gender field, emits documents of the type {"F", 1} or {"M", 1}.
Next, you create the reduce function:
> var reduce = function(key, value){return Array.sum(value);};
This will group the documents emitted by the map function on the key field, which in your example is Gender, and will return the sum of the values, which in the above example are emitted as 1. The output of the reduce function defined above is a gender-wise count.
Finally, you put them together using the mapReduce function, like so:
> db.students.mapReduce(map, reduce, {out: "mapreducecount1"})
{
"result" : "mapreducecount1",
"timeMillis" : 29,
"counts" : {
"input" : 15,
"emit" : 15,
"reduce" : 2,
"output" : 2
},
"ok" : 1,
This applies the map and reduce functions you defined to the students collection. The final result is stored in a new collection called mapreducecount1. In order to verify it, run the find() command on the mapreducecount1 collection, as shown:
> db.mapreducecount1.find()
{ "_id" : "F", "value" : 6 }
{ "_id" : "M", "value" : 9 }
>
Here's one more example to explain the workings of MapReduce. Let's use MapReduce to find the class-wise average score. As you saw in the above example, you first create the map function, then the reduce function, and finally you combine them to store the output in a collection in your database. The code snippet is:
> var map_1 = function(){emit(this.Class,this.Score);};
> var reduce_1 = function(key, value){return Array.avg(value);};
>db.students.mapReduce(map_1,reduce_1, {out:"MR_ClassAvg_1"})
{
"result" : "MR_ClassAvg_1",
"timeMillis" : 4,
"counts" : {
"input" : 15, "emit" : 15,
"reduce" : 3 , "output" : 5
},
"ok" : 1,
}
> db.MR_ClassAvg_1.find()
{ "_id" : "Biology", "value" : 90 }
{ "_id" : "C1", "value" : 85 }
{ "_id" : "C2", "value" : 93 }
{ "_id" : "C3", "value" : 90 }
{ "_id" : "Chemistry", "value" : 90 }
>
The first step is to define the map function, which loops through the collection documents and returns output of the form {"Class": Score}, for example {"C1": 95}. The second step groups on the class and computes the average of the scores for that class. The third step combines the results; it defines the collection to which the map and reduce functions need to be applied and where to store the output, which in this case is a new collection called MR_ClassAvg_1. In the last step, you use find() to check the resulting output.
6.2.4 aggregate()
The previous section introduced the MapReduce function. In this section, you will get a glimpse of MongoDB's aggregation framework. The aggregation framework enables you to find aggregate values without using the MapReduce function. Performance-wise, the aggregation framework is faster than the MapReduce function, but you always need to keep in mind that MapReduce is meant for batch processing and not for real-time analysis.
You will next reproduce the two outputs discussed above using the aggregate() function. The first output was the count of male and female students. This can be achieved by executing the following command:
> db.students.aggregate({$group:{_id:"$Gender", totalStudent: {$sum: 1}}})
{ "_id" : "F", "totalStudent" : 6 }
{ "_id" : "M", "totalStudent" : 9 }
>
Similarly, in order to find out the class-wise average score, the following command
can be executed:
> db.students.aggregate({$group:{_id:"$Class", AvgScore: {$avg: "$Score"}}})
{ "_id" : "Biology", "AvgScore" : 90 }
{ "_id" : "C3", "AvgScore" : 90 }
{ "_id" : "Chemistry", "AvgScore" : 90 }
{ "_id" : "C2", "AvgScore" : 93 }
{ "_id" : "C1", "AvgScore" : 85 }
The MongoDB database provides two options for designing a data model: the user can either embed related objects within one another, or reference them from one another using an ID. In this section, you will explore these options. In order to understand them, you will design a blogging application and demonstrate the usage of the two options.
This data is actually in the first normal form. You will have lots of redundancy
because you can have multiple comments against the posts and multiple tags can be
associated with the post. The problem with redundancy, of course, is that it
introduces the possibility of inconsistency, where various copies of the same data
may have different values. To remove this redundancy, you need to further normalize
the data by splitting it into multiple tables. As part of this step, you must identify a
key column that uniquely identifies each row in the table so that you can create links
between the tables. The above scenarios when modelled using the 3NF normal forms
will look like the RDBMs diagram shown in Figure 6-3 .
6.3.1.2 The Problem with Normal Forms
As mentioned, the nice thing about normalization is that it allows for easy updating
without any redundancy(i.e. it helps keep the data consistent). Updating a user name
means updating the name in the Users table. However, a problem arises when you
try to get the data back out. For instance, to find all tags and comments associated
with posts by a specific user, the relational database programmer uses a JOIN. By
using a JOIN, the database returns all data as per the application screen design, but
the real problem is what operation the database performs to get that result set.
Generally, any RDBMS reads from a disk and does a seek, which takes well over 99%
of the time spent reading a row. When it comes to disk access, random seeks are the
enemy. The reason why this is so important in this context is because JOINs typically
require random seeks. The JOIN operation is one of the most expensive operations
within a relational database. Additionally, if you end up needing to scale your
database to multiple servers, you introduce the problem of generating a distributed
join, a complex and generally slow operation.
6.3.2 MongoDB Document Data Model Approach
In MongoDB, data is stored in documents. Fortunately for us as application
designers, this opens some new possibilities in schema design. Unfortunately for us,
it also complicates our schema design process. Now when faced with a schema
design problem there’s no longer a fixed path of normalized database design, as
there is with relational databases. In MongoDB, the schema design depends on the
problem you are trying to solve. If you must model the above using the MongoDB
document model, you might store the blog data in a document as follows:
{
"_id" : ObjectId("509d27069cc1ae293b36928d"),
"title" : "Sample title",
"body" : "Sample text.",
"tags" : [
"Tag1",
"Tag2",
"Tag3",
"Tag4"
],
"created_date" : ISODate("2015-07-06T12:41:39.110Z"),
"author" : "Author 1",
"category_id" : ObjectId("509d29709cc1ae293b369295"),
"comments" : [
{
"subject" : "Sample comment",
"body" : "Comment Body",
"author " : "author 2",
"created_date":ISODate("2015-07-06T13:34:23.929Z")
}
]
}
As you can see, you have embedded the comments and tags within the post document itself. Alternatively, you could "normalize" the model a bit by referencing the comments and tags by their _id fields:
// Authors document:
{
"_id": ObjectId("509d280e9cc1ae293b36928e "),
"name": "Author 1",}
// Tags document:
{
"_id": ObjectId("509d35349cc1ae293b369299"),
"TagName": "Tag1",.....}
// Comments document:
{
"_id": ObjectId("509d359a9cc1ae293b3692a0"),
"Author": ObjectId("508d27069cc1ae293b36928d"),
.......
"created_date" : ISODate("2015-07-06T13:34:59.336Z")
}
//Category Document
{
"_id": ObjectId("509d29709cc1ae293b369295"),
"Category": "Catgeory1"......
}
//Posts Document
{
"_id" : ObjectId("509d27069cc1ae293b36928d"),
"title" : "Sample title","body" : "Sample text.",
"tags" : [ ObjectId("509d35349cc1ae293b369299"),
ObjectId("509d35349cc1ae293b36929c")
],
"created_date" : ISODate("2015-07-06T13:41:39.110Z"),
"author_id" : ObjectId("509d280e9cc1ae293b36928e"),
"category_id" : ObjectId("509d29709cc1ae293b369295"),
"comments" : [
ObjectId("509d359a9cc1ae293b3692a0"),
]}
6.3.2.2 Embedding
Embedding can be useful when you want to fetch some set of data and display it on
the screen, such as a page that displays comments associated with the blog; in this
case the comments can be embedded in the Blogs document. The benefit of this
approach is that since MongoDB stores the documents contiguously on disk, all the
related data can be fetched in a single seek.
Apart from this, since JOINs are not supported and you used referencing in this case,
the application
might do something like the following to fetch the comments data associated with
the blog.
1. Fetch the associated comments _id from the blogs document.
2. Fetch the comments document based on the comments_id found in the first step.
If you take this approach, which is referencing, not only does the database have to
do multiple seeks to
find your data, but additional latency is introduced into the lookup since it now takes
two round trips to the database to retrieve your data.
If the application frequently accesses the comments data along with the blogs, then
almost certainly embedding the comments within the blog documents will have a
positive impact on the performance.
Another concern that weighs in favor of embedding is the desire for atomicity and isolation when writing data. MongoDB is designed without multi-document transactions. In MongoDB, atomicity is provided only at the single-document level, so data that needs to be updated together atomically
needs to be placed together in a single document. When you update data in your
database, you must ensure that your update either succeeds or fails entirely, never
having a “partial success,” and that no other database reader ever sees an incomplete
write operation.
6.3.2.3 Referencing
We have seen that embedding is the approach that will provide the best
performance in many cases; it also provides data consistency guarantees. However, in
some cases, a more normalized model works better in MongoDB.
One reason for having multiple collections and adding references is the increased
flexibility it gives when querying the data. Let’s understand this with the blogging
example mentioned above. You saw how to use embedded schema, which will work
very well when displaying all the data together on a single page (i.e. the page that
displays the blog post followed by all of the associated comments). Now suppose
you have a requirement to search for the comments posted by a particular user. The
query (using this embedded schema) would be as follows:
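A sketch of such a query, assuming the posts are stored in a posts collection and the field names follow the embedded document shown earlier (the author value is illustrative):
> db.posts.find({"comments.author": "author 2"}, {"comments": 1})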
6.3.3.3 Indexes
Indexes can be created to support commonly used queries and increase performance. By default, an index is created by MongoDB on the _id field.
The following are a few points to consider when creating indexes:
• At least 8KB of data space is required by each index.
• For write operations, an index addition has some negative performance impact.
Hence for collections with heavy writes, indexes might be expensive because for
each insert, the keys must be added to all the indexes.
• Indexes are beneficial for collections with heavy read operations, that is, where the proportion of read-to-write operations is high. Un-indexed read operations are not affected by an index.
6.3.3.4 Sharding
One of the important factors when designing the application model is whether to partition the data or not. This is implemented using sharding in MongoDB.
Sharding is also referred to as partitioning of data. In MongoDB, a collection is partitioned with its documents distributed across a cluster of machines, which are referred to as shards. This can have a significant impact on performance.
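A sketch of the shell helpers used to shard a collection (the database name, collection, and shard key are illustrative):
> sh.enableSharding("mydbpoc")
> sh.shardCollection("mydbpoc.users", {"Country": 1})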
https://www.youtube.com/watch?v=qc0jRBwa0WU
7.1.1 mongod
The primary daemon in a MongoDB system is known as mongod. This daemon handles all the data requests, manages the data format, and performs background management operations. When mongod is run without any arguments, it uses the default data directory, which is C:\data\db or /data/db, and the default port 27017, where it listens for socket connections.
It's important to ensure that the data directory exists and that you have write permissions to it before the mongod process is started. If the directory doesn't exist or you don't have write permissions, the start of this process will fail. If the default port 27017 is not available, the server will fail to start.
mongod also has an HTTP server that listens on a port 1000 higher than the default port, so if you started mongod with the default port 27017, the HTTP server will be on port 28017 and will be accessible using the URL http://localhost:28017. This basic HTTP server provides administrative information about the database.
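A sketch of starting mongod with an explicit data directory and port (the path is illustrative):
mongod --dbpath /data/db --port 27017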
7.1.2 mongo
mongo provides an interactive JavaScript interface for developers to test queries and operations directly on the database and for system administrators to manage the database. This is all done via the command line. When the mongo shell is started, it connects to the default database called test. This database connection value is assigned to the global variable db. As a developer or administrator you need to switch from test to your own database after the first connection is made. You can do this with use <databasename>.
7.1.3 mongos
mongos is used in MongoDB sharding. It acts as a routing service that processes
queries from the application layer and determines where in the sharded cluster the
requested data is located. We will discuss mongos in more detail in the sharding
section. Right now you can think of mongos as the process that routes the queries to
the correct server holding the data.
7.1.5 Deployment
Standalone deployment is used for development purposes; it doesn't ensure any redundancy of data and it doesn't ensure recovery in case of failures, so it's not recommended for use in a production environment. A standalone deployment has the following components: a single mongod and a client connecting to the mongod, as shown in Figure 7-1.
Figure 7-1. Standalone deployment
7.2 Replication
In a standalone deployment, if the mongod is not available, you risk losing all the data, which is not acceptable in a production environment. Replication is used to offer safety against such data loss. Replication provides for data redundancy by replicating data on different nodes, thereby providing protection of data in case
by replicating data on different nodes, thereby providing protection of data in case
of node failure. Replication provides high availability in a MongoDB deployment.
Replication also simplifies certain administrative tasks where the routine tasks such as
backups can be offloaded to the replica copies, freeing the main copy to handle the
important application requests. In some scenarios, it can also help in scaling the
reads by enabling the client to read from the different copies of data.
https://www.youtube.com/watch?v=UYlHGGluJx8
Master/Slave Replication
In MongoDB, traditional master/slave replication is available, but it is recommended only when the deployment needs to replicate across more than 50 nodes, since replica sets are limited to 50 members. In this type of replication, there is one master and a number of slaves that replicate the data from the master. The only advantage of this type of replication is that there's no restriction on the number of slaves within a cluster. However, thousands of slaves will overburden the master node, so in practical scenarios it's better to have fewer than a dozen slaves. In addition, this type of replication doesn't automate failover and provides less redundancy.
In a basic master/slave setup, you have two types of mongod instances: one instance is in the master mode and the remaining are in the slave mode, as shown in Figure 7-2. Since the slaves replicate from the master, all slaves need to be aware of the master's address.
The master node maintains a capped collection (oplog) that stores an ordered history
of logical writes to the database.
The slaves replicate the data using this oplog collection. Since the oplog is a capped
collection, if the slave’s state is far behind the master’s state, the slave may become
out of sync. In that scenario, the replication will stop and manual intervention will be
needed to re-establish the replication.
There are two main reasons behind a slave becoming out of sync:
• The slave shuts down or stops and restarts later. During this time, the oplog may
have deleted the log of operations required to be applied on the slave.
• The slave is slow in executing the updates that are available from the master.
Replica Set
The replica set is a sophisticated form of the traditional master/slave replication and is the recommended method in MongoDB deployments.
Replica sets are basically a type of master/slave replication, but they provide automatic failover. A replica set has one master, termed the primary, and multiple slaves, termed secondaries in the replica set context; however, unlike master/slave replication, no single node is permanently fixed as the primary in a replica set.
If the master goes down in a replica set, one of the slave nodes is automatically promoted to master. The clients start connecting to the new master, and both data and application remain available. In a replica set, this failover happens in an automated fashion. We will explain the details of how this process happens later.
The primary node is selected through an election mechanism. If the primary goes down, the elected node becomes the new primary.
Figure 7-3 shows how a two-member replica set failover happens. Let's discuss the various steps that happen for a two-member replica set in failover.
Replica set replication has a limitation on the number of members. Prior to version 3.0, the limit was 12, but it was raised to 50 in version 3.0. So a replica set can now have a maximum of 50 members, and at any given point in time in a 50-member replica set, only 7 can participate in a vote.
• Primary member: A replica set can have only one primary, which is elected by the voting nodes in the replica set. Any node with a nonzero priority (the default is 1) can be elected as primary. The client directs all write operations to the primary member, and they are later replicated to the secondary members.
• Secondary member: A normal secondary member holds a copy of the data. The secondary member can vote and can also be a candidate for promotion to primary in case of failover of the current primary.
In addition to this, a replica set can have other types of secondary members.
7.2.3 Elections
In order to get elected, a server needs not just a majority of the votes cast but a majority of the total votes. If there are X servers, with each server having 1 vote, then a server can become primary only when it has at least [(X/2) + 1] votes; for example, in a five-member set, a member needs at least 3 votes to become primary.
If a server gets the required number of votes or more, then it will become primary. The primary that went down still remains part of the set; when it is back up, it will act as a secondary server until it gets a majority of votes again.
The complication with this type of voting system is that you cannot have just two nodes acting as master and slave. In this scenario, you will have a total of two votes, and to become a master, a node will need the majority of votes, which in this case means both of them. If one of the servers goes down, the other server will end up having one vote out of two, and it will never be promoted to master, so it will remain a slave.
In case of network partitioning, the master will lose the majority of votes, since it will have only its own vote, and it will be demoted to slave; the node acting as slave will also remain a slave in the absence of a majority of votes. You will end up with two slaves until both servers can reach each other again.
A replica set has a number of ways to avoid such situations. The simplest way is to use an arbiter to help resolve such conflicts. An arbiter is very lightweight and is just a voter, so it can run on either of the servers itself.
Let's now see how the above scenario changes with the use of an arbiter, considering the network partitioning scenario first. If you have a master, a slave, and an arbiter, each has one vote, totalling three votes. If a network partition occurs with the master and arbiter in one data center and the slave in another data center, the master will remain master since it still has the majority of votes.
7.2.3.1 Example - Working of Election Process in More Details
Let's assume you have a replica set with the following three members: A1, B1, and C1. Each member exchanges a heartbeat request with the other members every few seconds, and the members respond with their current situation information to such requests. A1 sends out heartbeat requests to B1 and C1. B1 and C1 respond with their current situation information, such as the state they are in (primary or secondary), their current clock time, their eligibility to be promoted as primary, and so on. A1 receives all this information and updates its "map" of the set, which maintains information such as members whose state has changed, members that have gone down or come up, and the round trip time.
While updating its map, A1 will check a few things depending on its state:
• If A1 is primary and one of the members has gone down, it will ensure that it's still able to reach a majority of the set. If it's not able to do so, it will demote itself to the secondary state.
Demotions: There’s a problem when A1 undergoes a demotion. By default in
MongoDB writes are fire-and-forget (i.e. the client issues the writes but doesn’t
wait for a response). If an application is doing the default writes when the
primary is stepping down, it will never realize that the writes are actually not
happening and might end up losing data. Hence it’s recommended to use safe
writes. In this scenario, when the primary is stepping down, it closes all its client
connections, which will result in socket errors to the clients. The client libraries
then need to recheck who the new primary is and will be saved from losing their
write operations data.
• If A1 is a secondary and if the map has not changed, it will occasionally check
whether it should elect itself.
The first task A1 will do is run a sanity check, answering a few questions such as: Does A1 think it's already primary? Does another member think it's primary? Is A1 ineligible for election? If any of these basic checks fail, A1 will continue idling as is; otherwise, it will proceed with the election process:
• A1 sends a message to the other members of the set, which in this case are B1
and C1, saying “I am planning to become a primary. Please suggest”
• When B1 and C1 receive the message, they will check the view around them.
They will run through a big list of sanity checks, such as, Is there any other
node that can be primary? Does A1 have the most recent data or is there any
other node that has the most recent data? If all the checks seem ok, they send
a “go-ahead” message; however, if any of the checks fail, a “stop election”
message is sent.
• If any of the members send a “stop election” reply, the election is cancelled and
A1 remains a secondary member.
• If the “go-ahead” is received from all, A1 goes to the election process
final phase.
7.2.4.1 Oplog
Oplog stands for the operation log. An oplog is a capped collection where all the operations that modify the data are recorded.
The oplog is maintained in a special database named local, in the collection oplog.$main (oplog.rs in the case of a replica set). Every operation is maintained as a document, where each document corresponds to one operation performed on the master server. The document contains various keys, including the following:
• ts: This stores the timestamp when the operation was performed. It's an internal type composed of a 4-byte timestamp and a 4-byte incrementing counter.
• op: This stores the type of operation performed. The value is stored as a 1-byte code (e.g., it stores an "i" for an insert operation).
• ns: This key stores the collection namespace on which the operation was performed.
• o: This key specifies the operation that is performed. In case of an insert, this will store the document to insert.
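For illustration, an oplog entry for an insert might look roughly like this (a sketch; the timestamp, namespace, and document values are invented):
{ "ts" : Timestamp(1436193299, 1), "op" : "i", "ns" : "mydbpoc.users", "o" : { "_id" : 101, "Name" : "Test User1" } }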
Only operations that change the data are maintained in the oplog because it’s a
mechanism for ensuring that the secondary node data is in sync with the primary
node data.
The operations that are stored in the oplog are transformed so that they remain
idempotent, which means that even if it’s applied multiple times on the secondary,
the secondary node data will remain consistent. Since the oplog is a capped
collection, with every new addition of an operation, the oldest operations are
automatically moved out. This is done to ensure that it does not grow beyond a pre-
set bound, which is the oplog size.
7.2.4.4 Starting Up
When a node is started, it checks its local collection to find out the
lastOpTimeWritten . This is the time of the latest op that was applied on the
secondary.
The following shell helper can be used to find the latest op in the shell:
> rs.debug.getLastOpWritten()
The output returns a field named ts , which depicts the last op time.
If a member starts up and finds the ts entry, it starts by choosing a target to sync
from and it will start syncing as in a normal operation. However, if no entry is found,
the node will begin the initial sync process.
The output field of syncingTo is present only on secondary nodes and provides
information on the node from which it is syncing.
7.2.5 Failover
In this section, you will look at how primary and secondary member failovers are
handled in replica sets. All members of a replica set are connected to each other. As
shown in Figure 7-5 , they exchange a heartbeat message amongst each other.
7.2.5.2 Rollbacks
In the scenario of a primary node change, the data on the new primary is assumed to be the latest data in the system. When the former primary rejoins, any operations that were applied on it but not replicated to the other members will be rolled back.
The rollback operation reverts all the write operations that were not replicated across the replica set. This is done in order to maintain database consistency across the replica set.
When connecting to the new primary, all nodes go through a resync process to ensure the rollback is accomplished. The nodes look through the operations that are not present on the new primary, and then they query the new primary to return an updated copy of the documents that were affected by those operations.
The nodes are in the process of resyncing and are said to be recovering; until the
process is complete, they will not be eligible for primary election.
This happens very rarely, and if it happens, it is often due to network partition with
replication lag where the secondaries cannot keep up with the operation’s
throughput on the former primary.
It needs to be noted that if the write operations replicate to other members before
the primary steps down, and those members are accessible to majority of the nodes
of the replica set, the rollback does not occur.
The rollback data is written to a BSON file with a filename of the form <database>.<collection>.<timestamp>.bson in the database's dbpath directory.
The administrator can decide to either ignore or apply the rollback data. Applying
the rollback data can only begin when all the nodes are in sync with the new primary
and have rolled back to a consistent state.
The content of the rollback files can be read using bsondump and then needs to be manually applied to the new primary using mongorestore.
There is no method in MongoDB to handle rollback situations automatically, so manual intervention is required to apply rollback data. While applying the rollback, it's vital to ensure that the changes are replicated to either all or at least some of the members in the set, so that in case of another failover further rollbacks can be avoided.
7.2.5.3 Consistency
The replica set members keep replicating data among each other by reading the oplog. How is the consistency of data maintained? In this section, you will look at how MongoDB ensures that you always access consistent data.
In MongoDB, although reads can be routed to the secondaries, writes are always routed to the primary, eradicating the scenario where two nodes simultaneously try to update the same data set. The data set on the primary node is always consistent.
If the read requests are routed to the primary node, it will always see the up-to-date changes, which means the read operations are always consistent with the last write operations. However, if the application has changed the read preference to read from secondaries, the user might not see the latest changes or might see previous states. This is because writes are replicated asynchronously to the secondaries.
Figure 7-6. Members replica set with primary, secondary, and arbiter
2. Replica set fault tolerance is the number of members that can go down while the replica set still has enough members to elect a primary in case of failure. Table 7-1 indicates the relationship between the member count in the replica set and its fault tolerance. Fault tolerance should be considered when deciding on the number of members.
Although the primary purpose of the secondaries is to ensure data availability in case
of downtime of the primary node, there are other valid use cases for secondaries.
They can be dedicated to performing backup operations or data processing jobs, or
used to scale out reads. One way to scale reads is to issue the read queries against
the secondary nodes; doing so reduces the workload on the primary.
One important point to consider when using secondaries for scaling read operations
is that MongoDB replication is asynchronous, which means that if any write or update
operation is performed on the primary's data, the secondary's data will be
momentarily out of date. If the application in question is read-heavy, is accessed over
a network, and does not need up-to-date data, the secondaries can be used to scale
out reads and provide good read throughput.
Although by default the read requests are routed to the primary node, the requests
can be distributed over secondary nodes by specifying the read preferences.
Figure 7-9 depicts the default read preference.
Figure 7-9. Default read preference
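As a minimal sketch of changing this behavior from the mongo shell (the posts collection and the query shown are assumptions, not part of this setup), reads can be routed to secondaries when slightly stale data is acceptable:
// Route reads for this shell connection to a secondary when one is available
db.getMongo().setReadPref("secondaryPreferred")
// Subsequent queries on this connection may now be served by a secondary
db.posts.find({ author: "john" })
Drivers expose the same read preference modes (primary, primaryPreferred, secondary, secondaryPreferred, and nearest) through their connection options.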
In order to understand how this command is executed, say you have two members,
one named primary and the other named secondary, which is syncing its data from
the primary. But how will the primary know the point up to which the secondary is
synced? Since the secondary queries the primary's oplog for ops to apply, if the
secondary requests an op written at, say, time t, it implies to the primary that the
secondary has replicated all ops written before t.
The following are the steps that a write concern takes.
1. The write operation is directed to the primary.
2. The operation is written to the oplog of the primary, with ts depicting the time of
the operation.
3. A w: 2 is issued, so the write operation needs to be written to one more server
before it's marked successful.
4. The secondary queries the primary's oplog for the op, and it applies the op.
5. Next, the secondary sends a request to the primary asking for ops with ts
greater than t.
6. At this point, the primary knows that the operations up to t have been applied by
the secondary, since it is requesting ops with {ts: {$gt: t}}.
7. The writeConcern finds that a write has occurred on both the primary and the
secondary, satisfying the w: 2 criteria, and the command returns success.
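A minimal sketch of issuing such a write from the mongo shell follows; the products collection and the document are assumptions used only for illustration. The insert is acknowledged only after the primary and one secondary have written it, and wtimeout prevents the call from blocking forever if no secondary is reachable:
db.products.insertOne(
    { item: "card", qty: 15 },
    { writeConcern: { w: 2, wtimeout: 5000 } }
)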
The following examples assume a replica set named testset that has the
configuration shown in Table 7-2 .
Finally, the following command needs to be issued to add the new mongod to the
replica set:
testset:PRIMARY> rs.add("ANOC9:27024")
{ "ok" : 1 }
The myState field's value indicates the status of the member, and it can have the
values shown in Table 7-3.
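As a quick way to inspect these states from the mongo shell (a sketch, assuming you are connected to a member of the testset replica set), rs.status() exposes both myState and a readable state for every member:
// State of the member this shell is connected to: e.g. 1 = PRIMARY, 2 = SECONDARY
rs.status().myState
// Name and readable state of every member in the set
rs.status().members.map(function (m) { return m.name + " : " + m.stateStr })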
7.2.3.6 Forcing a New Election
The current primary server can be forced to step down using the rs.stepDown()
command. This forces the start of an election for a new primary.
This command is useful in the following scenarios:
1. When you are simulating the impact of a primary failure, forcing the cluster to fail
over. This lets you test how your application responds in such a scenario.
2. When the primary server needs to be taken offline, either for a maintenance
activity, for an upgrade, or to investigate the server.
3. When a diagnostic process needs to be run against the data structures.
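A minimal example from the mongo shell (the 120-second value is just an illustration): the current primary steps down and is not eligible for re-election for that period, which triggers an election among the remaining members.
rs.stepDown(120)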
7.3 Sharding
A page fault happens when data that is not in memory is accessed by MongoDB. If
free memory is available, the OS will load the requested page directly into memory;
however, in the absence of free memory, a page in memory is written to disk and
then the requested page is loaded into memory, slowing down the process. A few
operations can accidentally purge a large portion of the working set from memory,
with an adverse effect on performance. One example is a query scanning through all
the documents of a database whose size exceeds the server memory. This leads to
the documents being loaded into memory while the working set is pushed out to
disk.
Ensuring you have defined the appropriate index coverage for your queries during
the schema design phase of the project will minimize the risk of this happening. The
MongoDB explain operation can be used to provide information on your query plan
and the indexes used.
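A brief sketch of using explain, assuming a hypothetical orders collection with a query on its status field: the executionStats output reports whether an index scan (IXSCAN) or a full collection scan (COLLSCAN) was used and how many documents were examined.
db.orders.find({ status: "shipped" }).explain("executionStats")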
MongoDB's serverStatus command returns a workingSet document that provides an
estimate of the instance's working set size. The operations team can track how many
pages the instance accessed over a given period of time and the elapsed time
between the working set's oldest and newest documents. By tracking these metrics,
it is possible to detect when the working set is approaching the current memory
limit, so proactive actions can be taken to ensure the system is scaled well enough to
handle it.
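A sketch of retrieving this estimate, assuming an MMAPv1-era mongod where the workingSet section is available (it must be requested explicitly and is not returned on newer storage engines):
// Returns pagesInMemory, computationTimeMicros, and overSeconds on supported versions
db.serverStatus({ workingSet: 1 }).workingSet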
https://www.youtube.com/watch?v=FK3cKKccM5E
7.3.1 Sharding Components
You will next look at the components that enable sharding in MongoDB. Sharding is
enabled in MongoDB via sharded clusters.
The following are the components of a sharded cluster:
• Shards
• mongos
• Config servers
The shard is the component where the actual data is stored. For the sharded cluster,
each shard holds a subset of the data and can be either a mongod instance or a
replica set. All the shards' data combined forms the complete data set of the sharded
cluster.
Sharding is enabled on a per-collection basis, so there might be collections that are
not sharded. Every sharded cluster has a primary shard, where all the unsharded
collections are placed in addition to the sharded collection data. When deploying a
sharded cluster, by default the first shard becomes the primary shard, although this is
configurable. See Figure 7-16.
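As a minimal sketch of these pieces from the mongos shell (the shard replica set name, host, and database name are hypothetical), a replica set can be added as a shard, and the primary shard of a database can be changed if needed:
// Add a replica set named shardrs0 as a shard
sh.addShard("shardrs0/ANOC9:27023")
// Move the unsharded collections of mydb to a different primary shard
db.adminCommand({ movePrimary: "mydb", to: "shardrs0" })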
Based on the number of shards available, the line will be divided into ranges, and
documents will be distributed based on them.
In this scheme of partitioning, shown in Figure 7-17, documents whose shard key
values are close to each other are likely to fall on the same shard. This can
significantly improve the performance of range queries.
7.3.2.1.3 Chunks
The data is moved between the shards in the form of chunks. The shard key range is
further partitioned into subranges, which are also termed chunks. See Figure 7-19.
Figure 7-19. Chunks
For a sharded cluster, 64MB is the default chunk size. In most situations, this is an
appropriate size for chunk splitting and migration.
Let’s discuss the execution of sharding and chunks with an example. Say you have a
blog posts collection which is sharded on the field date . This implies that the
collection will be split up on the basis of the date field values. Let’s assume further
that you have three shards. In this scenario the data might be distributed across
shards as follows:
Shard #1: Beginning of time up to July 2009
Shard #2: August 2009 to December 2009
Shard #3: January 2010 through the end of time
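A sketch of how such a collection might be set up (the blog database and posts collection names are assumptions): sharding is enabled for the database, and the collection is then range-sharded on date.
sh.enableSharding("blog")
sh.shardCollection("blog.posts", { date: 1 })
The actual chunk ranges and their distribution across the three shards are then managed by MongoDB itself.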
7.3.3.2 Balancer
The balancer is the background process used to ensure that all of the shards are
equally loaded, or in a balanced state. This process manages chunk migrations.
Splitting of a chunk can cause an imbalance, and the addition or removal of
documents can also lead to a cluster imbalance. When the cluster is imbalanced, the
balancer redistributes the data evenly.
When one shard has more chunks compared to the other shards, the chunk
balancing is done automatically by MongoDB across the shards. This process is
transparent to the application and to you.
Any of the mongos instances within the cluster can initiate the balancer process.
They do so by acquiring a lock on the config database of the config server, since
balancing involves the migration of chunks from one shard to another, which can
lead to a change in the metadata and hence a change in the config server's database.
Because the balancer process can have a huge impact on database performance, it
can either
1. Be configured to start the migration only when the migration threshold has been
reached. The migration threshold is the difference between the maximum and
minimum number of chunks on the shards. The thresholds are shown in Table 7-4.
2. Or be scheduled to run in a time period that will not impact the production
traffic, as sketched below.
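A minimal sketch of option 2 from the mongos shell (the window hours shown are only an example): the balancer is restricted to a nightly window, and its state can be checked at any time.
use config
db.settings.update(
    { _id: "balancer" },
    { $set: { activeWindow: { start: "23:00", stop: "06:00" } } },
    { upsert: true }
)
// Is the balancer enabled, and is a migration running right now?
sh.getBalancerState()
sh.isBalancerRunning()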
The balancer migrates one chunk at a time (see Figure 7-21 ) and follows these steps:
1. The moveChunk command is sent to the source shard.
2. An internal moveChunk command is started on the source, where it creates a
copy of the documents within the chunk and queues it. In the meantime, any
operations for that chunk are routed to the source by the mongos, because the
config database is not yet changed and the source is responsible for serving
any read/write requests on that chunk.
3. The destination shard starts receiving the copy of the data from the source.
4. Once all of the documents in the chunk have been received by the destination
shard, the synchronization process is initiated to ensure that all changes that
happened to the data during the migration are applied at the destination shard.
5. Once the synchronization is completed, the next step is to update the metadata
with the chunk’s new location in the config database. This activity is done by
the destination shard that connects to the config database and carries out the
necessary updates.
6. Upon successful completion of all of the above, the copy of the documents
maintained at the source shard is deleted.
7.3.4 Operations
The read and write operations are performed on the sharded cluster. As mentioned,
the config servers maintain the cluster metadata. This data is stored in the config
database. This data of the config database is used by the mongos to service the
application read and write requests.
The data is cached by the mongos instances, which is then used for routing write and
read operations to the shards. This way the config servers are not overburdened.
The mongos will only read from the config servers in the following scenarios:
• The mongos has started for the first time, or
• An existing mongos has restarted, or
• After a chunk migration, when the mongos needs to update its cached metadata
with the new cluster metadata.
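If a mongos ever needs to refresh its cached metadata on demand, for example after manual changes to the cluster, it can be instructed to do so (a sketch, run against the mongos):
// Drop the cached routing table; it is reloaded from the config servers on the next operation
db.adminCommand({ flushRouterConfig: 1 })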
2. Start the mongos. Enter the following command in a new terminal window (if it’s
not already running):
C:\>cd c:\practicalmongodb\bin
c:\practicalmongodb\bin>mongos --configdb localhost:27022 --port 27021
7.3.6.2 Tagging
By the end of the above steps, you have your sharded cluster with a config server,
three shards, and a mongos up and running. Next, bring up the mongos at port
30999 with configdb at 27022 in a new terminal window:
C:\>cd c:\practicalmongodb\bin
c:\practicalmongodb\bin>mongos --port 30999 --configdb localhost:27022
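With the cluster running, tagging can then be used to associate shards with tags and pin shard key ranges to them. The shard name, namespace, field, and range below are hypothetical and only illustrate the shape of the commands:
// Tag a shard and assign a shard key range to that tag
sh.addShardTag("shard0000", "US-EAST")
sh.addTagRange("mydb.users", { zip: "10001" }, { zip: "19999" }, "US-EAST")
The balancer then migrates chunks so that ranges tagged US-EAST end up on shards carrying that tag.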
We will look at the production cluster architecture. In order to understand it, let's
consider a very generic use case of a social networking application where users can
create a circle of friends and share their comments or pictures across the group. A
user can also comment on or like a friend's comments or pictures. The users are
geographically distributed. The application requires immediate availability across
geographies of all the comments; the data should be redundant so that users'
comments, posts, and pictures are not lost; and it should be highly available. So the
application's production cluster should have the following components:
1. At least two mongos instances, but you can have more as per need.
2. Three config servers, each on a separate system.
3. Two or more replica sets serving as shards. The replica sets are distributed across
geographies with the read preference set to nearest.
7.4.1 Scenario 1
Mongos becomes unavailable: The application server whose mongos has gone down
will not be able to communicate with the cluster, but this will not lead to any data
loss, since the mongos doesn't maintain any data of its own. The mongos can restart,
and while restarting, it can sync up with the config servers to cache the cluster
metadata, so the application can start its operations normally.
7.4.2 Scenario 2
One of the mongod processes of a replica set becomes unavailable in a shard: Since
you used replica sets to provide high availability, there is no data loss. If a primary
node goes down, a new primary is chosen, whereas if it's a secondary node, it is
disconnected and the functioning continues normally.
7.4.3 Scenario 3
If one of the shards becomes unavailable: In this scenario, the data on that shard will
be unavailable, but the other shards will be available, so it won't stop the application.
The application can continue with its read/write operations; however, the partial
results must be dealt with within the application. In parallel, the shard should be
recovered as soon as possible.
7.4.4 Scenario 4
Only one config server is available out of three: In this scenario, although the cluster
becomes read-only, it will not serve any operations that might lead to changes in the
cluster structure and thereby to a change of metadata, such as chunk migration or
chunk splitting. The config servers should be replaced ASAP, because if all of the
config servers become unavailable, the cluster will become inoperable.
Questions:
4) After starting the mongo shell, your session will use the ________ database by default.
A. mongo
B. master
C. test
D. primary
6) The mongo shell and the MongoDB drivers use __________ as the default write concern.
A. Unacknowledged
B. Acknowledgement
C. Acknowledged
D. All of the mentioned
7) _____________ can be used for batch processing of data and aggregation operations.
A. Hive
B. MapReduce
C. Oozie
D. None of the mentioned
9) The _________ property for an index causes MongoDB to reject duplicate values for the
indexed field.
A. Hashed
B. Unique
C. Multikey
D. None of the mentioned
10) Normalized data models describe relationships using ___________ between documents.
A. relativeness
B. references
C. evaluation
D. None of the mentioned