BDTT Lab 2023 24 Week9
Table of Contents
Part 1: Fire up the Atlas workspace
References
Part 1: Fire up the Atlas workspace
2- Make sure your current IP address is listed as a trusted IP address. If it is not, go to Network
Access and add your local IP address.
4- Click on Browse Collections. Now you can see the list of databases on the left, and the content
of a selected collection on the right.
Part-2: MongoDB
MongoDB is a cross-platform, document-oriented NoSQL database. It is designed to store and manage
unstructured or semi-structured data. Unlike traditional relational databases, MongoDB uses a flexible
document model, which allows developers to store and query data in a more intuitive and natural way.
Data in MongoDB is stored in documents, which are similar to JSON objects and can contain any number
of fields, arrays, and sub-documents. This makes it easy to store complex data structures and to modify
them as requirements change.
MongoDB stores data records as documents (specifically BSON documents) which are gathered in
collections. You can create secondary indexes on these collections, join them together, and use the
powerful aggregation framework embedded in MongoDB. A database stores one or more collections of
documents.
In this workshop we mostly read data from MongoDB. To select all documents in a collection, pass an
empty document as the query filter parameter to the find method. The query filter parameter
determines the selection criteria:
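As a minimal sketch, this is how the call might look from Python with pymongo, the driver used later in this lab. The helper name select_all is illustrative, and collection is assumed to be a pymongo Collection object such as db["inventory"].

```python
# Sketch: select every document by passing an empty filter to find().
# `collection` is assumed to be a pymongo Collection (e.g. db["inventory"]).

ALL_DOCUMENTS = {}  # an empty filter document matches every document


def select_all(collection):
    """Return a cursor over all documents in the collection."""
    return collection.find(ALL_DOCUMENTS)
```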
This operation uses a filter predicate of {}, which corresponds to the following SQL statement:
    SELECT * FROM inventory
To specify equality conditions, use <field>: <value> expressions in the query filter document:
    { <field1>: <value1>, ... }
The following example selects from the inventory collection all documents where the status equals
"D":
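A hedged Python version of this equality query is sketched below. The helper name select_by_status is hypothetical, and collection is assumed to be a pymongo Collection for inventory.

```python
# Sketch: an equality filter is a dictionary of field: value pairs.
# `collection` is assumed to be a pymongo Collection for "inventory".

STATUS_D = {"status": "D"}  # matches documents whose status equals "D"


def select_by_status(collection, status="D"):
    """Return a cursor over documents whose status equals `status`."""
    return collection.find({"status": status})
```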
This operation uses a filter predicate of { status: "D" }, which corresponds to the following SQL
statement:
    SELECT * FROM inventory WHERE status = "D"
Part-3: Working with Atlas
1- On Atlas, expand the sample_mflix database.
4- Search for those movies that have won more than one award.
Note: To access a field inside of a nested document, you can use the dot operator.
Note: To use comparison operators such as greater than, less than, or greater than or equal to, pass
a nested document that pairs the operator with the intended value. For example, to check whether the
value of a field is less than or equal to 5, you can use {"field_name": {$lte: 5}}
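The two notes can be combined in one filter. The sketch below writes it as a Python dictionary, as it would be passed to pymongo's find(). It assumes the award count is stored in a nested awards.wins field, which is worth confirming in Browse Collections; the helper name is illustrative.

```python
# Sketch: dot notation reaches into a nested document, and a comparison
# operator is given as {operator: value}.
# Assumption: movies documents look like {"awards": {"wins": 3, ...}, ...}.

MORE_THAN_ONE_AWARD = {"awards.wins": {"$gt": 1}}


def find_award_winners(movies):
    """Return a cursor over movies with more than one award win."""
    return movies.find(MORE_THAN_ONE_AWARD)
```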
5- Find those movies that have won more than one award and that have the USA as their country.
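Several conditions placed in one filter document are combined with an implicit AND. The sketch below assumes the country is stored in a countries array field and the award count in awards.wins; both names should be checked against the collection.

```python
# Sketch: two conditions in one filter are ANDed together.
# Assumptions: "awards.wins" holds the number of wins, "countries" is an array.
# Matching an array field against a single value succeeds when the array
# contains that value.

AWARDED_US_MOVIES = {
    "awards.wins": {"$gt": 1},  # more than one award
    "countries": "USA",         # "USA" appears in the countries array
}


def find_awarded_us_movies(movies):
    """Return a cursor over US movies with more than one award win."""
    return movies.find(AWARDED_US_MOVIES)
```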
6- You can also specify a projection list to just show the information needed. Find those movies
whose genre is “Short” and just project the title:
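In Python, the projection would be passed as the second argument to find(). This is only a sketch: the genres field name is an assumption about the sample data, and the helper name is hypothetical.

```python
# Sketch: a filter plus a projection list.
# Assumption: the genre values live in an array field named "genres".

SHORT_FILTER = {"genres": "Short"}
TITLE_ONLY = {"title": 1, "_id": 0}  # show title, hide the default _id


def find_short_titles(movies):
    """Return a cursor over the titles of movies in the Short genre."""
    return movies.find(SHORT_FILTER, TITLE_ONLY)
```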
Challenge-1
Go to the sample_restaurants database and, in the restaurants collection, find those restaurants that
have a grade score greater than or equal to 10 and that serve American cuisine.
Part-4: Aggregation Framework
The Aggregation Framework in MongoDB is a powerful data processing tool that allows you to perform
complex data analysis on collections of documents in a database. It provides a set of operators that can
be used to perform data filtering, grouping, sorting, and data transformations. With the Aggregation
Framework, you can combine data from multiple collections, perform calculations on data, and analyse
data in real-time. This makes it a very powerful tool for business intelligence and data analysis
applications.
• Pipelined data processing: The aggregation framework allows you to combine multiple operators
into a single pipeline, where the output of one operator is the input to the next operator. This
makes it easy to perform complex data transformations and analysis.
• Extensive operator set: The Aggregation Framework provides a wide range of operators for data
filtering, grouping, sorting, and transformations. This includes operators for conditional logic,
arithmetic operations, string manipulation, date manipulation, and more.
• Integration with MongoDB: The Aggregation Framework is tightly integrated with MongoDB,
which means that it can take advantage of MongoDB's scalability, replication, and sharding
features.
The MongoDB Aggregation Framework includes several stages that can be used to perform various data
processing operations. The stages are applied in a pipeline, where the output of one stage becomes the
input of the next stage. Commonly used stages include:
• $match: This stage is used to filter documents based on certain criteria. It works like a query filter
and can use various comparison operators to filter documents.
• $project: This stage is used to select certain fields from documents and project them in the output.
It can also be used to create new fields or transform existing fields.
• $group: This stage is used to group documents by a specified field or fields. It can also perform
various aggregate functions such as sum, average, and count on the grouped data.
• $sort: This stage is used to sort the output documents based on one or more fields. It can sort in
ascending or descending order.
• $limit: This stage is used to limit the number of documents returned in the output.
• $skip: This stage is used to skip a specified number of documents in the input before processing.
• $unwind: This stage is used to break up an array field into separate documents, each containing
a single value from the array.
• $lookup: This stage is used to perform a left outer join between two collections.
• $facet: This stage is used to perform multiple aggregation operations on the same set of input
documents. It returns multiple sets of documents, each representing the result of a separate
aggregation operation.
These stages can be combined in different ways to perform a wide range of data processing operations
on MongoDB collections.
1- Click on the movies collection and select the Aggregation tab. Then click on “create new”.
3- From the stages drop down list select $match and specify directors as “Sam Raimi”.
4- Add another stage and select its type as $project. Then exclude _id and include title and
imdb.rating in the output.
5- Add another stage and select $group. The objective is to calculate the average imdb rating for
those movies that are directed by “Sam Raimi”.
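Steps 3 to 5 can be sketched as a Python list of stages, in the form pymongo's aggregate() accepts. The output field name avg_rating and the helper name are illustrative choices, not something the Atlas UI prescribes.

```python
# Sketch of the three-stage pipeline: filter, trim fields, then average.
# Assumption: the rating is stored in the nested field imdb.rating.

SAM_RAIMI_PIPELINE = [
    # Keep only movies whose directors array contains "Sam Raimi".
    {"$match": {"directors": "Sam Raimi"}},
    # Drop _id and keep just the title and the IMDb rating.
    {"$project": {"_id": 0, "title": 1, "imdb.rating": 1}},
    # _id: None puts every remaining document into a single group.
    {"$group": {"_id": None, "avg_rating": {"$avg": "$imdb.rating"}}},
]


def average_rating(movies):
    """Run the pipeline on a pymongo Collection and return the result list."""
    return list(movies.aggregate(SAM_RAIMI_PIPELINE))
```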
These are the stages of your pipeline.
6- By clicking on “EXPORT TO LANGUAGE”, you can export the pipeline into other programming
languages.
You can join collections through their shared keys by using the $lookup stage. The activity is to
count the number of comments for each movie. The data lives in two collections, movies and comments.
You need to join them by matching movie_id in the comments collection with _id in the movies
collection. We only want to take 1980s movies into account.
8- In the current stage select $match and filter year between 1980 and 1990.
9- Add a new stage and select $lookup as its type.
Note: from names the collection that you want to join to. let binds the _id of the movies collection
to a temporary variable, id, and makes it accessible inside the pipeline (as $$id). In pipeline you
can define any aggregation stages, but you must match the primary and foreign keys there. Finally, as
defines the name of the output field that holds the joined documents.
10- Add the $count stage inside of the $lookup pipeline. As it is a pipeline you don’t need to add any
other stage.
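Steps 8 to 10 might be written as the following Python pipeline. This is a sketch under stated assumptions: "between 1980 and 1990" is read as 1980 up to but not including 1990, and the output names movie_comments and count are illustrative.

```python
# Sketch: count the comments attached to each 1980s movie.
# Assumptions: year is numeric; the range is 1980 <= year < 1990;
# "movie_comments" and "count" are made-up output names.

COMMENTS_PER_MOVIE = [
    {"$match": {"year": {"$gte": 1980, "$lt": 1990}}},
    {"$lookup": {
        "from": "comments",        # collection to join to
        "let": {"id": "$_id"},     # expose the movie's _id as $$id
        "pipeline": [
            # Join condition: comment.movie_id == movie._id
            {"$match": {"$expr": {"$eq": ["$movie_id", "$$id"]}}},
            # Collapse the matching comments into one {"count": n} document
            {"$count": "count"},
        ],
        "as": "movie_comments",    # array field added to each movie
    }},
]


def comments_per_movie(movies):
    """Run the pipeline on a pymongo Collection for movies."""
    return movies.aggregate(COMMENTS_PER_MOVIE)
```

A movie without comments ends up with an empty movie_comments array, because $count emits nothing when it receives no documents.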
Challenge-2
Calculate the average number of tomatoes viewer reviews for those movies that have a production year
after 1920, have English as their language, have a tomatoes viewer rating greater than 3.5, and have
mflix comments.
NB: You only need to carry out the Anaconda installation if you are using your own device and have
not previously installed Anaconda. If you have previously installed Anaconda or are using a
university device, please skip to step 3, where pymongo is installed.
If you want to install Anaconda on another operating system, click on “Get Additional
Installer”. This will take you to the bottom of the page, where you can find other versions of the
Anaconda installer. Right-click on the option that works for you and click on “save link as” to
download the installer.
Go to your Downloads folder and double-click the installer to launch it. (If you encounter issues
during installation, temporarily disable your anti-virus software, then re-enable it after the
installation.)
Click on the next button.
Select a destination folder to install Anaconda. You can use the Browse button to change the
location (the directory path should not contain spaces or Unicode characters). After choosing the
install location, click on the Next button.
In the next step, check “Register Anaconda3 as my default Python 3.9” and click on the Install button.
Please wait while Anaconda3 is being installed. It will take a few minutes. Then the Next button will
be enabled.
In the next dialog box, click on the Next button. Finally, you should see the “Completing Anaconda3
Setup” dialog box. Click the Finish button to complete the installation. (If you wish to read more
about Anaconda.org and how to get started with Anaconda, check the boxes “Anaconda Distribution
Tutorial” and “Getting started with Anaconda”.)
Now open Anaconda and carry on with the following tasks.
1- Open Anaconda and launch Jupyter Notebook
3- pymongo is the library required to work with MongoDB in Python. Install it from your Jupyter
notebook (for example, by running %pip install pymongo in a cell). You need to do this just once.
Note: The MongoClient class is part of pymongo. To instantiate it, you pass a URL that contains most
of the information required to access MongoDB Atlas. This URL is your connection string, so first you
should collect it from Atlas.
4- Go to Atlas, select your database and click on the Connect button.
6- Select Python and a version of 3.6 or later. Then copy the generated connection string. You
have to replace <password> with your own password and check “include full driver code example”.
7- Go back to Jupyter Notebook, import pymongo, define your url based on step 6 and
instantiate a client.
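A minimal sketch of this step follows. The user name, password and host shown in the usage note are placeholders, not real values; use the connection string copied from Atlas. The helper names build_uri and make_client are hypothetical.

```python
# Sketch: build a connection string and create a client from it.
# pymongo is imported inside make_client so the URI helper works on its own.
from urllib.parse import quote_plus


def build_uri(user, password, host):
    """Assemble a mongodb+srv URI, percent-encoding the credentials."""
    return f"mongodb+srv://{quote_plus(user)}:{quote_plus(password)}@{host}/"


def make_client(uri):
    """Instantiate a MongoClient (requires pymongo to be installed)."""
    from pymongo import MongoClient
    return MongoClient(uri)
```

For example, client = make_client(build_uri("<username>", "<password>", "<your-cluster-host>")). Percent-encoding matters when the password contains characters such as @ or /.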
8- We can list the databases connected to this client object through the following command:
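One way to do this, assuming client is the MongoClient created in the previous step, is sketched here. The wrapper name is illustrative.

```python
# Sketch: list the databases visible to a client.
# `client` is assumed to be a pymongo MongoClient.

def list_databases(client):
    """Return the database names as a list of strings."""
    return client.list_database_names()
```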
11- Select movies collection and define a new object upon that collection.
12- Count the number of documents in this collection.
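Steps 11 and 12 could look like the sketch below. It assumes the database is sample_mflix, as used earlier in the lab, and that client is a pymongo MongoClient; the helper names are hypothetical.

```python
# Sketch: pick a collection and count its documents.
# Databases and collections are reached with dictionary-style indexing.

def get_movies(client):
    """Return the movies collection of the sample_mflix database."""
    db = client["sample_mflix"]
    return db["movies"]


def count_all(collection):
    """Count every document; count_documents needs a filter, so pass {}."""
    return collection.count_documents({})
```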
Note: There are two methods to read from a collection: find_one() and find(). find_one() returns the
first document, in natural order, that satisfies the defined condition(s).
13- Find a movie.
14- Find a movie with Salma Hayek in the cast. To do so, pass a dictionary to the find_one method
containing a field name and the associated condition.
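Steps 13 and 14 might be sketched as follows, assuming movies is the collection object from step 11 and that actors are stored in an array field named cast. The helper names are illustrative.

```python
# Sketch: find_one() with and without a filter.
# Assumption: actor names are stored in an array field called "cast".

def first_movie(movies):
    """Return the first document in natural order (or None if empty)."""
    return movies.find_one()


def first_movie_with(movies, actor="Salma Hayek"):
    """Return the first movie whose cast array contains `actor`."""
    return movies.find_one({"cast": actor})
```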
Note: Most of the time, find_one is not what we want to use, as we typically want to find all
documents satisfying a condition. In that case we use the find() method. find() does not return the
documents directly; instead it returns a cursor object. We can store the cursor in a variable and
dump it to JSON format, which lets us read the documents inside the cursor. The dump function comes
from the JSON library and gives us nicely formatted output.
15- Find all movies with Salma Hayek in the cast and print them.
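A hedged sketch of this step is shown below. It uses the standard library's json.dumps with default=str, which is an assumption on my part rather than the lab's exact code; bson.json_util.dumps, which ships with pymongo, is an alternative that can serialise a cursor directly.

```python
# Sketch: read every matching document from a cursor and pretty-print it.
import json


def movies_with(movies, actor="Salma Hayek"):
    """Return all movies whose cast contains `actor` as a list of dicts."""
    cursor = movies.find({"cast": actor})
    return list(cursor)  # iterating the cursor fetches the documents


def to_pretty_json(documents):
    """Format documents as indented JSON text.

    default=str converts values json cannot encode (ObjectId, datetime)
    into plain strings.
    """
    return json.dumps(documents, indent=2, default=str)
```

Usage might be print(to_pretty_json(movies_with(movies))).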
16- What if you do not need all the fields? In this situation you need to specify a projection list.
The second dictionary passed to find() is the projection list, containing each field name and either
a 1 (meaning that you want to show it) or a 0 (meaning that you want to omit it). Find the movies
with Salma Hayek in the cast and print just their titles.
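As a sketch, the projection could be passed like this. Excluding _id is an optional choice made here for tidier output; the helper name is hypothetical.

```python
# Sketch: find() with a projection as the second dictionary.

CAST_FILTER = {"cast": "Salma Hayek"}
TITLE_PROJECTION = {"title": 1, "_id": 0}  # 1 = show, 0 = omit


def titles_with_actor(movies):
    """Return a cursor over documents that carry only the title field."""
    return movies.find(CAST_FILTER, TITLE_PROJECTION)
```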
17- You can also limit the number of documents returned by pymongo through limit(). Show just
two documents matching the above conditions.
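limit() is chained onto the cursor that find() returns. A minimal sketch, with an illustrative helper name:

```python
# Sketch: cap the number of returned documents with limit().

def first_two_titles(movies, actor="Salma Hayek"):
    """Return a cursor over at most two titles for the given actor."""
    cursor = movies.find({"cast": actor}, {"title": 1, "_id": 0})
    return cursor.limit(2)
```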
18- You can express the above command through the aggregation framework.
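One possible translation is sketched below: each part of the find() call becomes its own stage. The helper name is hypothetical.

```python
# Sketch: filter, projection and limit rewritten as pipeline stages.

TWO_TITLES_PIPELINE = [
    {"$match": {"cast": "Salma Hayek"}},     # the query filter
    {"$project": {"_id": 0, "title": 1}},    # the projection list
    {"$limit": 2},                           # the limit() call
]


def first_two_titles_agg(movies):
    """Run the pipeline on a pymongo Collection and return a list."""
    return list(movies.aggregate(TWO_TITLES_PIPELINE))
```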
19- The next operator is sorting. The sort() method takes two parameters: the key to sort on and the
sorting order, ASCENDING or DESCENDING. Sort the movies with Salma Hayek in the cast by their
production year in ascending order.
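A sketch of the sort follows. To keep it free of imports, the two constants are defined locally with the same values as pymongo.ASCENDING and pymongo.DESCENDING; in a notebook you would normally import them from pymongo.

```python
# Sketch: sort a cursor by production year.

ASCENDING = 1     # same value as pymongo.ASCENDING
DESCENDING = -1   # same value as pymongo.DESCENDING


def movies_by_year(movies, actor="Salma Hayek", order=ASCENDING):
    """Return a cursor over the actor's movies sorted by the year field."""
    return movies.find({"cast": actor}).sort("year", order)
```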
Note: Aggregation is a pipeline. Pipelines are composed of stages, which are broad units of work.
Within stages, expressions are used to specify individual units of work. Expressions are functions.
Each stage is like an assembly station and does a specific task. For example, $match checks a
specific condition and selects the documents that fulfil it, $project filters out unnecessary fields,
and $group collects documents together.
Let’s look at a function called add in Python and in the Aggregation Framework.
Python
Aggregation Framework
{"$add": ["$a", "$b"]}; all stage and operator names in the aggregation framework begin with $.
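The comparison can be sketched side by side. The Python function below is an illustrative reconstruction, and the fields a and b and the output name total are assumed for the example.

```python
# Sketch: the same addition as a Python function and as an expression.

def add(a, b):
    """Plain Python: arguments are passed in, the sum is returned."""
    return a + b


# Aggregation: "$a" and "$b" refer to fields of the current document,
# and the expression is evaluated inside a stage such as $project.
ADD_STAGE = {"$project": {"total": {"$add": ["$a", "$b"]}}}
```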
20- Count the number of movies directed by Sam Raimi
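Two possible approaches are sketched here: a direct count, and an equivalent pipeline ending in $count. The output field name movie_count and the helper name are illustrative.

```python
# Sketch: count matching documents directly or through a pipeline.

def count_by_director(movies, director="Sam Raimi"):
    """Count movies whose directors array contains `director`."""
    return movies.count_documents({"directors": director})


COUNT_PIPELINE = [
    {"$match": {"directors": "Sam Raimi"}},
    {"$count": "movie_count"},  # yields one document: {"movie_count": n}
]
```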
21- Run the exported pipeline of the first aggregation in Part-4 in Python.
Challenge 3
Implement the pipeline designed for challenge-2 in Python.
References:
• MongoDB University. MongoDB offers a range of online courses and certifications that
cover various aspects of MongoDB development, administration, and deployment.
These courses are self-paced and are designed to help you learn at your own pace
(https://learn.mongodb.com/).
• https://www.mongodb.com/docs/manual/tutorial/query-documents/