031 Data Wrangling With Mongodb
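from pprint import PrettyPrinter
import pandas as pd
from IPython.display import VimeoVideo
from pymongo import MongoClient
pp = PrettyPrinter(indent=2)  # assumed setup for the pp.pprint calls below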
1 Prepare Data
1.1 Connect
[6]: VimeoVideo("665412155", h="1ca0dd03d0", width=600)
Task 3.1.2: Create a client that connects to the database running at localhost on port 27017.
• What’s a database client?
• What’s a database server?
• Create a client object for a MongoDB instance.
[7]: client = MongoClient(host="localhost", port=27017)
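Creating a `MongoClient` does not by itself prove that the server is reachable, because PyMongo connects lazily. The sketch below shows one way to check. The helper names `mongo_uri` and `connect_and_ping` and the 2-second timeout are assumptions made for illustration, not part of the lesson.

```python
def mongo_uri(host="localhost", port=27017):
    """Build a connection string equivalent to MongoClient(host=..., port=...)."""
    return f"mongodb://{host}:{port}"


def connect_and_ping(uri, timeout_ms=2000):
    """Return a client after confirming that the server answers a ping.

    Hypothetical helper: it needs PyMongo installed and a MongoDB server running.
    """
    from pymongo import MongoClient  # imported here so mongo_uri works without PyMongo

    client = MongoClient(uri, serverSelectionTimeoutMS=timeout_ms)
    client.admin.command("ping")  # raises ServerSelectionTimeoutError if unreachable
    return client


print(mongo_uri())
```

With a server running, `client = connect_and_ping(mongo_uri())` would give a client equivalent to the one created above.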
1.2 Explore
[8]: VimeoVideo("665412176", h="6fea7c6346", width=600)
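from sys import getsizeof  # needed by the getsizeof calls below

# Assumes my_list (a list) and my_range (a range object) already exist.
# for i in my_list:
#     print(i)
# for i in my_range:
#     print(i)
print(getsizeof(my_list))
print(getsizeof(my_range))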
152
48
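The size difference above comes from how the two objects store their values. Here is a minimal sketch of the same comparison, using a made-up length of 1,000 (not the values from the cell above). Exact byte counts depend on the Python build, so only the comparison is printed.

```python
from sys import getsizeof

my_range = range(1_000)   # lazy sequence: stores only start, stop, and step
my_list = list(my_range)  # eager: holds a reference to every element

range_size = getsizeof(my_range)
list_size = getsizeof(my_list)
print(range_size < list_size)

# An iterator hands out one item per next() call and remembers its position.
it = iter(my_range)
first, second = next(it), next(it)
print(first, second)
```

The same idea applies to `client.list_databases()`, which returns a cursor you iterate over rather than a ready-made list.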
Task 3.1.3: Print a list of the databases available on client.
• What’s an iterator?
• List the databases of a server using PyMongo.
• Print output using pprint.
[10]: db_list = list(client.list_databases())
#print(getsizeof(db_list))
pp.pprint(db_list)
Task 3.1.5: Use the list_collections method to print a list of the collections available in db.
• What’s a MongoDB collection?
• List the collections in a database using PyMongo.
[14]: #list(db.list_collections())[0]
for c in db.list_collections():
    print(c["name"])
lagos
system.buckets.lagos
nairobi
system.buckets.nairobi
system.views
dar-es-salaam
system.buckets.dar-es-salaam
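Only three of the names above are city collections. As far as I know, the `system.buckets.*` entries are MongoDB's internal storage for time series collections, and `system.views` holds view definitions. Here is a small sketch for filtering them out; the helper name `user_collections` is an assumption for illustration.

```python
names = [
    "lagos", "system.buckets.lagos", "nairobi", "system.buckets.nairobi",
    "system.views", "dar-es-salaam", "system.buckets.dar-es-salaam",
]


def user_collections(collection_names):
    """Drop MongoDB's internal "system." collections, keeping the original order."""
    return [name for name in collection_names if not name.startswith("system.")]


city_names = user_collections(names)
print(city_names)
```

PyMongo also offers `db.list_collection_names()`, which returns the names directly instead of full collection descriptions.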
Task 3.1.6: Assign the "nairobi" collection in db to the variable name nairobi.
• Access a collection in a database using PyMongo.
[16]: nairobi = db["nairobi"]
Task 3.1.7: Use the count_documents method to see how many documents are in the nairobi
collection.
• What’s a MongoDB document?
• Count the documents in a collection using PyMongo.
[18]: nairobi.count_documents({})
[18]: 202212
Task 3.1.8: Use the find_one method to retrieve one document from the nairobi collection, and
assign it to the variable name result.
• What’s metadata?
• What’s semi-structured data?
• Retrieve a document from a collection using PyMongo.
[20]: result = nairobi.find_one({})
pp.pprint(result)
{ 'P1': 39.67,
  '_id': ObjectId('6261a046e76424a61615daaf'),
  'metadata': { 'lat': -1.3,
                'lon': 36.785,
                'measurement': 'P1',
                'sensor_id': 57,
                'sensor_type': 'SDS011',
                'site': 29},
  'timestamp': datetime.datetime(2018, 9, 1, 0, 0, 2, 472000)}
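The document above is nested: `site` lives inside `metadata`. MongoDB queries reach nested values with dotted keys such as `"metadata.site"`. The sketch below mimics that lookup on a plain dictionary. The helper `get_path` is hypothetical, and `_id` is left out because `ObjectId` needs the `bson` package.

```python
from datetime import datetime

doc = {
    "P1": 39.67,
    "metadata": {"lat": -1.3, "lon": 36.785, "measurement": "P1",
                 "sensor_id": 57, "sensor_type": "SDS011", "site": 29},
    "timestamp": datetime(2018, 9, 1, 0, 0, 2, 472000),
}


def get_path(document, dotted_key):
    """Follow a dotted key such as "metadata.site" through nested dicts."""
    value = document
    for part in dotted_key.split("."):
        value = value[part]
    return value


site = get_path(doc, "metadata.site")
print(site)
```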
Task 3.1.9: Use the distinct method to determine how many sensor sites are included in the
nairobi collection.
• Get a list of distinct values for a key among all documents using PyMongo.
[22]: nairobi.distinct("metadata.site")
[22]: [29, 6]
Task 3.1.10: Use the count_documents method to determine how many readings there are for
each site in the nairobi collection.
• Count the documents in a collection using PyMongo.
[24]: print("Documents from site 6:", nairobi.count_documents({"metadata.site":6}))
print("Documents from site 29:", nairobi.count_documents({"metadata.site":29}))
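The two calls above hard-code the site numbers. One possible alternative, sketched here under the assumption that `collection` behaves like a PyMongo collection, is to loop over the result of `distinct`. The function name `count_by_site` is made up for this example.

```python
def count_by_site(collection):
    """Return {site: number of documents} for every distinct site.

    `collection` is expected to be a PyMongo collection such as `nairobi`.
    """
    return {
        site: collection.count_documents({"metadata.site": site})
        for site in collection.distinct("metadata.site")
    }
```

Calling `count_by_site(nairobi)` would then report a count for each site without naming the sites in advance.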
Task 3.1.11: Use the aggregate method to determine how many readings there are for each site
in the nairobi collection.
• Perform aggregation calculations on documents using PyMongo.
[26]: result = nairobi.aggregate(
    [
        {"$group": {"_id": "$metadata.site", "count": {"$count": {}}}}
    ]
)
pp.pprint(list(result))
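To see what the `$group` stage computes, here is a rough in-memory equivalent using `collections.Counter`. The three sample documents are made up and contain only the field used for grouping. The last line shows a `$sum`-based stage that, to my knowledge, gives the same result on servers older than MongoDB 5.0, where the `$count` accumulator is not available.

```python
from collections import Counter

# Made-up sample documents: only the field used for grouping is included.
docs = [
    {"metadata": {"site": 29}},
    {"metadata": {"site": 6}},
    {"metadata": {"site": 29}},
]

# What {"$group": {"_id": "$metadata.site", "count": {"$count": {}}}} computes:
counts = Counter(d["metadata"]["site"] for d in docs)
grouped = [{"_id": site, "count": n} for site, n in counts.items()]
print(grouped)

# Alternative stage that adds 1 per document instead of using $count.
group_with_sum = {"$group": {"_id": "$metadata.site", "count": {"$sum": 1}}}
```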
Task 3.1.12: Use the distinct method to determine how many types of measurements have been
taken in the nairobi collection.
• Get a list of distinct values for a key among all documents using PyMongo.
[28]: nairobi.distinct("metadata.measurement")
Task 3.1.13: Use the find method to retrieve the PM 2.5 readings from all sites. Be sure to limit
your results to 3 records only.
• Query a collection using PyMongo.
[30]: result = nairobi.find({"metadata.measurement": "P2"}).limit(3)
pp.pprint(list(result))
[ { 'P2': 34.43,
    '_id': ObjectId('6261a046e76424a616165b3a'),
    'metadata': { 'lat': -1.3,
                  'lon': 36.785,
                  'measurement': 'P2',
                  'sensor_id': 57,
                  'sensor_type': 'SDS011',
                  'site': 29},
    'timestamp': datetime.datetime(2018, 9, 1, 0, 0, 2, 472000)},
  { 'P2': 30.53,
    '_id': ObjectId('6261a046e76424a616165b3b'),
    'metadata': { 'lat': -1.3,
                  'lon': 36.785,
                  'measurement': 'P2',
                  'sensor_id': 57,
                  'sensor_type': 'SDS011',
                  'site': 29},
    'timestamp': datetime.datetime(2018, 9, 1, 0, 5, 3, 941000)},
  { 'P2': 22.8,
    '_id': ObjectId('6261a046e76424a616165b3c'),
    'metadata': { 'lat': -1.3,
                  'lon': 36.785,
                  'measurement': 'P2',
                  'sensor_id': 57,
                  'sensor_type': 'SDS011',
                  'site': 29},
    'timestamp': datetime.datetime(2018, 9, 1, 0, 10, 4, 374000)}]
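A `find` query followed by `limit` can be pictured as a lazy filter that stops early. The sketch below imitates that with a generator and `itertools.islice`, using the task's limit of 3. The sample documents are trimmed, made-up stand-ins, and the local `find` function is only an illustration, not PyMongo's method.

```python
from itertools import islice

# Trimmed, made-up stand-ins for documents in the collection.
docs = [
    {"P1": 39.67, "metadata": {"measurement": "P1", "site": 29}},
    {"P2": 34.43, "metadata": {"measurement": "P2", "site": 29}},
    {"P2": 30.53, "metadata": {"measurement": "P2", "site": 29}},
    {"P2": 22.8, "metadata": {"measurement": "P2", "site": 29}},
    {"P2": 13.3, "metadata": {"measurement": "P2", "site": 29}},
]


def find(documents, measurement):
    """Lazily yield documents whose metadata.measurement equals `measurement`."""
    return (d for d in documents if d["metadata"]["measurement"] == measurement)


first_three = list(islice(find(docs, "P2"), 3))  # stop after three matches
print([d["P2"] for d in first_three])
```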
Task 3.1.14: Use the aggregate method to calculate how many readings there are for each type
("humidity", "temperature", "P2", and "P1") in site 6.
• Perform aggregation calculations on documents using PyMongo.
[32]: result = nairobi.aggregate(
    [
        {"$match": {"metadata.site": 6}},
        {"$group": {"_id": "$metadata.measurement", "count": {"$count": {}}}}
    ]
)
pp.pprint(list(result))
Task 3.1.15: Use the aggregate method to calculate how many readings there are for each type
("humidity", "temperature", "P2", and "P1") in site 29.
• Perform aggregation calculations on documents using PyMongo.
[34]: result = nairobi.aggregate(
    [
        {"$match": {"metadata.site": 29}},
        {"$group": {"_id": "$metadata.measurement", "count": {"$count": {}}}}
    ]
)
pp.pprint(list(result))
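The pipelines for site 6 and site 29 differ only in the site number. A small builder function, sketched here under the hypothetical name `readings_by_type_pipeline`, keeps the two stages in one place.

```python
def readings_by_type_pipeline(site):
    """Build the $match + $group pipeline used above for a given site number."""
    return [
        {"$match": {"metadata.site": site}},
        {"$group": {"_id": "$metadata.measurement", "count": {"$count": {}}}},
    ]


pipelines = {site: readings_by_type_pipeline(site) for site in (6, 29)}
print(pipelines[6][0])
```

With a live collection, `nairobi.aggregate(readings_by_type_pipeline(29))` would run the same query as the cell above.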
1.3 Import
[35]: VimeoVideo("665412437", h="7a436c7e7e", width=600)
Task 3.1.16: Use the find method to retrieve the PM 2.5 readings from site 29. Be sure to limit
your results to 3 records only. Since we won’t need the metadata for our model, use the projection
argument to limit the results to the "P2" and "timestamp" keys only.
• Query a collection using PyMongo.
[42]: result = nairobi.find(
    {"metadata.site": 29, "metadata.measurement": "P2"},
    projection={"P2": 1, "timestamp": 1, "_id": 0},
)
# No .limit() is applied, so the cursor yields every matching record.
#pp.pprint(result.next())
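An inclusion projection keeps only the listed keys. MongoDB returns `_id` by default, which is why the query sets `"_id": 0` explicitly. The sketch below imitates the effect on a plain dictionary. The `project` helper is hypothetical, and the `_id` value is a string stand-in for an `ObjectId`.

```python
from datetime import datetime


def project(document, keys):
    """Keep only the listed top-level keys, like an inclusion projection with "_id": 0."""
    return {key: document[key] for key in keys if key in document}


doc = {
    "_id": "6261a046e76424a616165b3a",  # string stand-in for an ObjectId
    "P2": 34.43,
    "metadata": {"measurement": "P2", "site": 29},
    "timestamp": datetime(2018, 9, 1, 0, 0, 2, 472000),
}

slim = project(doc, ["P2", "timestamp"])
print(sorted(slim))
```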
Task 3.1.17: Read records from your result into the DataFrame df. Be sure to set the index to
"timestamp".
• Create a DataFrame from a dictionary using pandas.
[43]: df = pd.DataFrame(result).set_index("timestamp")
df.head()
[43]:                             P2
      timestamp
      2018-09-01 00:00:02.472  34.43
      2018-09-01 00:05:03.941  30.53
      2018-09-01 00:10:04.374  22.80
      2018-09-01 00:15:04.245  13.30
      2018-09-01 00:20:04.869  16.57
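One thing to watch when building a DataFrame from a query result: a cursor can be consumed only once. If the commented-out `result.next()` line had been run first, that record would be missing from `df`. The sketch below uses a generator as a stand-in for the cursor; `fake_cursor` and its three records are made up for illustration.

```python
def fake_cursor():
    """Stand-in for a PyMongo cursor: a generator that can be consumed once."""
    yield {"P2": 34.43}
    yield {"P2": 30.53}
    yield {"P2": 22.8}


result = fake_cursor()
peeked = next(result)       # like calling result.next() before building the DataFrame
remaining = list(result)    # what a DataFrame built afterwards would receive
second_pass = list(result)  # nothing is left once the cursor is exhausted
print(len(remaining), len(second_pass))
```

Re-running the `find` call gives a fresh cursor whenever the full set of records is needed again.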
Copyright © 2022 WorldQuant University. This content is licensed solely for personal use.
Redistribution or publication of this material is strictly prohibited.