Description
Pandas version checks
- I have checked that the issue still exists on the latest versions of the docs on main here
Location of the documentation
Documentation problem
The "Memory Usage" section states that the memory usage of an object
dtype is a constant times the length of the data, then provides an example of memory usage for object
and category
Series using the .nbytes
property.
In the example provided, the Series data contains only 3-character strings. The documentation does not address the fact that nbytes only includes the size of the "pointers" to the objects (e.g. 8 bytes * 2000 items) and does not include the memory allocated to the string objects themselves (e.g. 52 bytes * 2000 items). An array of longer strings will take up even more memory.
import sys
import pandas as pd
s = pd.Series(["foo", "bar"] * 1000)
sys.getsizeof(s.iloc[0])
>>> 52
s.nbytes
>>> 16000
s.memory_usage(deep=True)
>>> 120128
pd.Series(["foooo", "barrr"] * 1000).memory_usage(deep=True)
>>> 124128
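For reference, here is a rough breakdown of where the 120128 bytes in the example above come from, assuming a 64-bit CPython build where each object pointer is 8 bytes and sys.getsizeof("foo") is 52:

# 2000 pointers * 8 bytes        -> 16000   (s.nbytes)
# 2000 string objects * 52 bytes -> 104000  (sys.getsizeof per string)
# RangeIndex overhead            -> 128     (s.memory_usage() - s.nbytes)
16000 + 104000 + 128
>>> 120128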
Even though this is in the "Gotchas" section, I think it is important to draw attention to the impact of object size. A Categorical will tend to provide better memory reduction for large objects than for small objects, which may affect whether a user wants to use a Categorical at all. Here's a quick example:
import numpy as np
import pandas as pd

for t in [np.int8, np.int16, np.int32, np.int64]:
    # 64 unique categories, each repeated 16 times
    s = pd.Series([t(i) for i in range(64)] * 16)
    # Negative => Categorical uses more memory; Positive => Categorical uses less memory
    print(s.memory_usage(deep=True) - s.astype("category").memory_usage(deep=True))
>>> -2616
>>> -1592
>>> 456
>>> 4552
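The effect is even more pronounced with strings, since every repeated value is stored as a separate Python object under object dtype. A quick sketch of the same comparison (outputs omitted; exact numbers depend on the Python build), varying a hypothetical string width:

import pandas as pd

for width in [1, 10, 100]:
    # 2 unique strings of `width` characters, repeated 1000 times each
    s = pd.Series(["x" * width, "y" * width] * 1000)
    # The savings from converting to category grow with the size of the objects
    print(width, s.memory_usage(deep=True) - s.astype("category").memory_usage(deep=True))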
Suggested fix for documentation
I recommend the text be changed to:
The memory usage of a Categorical is proportional to the number and size of the categories, plus the length of the data. In contrast, the memory usage of an object dtype is proportional to the size of the objects times the length of the data.
And the code example be changed to use .memory_usage(deep=True) for a more accurate understanding of the memory difference.
s = pd.Series(["foo", "bar"] * 1000)
# object dtype
s.memory_usage(deep=True)
>>> 120128
# category dtype
s.astype("category").memory_usage(deep=True)
>>> 2356
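To reinforce the suggested wording, here is a small sketch (hypothetical values; output omitted) showing that the memory usage of a Categorical grows with the number of unique categories for a fixed data length:

import pandas as pd

for n_unique in [2, 200, 2000]:
    # 2000 values in total, drawn from n_unique distinct labels
    s = pd.Series([f"val{i}" for i in range(n_unique)] * (2000 // n_unique))
    print(n_unique, s.astype("category").memory_usage(deep=True))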