DOC: Categorical "Memory Usage" uses nbytes instead of memory_usage(deep=True) #48438

Description

@tehunter

Pandas version checks

  • I have checked that the issue still exists on the latest versions of the docs on main here

Location of the documentation

https://pandas.pydata.org/docs/dev/user_guide/categorical.html#memory-usage

Documentation problem

The "Memory Usage" section states that the memory usage of an object dtype is a constant times the length of the data, then provides an example of memory usage for object and category Series using the .nbytes property.

In the example provided, the Series data contains only 3-character strings. The documentation does not address the fact that nbytes only counts the size of the "pointers" to the objects (e.g. 8 bytes * 2000 items), not the memory allocated to the string objects themselves (e.g. 52 bytes * 2000 items). An array of longer strings will take up even more memory.

import sys
import pandas as pd

s = pd.Series(["foo", "bar"] * 1000)
sys.getsizeof(s.iloc[0])
>>> 52
s.nbytes
>>> 16000
s.memory_usage(deep=True)
>>> 120128

pd.Series(["foooo", "barrr"] * 1000).memory_usage(deep=True)
>>> 124128
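
For reference, a quick sanity check (assuming CPython, where sys.getsizeof of a string matches what pandas counts, and a default RangeIndex) shows where the deep number comes from: the shallow usage plus the size of each string object.

import sys
import pandas as pd

s = pd.Series(["foo", "bar"] * 1000)

# Shallow usage: the 8-byte pointers per row plus the RangeIndex
shallow = s.memory_usage()
# Size of every Python string object referenced by the array
objects = sum(sys.getsizeof(x) for x in s)

print(shallow + objects)           # should match the deep figure above
print(s.memory_usage(deep=True))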

Even though this is in the "Gotchas" section, I think it is important to draw attention to the impact of object size. A Categorical tends to provide better memory reduction for large objects than for small objects, which may affect whether a user wants to use a Categorical at all. Here's a quick example:

import numpy as np
import pandas as pd

for t in [np.int8, np.int16, np.int32, np.int64]:
    # 64 unique categories, each repeated 16 times
    s = pd.Series([t(i) for i in range(64)] * 16)
    print(s.memory_usage(deep=True) - s.astype("category").memory_usage(deep=True))

# Negative => Categorical uses more memory; Positive => Categorical uses less memory
>>> -2616
>>> -1592
>>> 456
>>> 4552
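
The crossover comes from how a Categorical stores its data: one small integer code per row plus each category value stored once. A rough way to inspect that split for the 64-category Series above (the byte counts in the comments are what I would expect on a 64-bit build):

import numpy as np
import pandas as pd

s = pd.Series([np.int64(i) for i in range(64)] * 16)
cat = s.astype("category")

# Codes use the smallest integer dtype that fits the number of
# categories; with 64 categories that is int8, i.e. 1 byte per row.
print(cat.cat.codes.dtype)
print(cat.cat.codes.nbytes)        # 1024 rows * 1 byte
print(cat.cat.categories.nbytes)   # 64 int64 categories stored once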

Suggested fix for documentation

I recommend the text be changed to:

The memory usage of a Categorical is proportional to the number and size of its categories plus the length of the data. In contrast, the memory usage of an object dtype is proportional to the size of the objects times the length of the data.

And the code be changed to use .memory_usage(deep=True) for a more accurate understanding of the memory difference.

s = pd.Series(["foo", "bar"] * 1000)

# object dtype
s.memory_usage(deep=True)
>>> 120128

# category dtype
s.astype("category").memory_usage(deep=True)
>>> 2356
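
To make the proposed wording concrete, here is a back-of-the-envelope comparison. The helper functions are made up for illustration (not pandas API); they ignore the index and the per-pointer overhead of the unique categories, and they assume 8-byte pointers and 1-byte codes:

import sys

def rough_object_bytes(values):
    # object dtype: one 8-byte pointer per row plus every Python object
    return 8 * len(values) + sum(sys.getsizeof(v) for v in values)

def rough_categorical_bytes(values, code_size=1):
    # category dtype: one integer code per row plus each unique value
    # stored once (code_size is 1 byte while there are < 128 categories)
    uniques = set(values)
    return code_size * len(values) + sum(sys.getsizeof(v) for v in uniques)

data = ["foo", "bar"] * 1000
print(rough_object_bytes(data))       # grows with object size * length
print(rough_categorical_bytes(data))  # grows mostly with the length alone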
