DOC: Categorical "Memory Usage" uses nbytes instead of memory_usage(deep=True) #48438

tehunter opened this issue Sep 7, 2022 · 0 comments
Labels: Categorical (Categorical Data Type), Docs


Pandas version checks

  • I have checked that the issue still exists on the latest versions of the docs on main here

Location of the documentation

https://pandas.pydata.org/docs/dev/user_guide/categorical.html#memory-usage

Documentation problem

The "Memory Usage" section states that the memory usage of an object dtype is a constant times the length of the data, then provides an example of memory usage for object and category Series using the .nbytes property.

In the example provided, the Series data contains only 3-character strings. The documentation does not mention that nbytes only includes the size of the "pointers" to the objects (e.g. 8 bytes × 2000 items), and does not include the memory allocated to the string objects themselves (e.g. 52 bytes × 2000 items). An array of longer strings will take up even more memory.

>>> import sys
>>> import pandas as pd

>>> s = pd.Series(["foo", "bar"] * 1000)
>>> sys.getsizeof(s.iloc[0])
52
>>> s.nbytes
16000
>>> s.memory_usage(deep=True)
120128

>>> pd.Series(["foooo", "barrr"] * 1000).memory_usage(deep=True)
124128
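A quick sketch of where those numbers come from (an illustrative decomposition, assuming a 64-bit platform with 8-byte pointers and CPython's string sizes; exact figures vary by platform and pandas version):

import sys
import pandas as pd

s = pd.Series(["foo", "bar"] * 1000)

# nbytes counts only the object array itself: one 8-byte pointer per element.
assert s.nbytes == 8 * len(s)  # 16000

# memory_usage(deep=True) additionally sums sys.getsizeof() over the elements
# (shared objects are counted once per reference, not deduplicated) and
# includes the index (128 bytes for the default RangeIndex).
deep = s.nbytes + sum(sys.getsizeof(x) for x in s) + s.index.memory_usage()
assert deep == s.memory_usage(deep=True)  # 120128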

Even though this is in the "Gotchas" section, I think it is important to draw attention to the impact of object size. A Categorical will tend to provide better memory reduction on large objects than on small objects, which might affect whether a user wants to use a Categorical at all. Here's a quick example:

>>> import numpy as np
>>> import pandas as pd

>>> # Negative => Categorical uses more memory; positive => Categorical uses less memory
>>> for t in [np.int8, np.int16, np.int32, np.int64]:
...     # 64 unique categories, each repeated 16 times
...     s = pd.Series([t(i) for i in range(64)] * 16)
...     print(s.memory_usage(deep=True) - s.astype("category").memory_usage(deep=True))
...
-2616
-1592
456
4552
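
The effect is even easier to see with strings, where per-object size grows with length. A quick illustrative sketch (not from the original example; exact savings depend on platform and pandas version):

import pandas as pd

# 64 unique strings of increasing width, each repeated 16 times: the wider
# the strings, the more memory converting to category saves.
for width in [1, 8, 64, 512]:
    s = pd.Series([str(i).rjust(width, "x") for i in range(64)] * 16)
    saved = s.memory_usage(deep=True) - s.astype("category").memory_usage(deep=True)
    print(width, saved)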

Suggested fix for documentation

I recommend the text be changed to:

The memory usage of a Categorical is proportional to the number and size of the categories plus the length of the data. In contrast, the memory usage of an object dtype is proportional to the size of the objects times the length of the data.

And the code be changed to use .memory_usage(deep=True) for a more accurate picture of the memory difference.

>>> s = pd.Series(["foo", "bar"] * 1000)

>>> # object dtype
>>> s.memory_usage(deep=True)
120128

>>> # category dtype
>>> s.astype("category").memory_usage(deep=True)
2356
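
For intuition about where the smaller categorical footprint comes from (a sketch of the general layout, not part of the suggested doc text): each element is replaced by a small integer code, and each unique value is stored only once.

import pandas as pd

s = pd.Series(["foo", "bar"] * 1000)
c = s.astype("category")

# With only two categories, the codes fit in int8: one byte per element.
print(c.cat.codes.dtype)       # int8
print(c.cat.codes.nbytes)      # 2000 (one byte for each of the 2000 elements)
print(list(c.cat.categories))  # ['bar', 'foo'] -- each string stored once
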
tehunter added the Docs and Needs Triage labels on Sep 7, 2022
mroeschke added the Categorical label and removed the Needs Triage label on Jul 16, 2024