Categorical.from_codes shouldn't coerce to int64 #18501

Closed
dcolascione opened this issue Nov 26, 2017 · 7 comments · Fixed by #21000
Labels
Categorical Categorical Data Type good first issue Performance Memory or execution speed performance
Comments

@dcolascione

Categorical.from_codes unconditionally coerces its input to an array of np.int64, even though the Categorical constructor immediately re-coerces the input to some other dtype via coerce_indexer_dtype. This extra coercion can cause a memory-usage spike when codes is large. It seems to me that we can skip the conversion in from_codes entirely and let coerce_indexer_dtype take care of any error cases.
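For illustration (a minimal sketch; the intermediate int64 array is not observable in the result, only its memory cost is): even when the codes are supplied in a small integer dtype, the returned Categorical stores them downcast to the smallest sufficient dtype, so the int64 round-trip buys nothing:

```python
import numpy as np
import pandas as pd

# Codes supplied as int16; with only two categories, the stored codes
# end up as int8 after coerce_indexer_dtype runs in the constructor.
cat = pd.Categorical.from_codes(np.array([0, 1, 1, 0], dtype=np.int16),
                                categories=["foo", "bar"])
print(cat.codes.dtype)  # int8 -- the final storage dtype, regardless of input dtype
```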

Version: master

@jreback
Contributor

jreback commented Nov 26, 2017

Please show a copy-pastable example.

@dcolascione
Author

I'm not sure what you want. The intermediate conversion to int64 isn't observable, so there'd be nothing to show in a copy-paste example. It's an efficiency enhancement request, not a functionality one.

@jreback
Contributor

jreback commented Nov 26, 2017

I want a copy-pastable example of your input.

@dcolascione
Author

    # Shouldn't convert the input codes array to int64, because we then convert it
    # to int8 anyway.
    pd.Categorical.from_codes(codes=np.asarray([0, 1], np.int16), categories=["foo", "bar"])

@TomAugspurger
Contributor

Should be able to wrap this:

    try:
        codes = np.asarray(codes, np.int64)
    except:
        raise ValueError(
            "codes need to be convertible to an arrays of integers")

in an

if not is_integer_dtype(codes):
    # do the try / except

And see what breaks. @dcolascione could you submit a PR for that, along with tests and a release note?

Note that even if we avoid the cast to int64, I think we will still try to cast it to a smaller dtype (int8 in this case).

In [6]: pd.Categorical.from_codes(codes=np.asarray([0,1], np.int16), categories=["foo", "bar"]).codes.dtype
Out[6]: dtype('int8')

Avoiding all copies may be more difficult, but possible.

@TomAugspurger TomAugspurger added Categorical Categorical Data Type Difficulty Novice Performance Memory or execution speed performance labels Nov 26, 2017
@TomAugspurger TomAugspurger added this to the Next Major Release milestone Nov 26, 2017
@jreback
Contributor

jreback commented Nov 26, 2017

ok this looks fine to do

In [1]: arr = np.ones(10000000,dtype='int8')

# with change
In [5]: %timeit pd.Categorical.from_codes(arr, ['foo', 'bar'])
8.92 ms ± 510 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

# master
In [2]: %timeit pd.Categorical.from_codes(arr, ['foo', 'bar'])
53.2 ms ± 3.18 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

change

diff --git a/pandas/core/categorical.py b/pandas/core/categorical.py
index deaec20..8710f50 100644
--- a/pandas/core/categorical.py
+++ b/pandas/core/categorical.py
@@ -597,7 +597,7 @@ class Categorical(PandasObject):
             unordered.
         """
         try:
-            codes = np.asarray(codes, np.int64)
+            codes = np.asarray(codes)
         except:
             raise ValueError(
                 "codes need to be convertible to an arrays of integers")
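The speedup comes from avoiding an 8× temporary allocation: `np.asarray` without a dtype returns an existing ndarray as-is (zero copy), whereas forcing int64 materializes a new array. A quick check of that claim (a sketch using only NumPy, mirroring the benchmark's input):

```python
import numpy as np

arr = np.ones(10_000_000, dtype="int8")

same = np.asarray(arr)               # no dtype: same array back, zero copy
forced = np.asarray(arr, np.int64)   # forced dtype: allocates a fresh int64 array

print(np.shares_memory(same, arr))   # no intermediate allocation on this path
print(forced.nbytes // arr.nbytes)   # the int64 temporary is 8x larger
```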

@renewang

Hi, I would like to work on this issue. Is there any reason I couldn't take it on? Also, do you have any suggestions or documentation for a first-time contributor, e.g. how to write a test case? Thanks!
