PERF: NaT groups cause wrong path in grouping #11010

Closed
jreback opened this issue Sep 5, 2015 · 3 comments · Fixed by #11023
Labels: Datetime, Groupby, Missing-data, Performance


@jreback
Contributor

jreback commented Sep 5, 2015

xref #10625

before this patch:

In [1]: from string import ascii_lowercase
In [2]: np.random.seed(2718281)
In [3]: n = 1 << 21
In [4]: dr = date_range('2015-08-30', periods=n // 10, freq='T')
In [5]: df = DataFrame({
   ...:         '1st':np.random.choice(list(ascii_lowercase), n),
   ...:         '2nd':np.random.randint(0, 5, n),
   ...:         '3rd':np.random.choice(dr, n)})

In [6]: df.loc[np.random.choice(n, n // 10), '3rd'] = np.nan
In [7]: gr = df.groupby(['1st', '2nd'])

In [8]: %timeit gr.count()
The slowest run took 21.22 times longer than the fastest. This could mean that an intermediate result is being cached 
1 loops, best of 3: 13.3 ms per loop

In [9]: %timeit gr.count()
100 loops, best of 3: 13.8 ms per loop

In [10]: pd.__version__
Out[10]: '0.16.2+521.g207efc2'

with this patch:

In [8]: %timeit gr.count()
1 loops, best of 3: 144 ms per loop

In [9]: %timeit gr.count()
10 loops, best of 3: 149 ms per loop

In [10]: pd.__version__
Out[10]: '0.16.2+522.g9c2d1a6'
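
For rerunning the comparison outside IPython, the setup above can be sketched as a standalone script. This is only a sketch: the helper names `build_frame` and `time_count` are hypothetical, `freq='min'` stands in for the `'T'` alias, `pd.NaT` is assigned directly instead of `np.nan`, and a local `RandomState` replaces the global seed, so the exact rows differ from the session above. Timings depend on the pandas version and the machine.

```python
import timeit
from string import ascii_lowercase

import numpy as np
import pandas as pd


def build_frame(n=1 << 21, seed=2718281):
    """Two key columns plus a datetime64 column in which some rows are NaT."""
    rng = np.random.RandomState(seed)
    dr = pd.date_range('2015-08-30', periods=n // 10, freq='min')
    df = pd.DataFrame({
        '1st': rng.choice(list(ascii_lowercase), n),
        '2nd': rng.randint(0, 5, n),
        '3rd': rng.choice(dr.values, n),
    })
    # Overwrite a random sample of rows (drawn with replacement) with NaT.
    df.loc[rng.choice(n, n // 10), '3rd'] = pd.NaT
    return df


def time_count(df, repeat=3, number=10):
    """Return the best observed time for one gr.count() call, in milliseconds."""
    gr = df.groupby(['1st', '2nd'])
    best = min(timeit.repeat(gr.count, repeat=repeat, number=number))
    return best / number * 1e3
```

Calling `time_count(build_frame())` under each of the two builds should give numbers comparable to the `%timeit` results above.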
jreback added the Datetime, Groupby, Missing-data, and Performance labels Sep 5, 2015
jreback added this to the 0.17.0 milestone Sep 5, 2015
@jreback
Contributor Author

jreback commented Sep 5, 2015

cc @larvian
cc @behzadnouri

@jreback
Contributor Author

jreback commented Sep 7, 2015

After #11013 this seems OK.

Is there an asv benchmark for this (e.g. count on datetime64 with NaT)?

In [11]: %timeit gr.count()
100 loops, best of 3: 7.16 ms per loop

In [12]: pd.__version__
Out[12]: '0.16.2+599.g33530b3'
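
A benchmark along those lines might look like the sketch below. It follows the asv convention of a class with a `setup` method and `time_`-prefixed methods. The class name, the data size, and the use of `pd.NaT` and `freq='min'` are assumptions for illustration, not an existing benchmark in the pandas suite.

```python
from string import ascii_lowercase

import numpy as np
import pandas as pd


class GroupByCountDatetimeNaT:
    """Hypothetical asv benchmark: groupby count on a datetime64 column with NaT."""

    def setup(self):
        n = 1 << 17  # assumed size, smaller than the report's 1 << 21
        rng = np.random.RandomState(2718281)
        dr = pd.date_range('2015-08-30', periods=n // 10, freq='min')
        self.df = pd.DataFrame({
            '1st': rng.choice(list(ascii_lowercase), n),
            '2nd': rng.randint(0, 5, n),
            '3rd': rng.choice(dr.values, n),
        })
        # Overwrite a random sample of rows with NaT so the NaT path is exercised.
        self.df.loc[rng.choice(n, n // 10), '3rd'] = pd.NaT
        self.gr = self.df.groupby(['1st', '2nd'])

    def time_count(self):
        # asv times methods whose names start with "time_".
        self.gr.count()
```

Building the groupby object in `setup` keeps frame construction out of the timed region, so only `count()` is measured.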

@behzadnouri
Contributor

It is only OK for count, since I removed the Cython wrapper. It still breaks the other cythonized methods.

jreback added a commit to jreback/pandas that referenced this issue Sep 7, 2015
jreback added a commit that referenced this issue Sep 8, 2015
PERF: use NaT comparisons in int64/datetimelikes #11010
nickeubank pushed a commit to nickeubank/pandas that referenced this issue Sep 29, 2015