PERF: Allow jitting of groupby agg loop #35759
Conversation
sorted_data, sorted_index, starts, ends, len(group_keys), len(data.columns),
)
if cache_key not in NUMBA_FUNC_CACHE:
    NUMBA_FUNC_CACHE[cache_key] = numba_agg_func
Can't this be moved into the else?
Should you check that the cache is being used properly in a test?
I was thinking to evaluate the function with all arguments first, before putting the function in the cache, so we're not caching a function that may fail.
I have existing tests that check for the presence of the function in the cache here:
assert (func_1, "groupby_agg") in NUMBA_FUNC_CACHE
yeah would want to move this all to groupby/numba_.py rather than here (you can certainly cache after, but ideally all of the caching is not exposed here; I think we did this elsewhere IIRC)
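The "run first, cache only on success" idea from this thread can be written as a small self-contained helper, which could also live in groupby/numba_.py as suggested. This is only a sketch: `run_and_cache` and `compile_func` are hypothetical names for illustration, and only the `(function, "operation")` key shape comes from the PR.

```python
from typing import Any, Callable, Dict, Tuple

# Hypothetical stand-in mirroring the shape of pandas' NUMBA_FUNC_CACHE.
NUMBA_FUNC_CACHE: Dict[Tuple[Callable, str], Callable] = {}


def run_and_cache(
    cache_key: Tuple[Callable, str],
    compile_func: Callable[[], Callable],
    *args: Any,
) -> Any:
    """Fetch (or compile) a jitted function, run it, and cache it only on success."""
    func = NUMBA_FUNC_CACHE.get(cache_key)
    if func is None:
        func = compile_func()
    result = func(*args)  # if this raises, nothing is added to the cache
    NUMBA_FUNC_CACHE.setdefault(cache_key, func)
    return result
```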
@@ -230,6 +227,18 @@ def apply(self, func, *args, **kwargs):
)
def aggregate(self, func=None, *args, engine=None, engine_kwargs=None, **kwargs):

if maybe_use_numba(engine):
would not object to making a _aggregate_with_python_cython (where you put everything from L242 on down).
I could do this in a follow-up refactor PR.
I guess I would need to make a Series and a DataFrame version of this function, since it looks like the two are different.
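A toy sketch of the proposed dispatch shape; the class and method bodies below are purely illustrative (not pandas code), and `_aggregate_with_python_cython` is only the name suggested in the review.

```python
class ToyGroupBy:
    """Illustration of dispatching between the numba path and the existing path."""

    def aggregate(self, func, engine=None):
        # pandas checks this with maybe_use_numba(engine)
        if engine == "numba":
            return self._aggregate_with_numba(func)
        return self._aggregate_with_python_cython(func)

    def _aggregate_with_numba(self, func):
        raise NotImplementedError

    def _aggregate_with_python_cython(self, func):
        # In a follow-up refactor this would hold everything currently below the
        # numba branch, likely with separate Series and DataFrame variants.
        raise NotImplementedError
```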
)
return self.obj._constructor(result, index=index, columns=data.columns)

relabeling, func, columns, order = reconstruct_func(func, **kwargs)
same comment here
self, func, *args, engine="cython", engine_kwargs=None, **kwargs
):
def _aggregate_with_numba(self, data, func, *args, engine_kwargs=None, **kwargs):
    group_keys = self.grouper._get_group_keys()
can you add a doc-string and type as much as possible?
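A sketch of what the requested docstring and annotations could look like. The parameter list mirrors the signature shown in the diff; the specific annotations and the `FrameOrSeries` alias (assumed to come from `pandas._typing` at the time) are assumptions, not the final implementation.

```python
from typing import Any, Callable, Dict, Optional

from pandas._typing import FrameOrSeries  # internal alias; assumed import path


def _aggregate_with_numba(
    self,
    data: FrameOrSeries,
    func: Callable,
    *args: Any,
    engine_kwargs: Optional[Dict[str, bool]] = None,
    **kwargs: Any,
):
    """
    Group-wise aggregation using a user-defined function JIT-compiled by numba.

    The data is sorted by group, the jitted function is applied per group, and
    the compiled function is cached under (func, "groupby_agg").
    """
    ...
```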
sorted_labels = algorithms.take_nd(labels, sorted_index, allow_fill=False)
sorted_data = data.take(sorted_index, axis=self.axis).to_numpy()
starts, ends = lib.generate_slices(sorted_labels, n_groups)
cache_key = (func, "groupby_agg")
is this consistent with other functions, e.g. transform and rolling and such (the cache keys)?
Yeah, these keys are all formatted similarly: (function, "string of the operation").
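For illustration, the key shape being described; the placeholder function below and the note about other operations' label strings are assumptions.

```python
def func(values, index):  # placeholder user-defined aggregation
    return values.sum()


# Keys pair the user function with an operation label string; other numba-backed
# paths (e.g. rolling apply, groupby transform) follow the same shape with their
# own label strings.
cache_key = (func, "groupby_agg")
```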
num_groups: int,
num_columns: int,
) -> np.ndarray:
    result = np.empty((num_groups, num_columns))
do we need to type this?
I was thinking float (the default type) would be the safest here?
- Mixed int & float frame = float
- Float frame = float
- Int frame = int
If there's a desire to infer a more appropriate type (int), I could include inference logic.
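A small runnable illustration of the dtype point above: `np.empty` with no explicit dtype allocates float64, so values written into the result buffer are stored as float. This is plain NumPy behavior, not pandas-specific logic.

```python
import numpy as np

num_groups, num_columns = 3, 2

# The result buffer defaults to float64 when no dtype is given.
result = np.empty((num_groups, num_columns))
print(result.dtype)  # float64

# Writing an integer aggregation into the buffer upcasts it to float.
int_values = np.array([[1, 2], [3, 4]], dtype=np.int64)
result[0, :] = int_values.sum(axis=0)
print(result[0])  # [4. 6.]
```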
numba = import_optional_dependency("numba")

if parallel:
could make a helper function for this (as we likely need this elsewhere?)
This is only mirrored in one other place currently (rolling). I can consolidate when the pattern grows.
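A hypothetical helper of the kind being suggested; the name and exact shape are illustrative, not how pandas actually factored this out.

```python
import numba


def get_loop_range(parallel: bool):
    """Return numba.prange for parallel execution, plain range otherwise (sketch)."""
    return numba.prange if parallel else range


# Usage inside (or when building) a jitted aggregation loop body:
loop_range = get_loop_range(parallel=False)
total = sum(i for i in loop_range(3))
```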
@@ -129,94 +127,3 @@ def impl(data, *_args):
return impl

return numba_func


def split_for_numba(arg: FrameOrSeries) -> Tuple[np.ndarray, np.ndarray]:
are these not used in the window functions?
Nope, only the groupby functions use them, which is why I moved them to the groupby/numba_.py file.
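A hedged sketch of what a helper with this signature plausibly does: convert a pandas object into the plain (values, index) ndarrays that a jitted user function can consume in nopython mode. The real pandas implementation may differ in detail.

```python
from typing import Tuple, Union

import numpy as np
import pandas as pd


def split_for_numba(arg: Union[pd.Series, pd.DataFrame]) -> Tuple[np.ndarray, np.ndarray]:
    """Split a Series/DataFrame into (values, index) ndarrays for numba (sketch)."""
    return arg.to_numpy(), arg.index.to_numpy()


values, index = split_for_numba(pd.Series([1.0, 2.0, 3.0]))
```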
Hello @mroeschke! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found: There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻 Comment last updated at 2020-08-21 23:15:46 UTC
@jreback all green
thanks @mroeschke; as discussed, if you can follow up to consolidate and clean up.
black pandas
git diff upstream/master -u -- "*.py" | flake8 --diff
New performance comparison for 10,000 groups