PERF: Allow jitting of groupby agg loop #35759
Conversation
sorted_data, sorted_index, starts, ends, len(group_keys), len(data.columns),
)
if cache_key not in NUMBA_FUNC_CACHE:
    NUMBA_FUNC_CACHE[cache_key] = numba_agg_func
Can't this be moved into the else?
Should you check that the cache is being used properly in a test?
I was thinking to evaluate the function with all arguments first, before putting the function in the cache, so we're not caching a function that may fail.
I have existing tests that check for the presence of the function in the cache here:
assert (func_1, "groupby_agg") in NUMBA_FUNC_CACHE
yeah would want to move this all to groupby/numba_.py rather than here (you can certainly cache after, but ideally all of the caching is not exposed here; I think we did this elsewhere IIRC)
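The "run first, cache only on success" idea from this thread can be written as a small self-contained helper, which could also live in groupby/numba_.py as suggested. This is only a sketch: `run_and_cache` and `compile_func` are hypothetical names for illustration, and only the `(function, "operation")` key shape comes from the PR.

```python
from typing import Any, Callable, Dict, Tuple

# Hypothetical stand-in mirroring the shape of pandas' NUMBA_FUNC_CACHE.
NUMBA_FUNC_CACHE: Dict[Tuple[Callable, str], Callable] = {}


def run_and_cache(
    cache_key: Tuple[Callable, str],
    compile_func: Callable[[], Callable],
    *args: Any,
) -> Any:
    """Fetch (or compile) a jitted function, run it, and cache it only on success."""
    func = NUMBA_FUNC_CACHE.get(cache_key)
    if func is None:
        func = compile_func()
    result = func(*args)  # if this raises, nothing is added to the cache
    NUMBA_FUNC_CACHE.setdefault(cache_key, func)
    return result
```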
@@ -230,6 +227,18 @@ def apply(self, func, *args, **kwargs):
)
def aggregate(self, func=None, *args, engine=None, engine_kwargs=None, **kwargs):

if maybe_use_numba(engine):
would not object to making a _aggregate_with_python_cython (where you put everything from L242 on down).
I could do this in a follow-up refactor PR.
I guess I would need to make a Series and a DataFrame version of this function, since it looks like the two are different.
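A toy sketch of the proposed dispatch shape; the class and method bodies below are purely illustrative (not pandas code), and `_aggregate_with_python_cython` is only the name suggested in the review.

```python
class ToyGroupBy:
    """Illustration of dispatching between the numba path and the existing path."""

    def aggregate(self, func, engine=None):
        # pandas checks this with maybe_use_numba(engine)
        if engine == "numba":
            return self._aggregate_with_numba(func)
        return self._aggregate_with_python_cython(func)

    def _aggregate_with_numba(self, func):
        raise NotImplementedError

    def _aggregate_with_python_cython(self, func):
        # In a follow-up refactor this would hold everything currently below the
        # numba branch, likely with separate Series and DataFrame variants.
        raise NotImplementedError
```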
)
return self.obj._constructor(result, index=index, columns=data.columns)

relabeling, func, columns, order = reconstruct_func(func, **kwargs)
same comment here
self, func, *args, engine="cython", engine_kwargs=None, **kwargs
):
def _aggregate_with_numba(self, data, func, *args, engine_kwargs=None, **kwargs):
    group_keys = self.grouper._get_group_keys()
can you add a doc-string and type as much as possible?
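A sketch of what the requested docstring and annotations could look like. The parameter list mirrors the signature shown in the diff; the specific annotations and the `FrameOrSeries` alias (assumed to come from `pandas._typing` at the time) are assumptions, not the final implementation.

```python
from typing import Any, Callable, Dict, Optional

from pandas._typing import FrameOrSeries  # internal alias; assumed import path


def _aggregate_with_numba(
    self,
    data: FrameOrSeries,
    func: Callable,
    *args: Any,
    engine_kwargs: Optional[Dict[str, bool]] = None,
    **kwargs: Any,
):
    """
    Group-wise aggregation using a user-defined function JIT-compiled by numba.

    The data is sorted by group, the jitted function is applied per group, and
    the compiled function is cached under (func, "groupby_agg").
    """
    ...
```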
sorted_labels = algorithms.take_nd(labels, sorted_index, allow_fill=False)
sorted_data = data.take(sorted_index, axis=self.axis).to_numpy()
starts, ends = lib.generate_slices(sorted_labels, n_groups)
cache_key = (func, "groupby_agg")
is this consistent with other functions, e.g. transform and rolling and such (the cache keys)?
Yeah, these keys are all formatted similarly: (function, "string of the operation").
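For illustration, the key shape being described; the placeholder function below and the note about other operations' label strings are assumptions.

```python
def func(values, index):  # placeholder user-defined aggregation
    return values.sum()


# Keys pair the user function with an operation label string; other numba-backed
# paths (e.g. rolling apply, groupby transform) follow the same shape with their
# own label strings.
cache_key = (func, "groupby_agg")
```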
num_groups: int,
num_columns: int,
) -> np.ndarray:
    result = np.empty((num_groups, num_columns))
do we need to type this?
I was thinking float (the default type) would be the safest here?
- Mixed int & float frame = float
- Float frame = float
- Int frame = int
If there's a desire to infer a more appropriate type (int), I could include inference logic.
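A small runnable illustration of the dtype point above: `np.empty` with no explicit dtype allocates float64, so values written into the result buffer are stored as float. This is plain NumPy behavior, not pandas-specific logic.

```python
import numpy as np

num_groups, num_columns = 3, 2

# The result buffer defaults to float64 when no dtype is given.
result = np.empty((num_groups, num_columns))
print(result.dtype)  # float64

# Writing an integer aggregation into the buffer upcasts it to float.
int_values = np.array([[1, 2], [3, 4]], dtype=np.int64)
result[0, :] = int_values.sum(axis=0)
print(result[0])  # [4. 6.]
```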
numba = import_optional_dependency("numba")

if parallel:
could make a helper function for this (as we likely need this elsewhere?)
This is only mirrored in one other place currently (rolling). I can consolidate when the pattern grows.
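A hypothetical helper of the kind being suggested; the name and exact shape are illustrative, not how pandas actually factored this out.

```python
import numba


def get_loop_range(parallel: bool):
    """Return numba.prange for parallel execution, plain range otherwise (sketch)."""
    return numba.prange if parallel else range


# Usage inside (or when building) a jitted aggregation loop body:
loop_range = get_loop_range(parallel=False)
total = sum(i for i in loop_range(3))
```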
@@ -129,94 +127,3 @@ def impl(data, *_args):
return impl

return numba_func


def split_for_numba(arg: FrameOrSeries) -> Tuple[np.ndarray, np.ndarray]:
are these not used in the window functions?
Nope, only the groupby functions use them, which is why I moved them to the groupby/numba_.py file.
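A hedged sketch of what a helper with this signature plausibly does: convert a pandas object into the plain (values, index) ndarrays that a jitted user function can consume in nopython mode. The real pandas implementation may differ in detail.

```python
from typing import Tuple, Union

import numpy as np
import pandas as pd


def split_for_numba(arg: Union[pd.Series, pd.DataFrame]) -> Tuple[np.ndarray, np.ndarray]:
    """Split a Series/DataFrame into (values, index) ndarrays for numba (sketch)."""
    return arg.to_numpy(), arg.index.to_numpy()


values, index = split_for_numba(pd.Series([1.0, 2.0, 3.0]))
```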
Hello @mroeschke! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found: There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻 Comment last updated at 2020-08-21 23:15:46 UTC
@jreback all green
thanks @mroeschke; as discussed, if you can follow up to consolidate and clean up.
black pandas
git diff upstream/master -u -- "*.py" | flake8 --diff
New performance comparison for 10,000 groups