PERF: faster placement creating extension blocks from arrays #32856
Using the same example from #32826. With:
it gives
This is also indirectly covered by the sparse benchmark, but adding some benchmarks specifically for
pandas/core/internals/managers.py
Outdated
make_block(array, klass=ObjectValuesExtensionBlock, placement=[i])
make_block(
    array, klass=ObjectValuesExtensionBlock, placement=slice(i, i + 1, 1)
)
Rather than changing this here, simply convert a single integer into a slice at a lower level.
Ah, indeed, I can create the slice inside the BlockPlacement constructor. But don't we then want to explicitly pass a single integer instead of a list of 1 integer?
we require list / slice
To compare, I just pushed a commit that does both.
Personally, I like passing it as an integer. It's another 33% faster compared to passing it as a slice or a 1-element list, and it also makes explicit at construction time that this is about a single column.
Whoops, I missed that you proposed a single integer yourself (for some reason I thought you wanted to catch the single-element list in BlockPlacement). Updated.
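For readers following this thread, the idea of accepting a single integer and turning it into a slice at a lower level can be sketched roughly as below. This is only an illustration: `normalize_placement` and `placement_as_array` are hypothetical helpers, not the actual BlockPlacement constructor or any pandas API.

```python
import numpy as np


def normalize_placement(val):
    """Hypothetical helper: return a slice or an intp array for a placement."""
    if isinstance(val, (int, np.integer)):
        # A single column position becomes a length-1 slice; no array is allocated.
        return slice(int(val), int(val) + 1, 1)
    if isinstance(val, slice):
        return val
    # Lists and arrays take the slower path through np.require.
    return np.require(val, dtype=np.intp)


def placement_as_array(placement):
    """Hypothetical helper: materialize positions as an array only when needed."""
    if isinstance(placement, slice):
        return np.arange(
            placement.start, placement.stop, placement.step, dtype=np.intp
        )
    return placement
```

Under these assumptions, `normalize_placement(i)` and `normalize_placement([i])` describe the same single column, but only the second allocates an array up front.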
another option would be to pass BlockPlacement objects. If we do that consistently, we could remove the checks/casting in the constructor/property
Sorry, I don't understand that option. You still need to create BlockPlacement objects, right? And the question here is about how to create them (from an integer, or from a slice, or a 1-len list)
Yah, the thought was about creating the BlockPlacement object and passing it to
It shouldn't matter much I think, it is just passed through until ExtensionBlock init, and there it's doing a So we would first need to eliminate all other places where we pass a slice/array as placement to block creation, and then it would only eliminate one isinstance call
Yes, it is not a trivial idea.
We could also make mgr_locs not-a-property, so get marginally faster lookups. But again, ignore as orthogonal.
Yep, any other comments on the PR itself?
LGTM
self.columns = pd.Index(range(N_cols))

def time_frame_from_arrays_float(self):
    self.df = DataFrame._from_arrays(
I wouldn't change this if there is no other feedback, but I don't think you need the assignment here in any of the benchmarks
True (was only mimicking the other benchmarks in this file)
hmm this is wrong, but yeah can clean this up in a followup (if you want to create a followup issue or PR to do it)
thanks @jorisvandenbossche
When creating a DataFrame from many arrays stored in ExtensionBlocks, quite some time is spent inside BlockPlacement calling np.require on the passed list. Specifying the placement as a slice instead makes creating the BlockPlacement much faster. This delays the conversion to an array, but converting the slice to an array inside BlockPlacement when needed is still faster than initially creating a BlockPlacement from a list/array of 1 element.
From investigating #32196 (comment)
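The trade-off described above can be approximated with plain NumPy, without touching pandas internals. This is a rough sketch, not the benchmark used in this PR: the column position `i` and the repetition count are arbitrary assumptions, and timings will vary by machine.

```python
import timeit

import numpy as np

i = 5  # hypothetical column position

# Eager path: a 1-element list is converted to an array right away.
from_list = np.require([i], dtype=np.intp)

# Lazy path: building the slice allocates no array ...
as_slice = slice(i, i + 1, 1)
# ... and it is only expanded to an array if the positions are requested.
from_slice = np.arange(as_slice.start, as_slice.stop, as_slice.step, dtype=np.intp)

# Both paths describe the same single position.
assert (from_list == from_slice).all()

t_list = timeit.timeit(lambda: np.require([i], dtype=np.intp), number=10_000)
t_slice = timeit.timeit(lambda: slice(i, i + 1, 1), number=10_000)
print(f"np.require on list: {t_list:.4f}s, slice construction: {t_slice:.4f}s")
```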
@rth this reduces it with another third! (only from the dataframe creation, to be clear)