-
-
Notifications
You must be signed in to change notification settings - Fork 18.5k
BUG: DataFrame.diff(axis=1) with mixed (or EA) dtypes #32995
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we can just raise NotImplemented here. I suspect a mix of floats and bool will be somewhat common.
Can we instead skip doing things blockwise for axis=1
and do it columnwise instead? Something like
In [39]: df = pd.DataFrame({'a': [1,1,2,2],'b': [1, 2, 3, 4.], 'c': [True, False, False, True]})
In [40]: import toolz
In [41]: pd.concat([a - b for a, b in toolz.sliding_window(2, (df.iloc[:, i] for i in range(len(df.columns))))], axis=1)
Out[41]:
0 1
0 0.0 0.0
1 -1.0 2.0
2 -1.0 3.0
3 -2.0 3.0
(without using toolz
, and including the all-NA columns and fixing the column names). We could perhaps only do that when nblocks > 1
, to preserve the performance in the homogenous case.
That would also allow us to avoid consolidating, which would be nice. |
updated to operate column-wise. DatetimeTZBlock.diff could have its NotImplementedError case removed in a follow-up |
@@ -6667,6 +6667,30 @@ def diff(self, periods: int = 1, axis: Axis = 0) -> "DataFrame": | |||
5 NaN NaN NaN | |||
""" | |||
bm_axis = self._get_block_manager_axis(axis) | |||
self._consolidate_inplace() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is not pretty. why are we not simply transposing and calling .diff()?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why are we not simply transposing and calling .diff()?
I'd be fine with that, but it is potentially costly
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I personally think the column-wise approach could be fine
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this should be co-located with BlockManager.diff. but to be honest I think transposing is just fine here. The block type is already inferred and handled. This is just adding a lot of complexity.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if you really want to do this column based approach them move this with the other .diff methods (I actually prefer just a transpose here, its totally fine for now)
Do we have anything approaching consensus here? I'm OK with either column-wise or transpose |
I'm pretty strongly in favor of columnwise when there's more than 1 block. While I don't think sparse support should ever be a primary motivator, the difference in performance of a columnwise diff vs. a In [6]: a = pd.arrays.SparseArray([1] + [0] * 1000)
In [7]: df = pd.DataFrame({"A": a, "B": a}) will be huge. |
@@ -6667,6 +6667,30 @@ def diff(self, periods: int = 1, axis: Axis = 0) -> "DataFrame": | |||
5 NaN NaN NaN | |||
""" | |||
bm_axis = self._get_block_manager_axis(axis) | |||
self._consolidate_inplace() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if you really want to do this column based approach them move this with the other .diff methods (I actually prefer just a transpose here, its totally fine for now)
@jreback is column-wise vs transpose a deal breaker? if not, i think this is ready |
if you really want to do columnwise can you move impl to another location as indicated the code is pretty complex and should be separated from inline in the function |
thanks, IIRC there might be some issues this closes if you'd have a look |
black pandas
git diff upstream/master -u -- "*.py" | flake8 --diff