"Function does not reduce" for multiindex, but works fine for single index. #4293
what exactly are you trying to accomplish? the first groupby is very odd to do as is passing a reduction function of
not sure the first case ought to work either...
A simplified example would be this:
how does using I think you can one of these:
Well, if wrapping the list in a tuple is an acceptable solution, removing the check for the list from the source should also be, since it really does not add any functionality or help.
ok....will think about it.... but still I am curious, how here's another thought, why don't you
I want to compare, yes, but I need to know if the median KPIs of the group are greater or smaller than the same KPIs of each entry which makes up the group. It is not a question of equal or not. The original dataframe contains a lot more than just the columns I keep with the tolist "reduction". I could delete the other columns, but the columns coming out of our API keep changing independent of my work, so a whitelist approach is really the only way. I can of course make a whitelist approach in other ways, but this is a very simple way of getting what I need.
ok...if it works for you why can't you do the actual compare in the agg/applied function itself?
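One way to compare each entry against its group's median without collecting lists first is `transform`, which is a different technique from the agg/apply approach suggested in this comment. The sketch below assumes a made-up frame; the column names `region`, `team`, `kpi` and `extra`, and all values, are hypothetical.

```python
import pandas as pd

# Hypothetical data: two grouping keys, one KPI column, one column to ignore.
df = pd.DataFrame({
    "region": ["n", "n", "n", "s", "s"],
    "team": ["a", "a", "a", "b", "b"],
    "kpi": [1.0, 2.0, 6.0, 4.0, 8.0],
    "extra": list("vwxyz"),  # stands in for columns that come and go
})

# Whitelist: only the KPI columns named here are used.
kpi_columns = ["kpi"]

# transform("median") returns one value per original row, aligned with df,
# so the comparison against each entry is a plain elementwise operation.
group_median = df.groupby(["region", "team"])[kpi_columns].transform("median")
above_median = df[kpi_columns] > group_median

print(above_median["kpi"].tolist())
```

Because the result keeps the original row index, it can be joined back to the full frame regardless of which other columns the upstream data happens to contain.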
That may be a nice way of doing it.
np...just trying to help...will keep this open in any event..thanks for the report
I will be able to proceed with a mix of the workarounds you provided, thanks!
Just ran into this too. Rather than being forced to collect the unique values in a column into a string, I would have liked them to be collected into a list (e.g. for use later). reduce in python is really a fold left, and there's absolutely nothing wrong with a fold left returning a collection. e.g. the identity reduction for a list 'reducing' the quantity of data is not a requirement of reduce, functionally speaking (that argument also fails considering that |
I run into this working with shredded (EAV) data. Often, there are multiple values for a single column. Take data representing authors on articles:
This data is coming from a 3rd party, and I'd like to fix it so I can work with it. I want to create a 3-level index of ['recordId', 'attributeName', 'instance'] with a single column for the values. The way to create 'instance' is to:
Sadly, this doesn't work because Many other data processing platforms have recognized the utility of aggregating into lists: PostgreSQL has The exclusion is even less sensical when you consider that many methods such as |
@nbateshaus what's your desired output there? Is it

In [12]: df['instance'] = df.groupby(['recordId', 'attributeName']).value.cumcount()
In [13]: df.set_index(['recordId', 'attributeName', 'instance'])
Out[13]:
                                                                       value
recordId attributeName instance
1        title         0         Map-Reduce Extensions and Recursive Queries
         author        0                                      Foto N. Afrati
                       1                                      Vinayak Borkar
                       2                                       Michael Carey
                       3                                   Neoklis Polyzotis
                       4                                   Jeffrey D. Ullman

I know that nested data is important, but the building blocks aren't in place for pandas to support it well at the moment.
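The `cumcount` approach above can be sketched end to end as follows. The frame is rebuilt here from the values visible in the output; its construction is an assumption, since the original input data is not shown.

```python
import pandas as pd

# Shredded (EAV) rows: one row per (record, attribute, value).
# Reconstructed from the output shown above; the input layout is assumed.
df = pd.DataFrame({
    "recordId": [1, 1, 1, 1, 1, 1],
    "attributeName": ["title", "author", "author", "author", "author", "author"],
    "value": [
        "Map-Reduce Extensions and Recursive Queries",
        "Foto N. Afrati",
        "Vinayak Borkar",
        "Michael Carey",
        "Neoklis Polyzotis",
        "Jeffrey D. Ullman",
    ],
})

# cumcount numbers the rows within each (recordId, attributeName) group
# starting at 0, which gives the 'instance' level without building lists.
df["instance"] = df.groupby(["recordId", "attributeName"]).cumcount()

indexed = df.set_index(["recordId", "attributeName", "instance"])
print(indexed)
```

Each repeated attribute gets its own index entry, so individual values stay addressable, for example `indexed.loc[(1, "author", 4), "value"]`.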
Yes, that's the output I want. Assuming I sort by ['e','a'] first, this is probably faster, too. It is still very convenient to be able to create list values. Looking at the history of this restriction, it looks like it was accidentally introduced by a transcription error in f3c0a08 - at the time, it was an assertion, and it went from
where, as far as I can tell,
The original assertion was added without comment in 71e9046, in response to #612 "Pure python multi-key groupby can't handle non-numeric results". Which reveals another oddity: groupby().agg(pd.Series.tolist) works fine for single-key groupings; it only fails for multi-key groupings.
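A minimal sketch of the single-key versus multi-key comparison described here, using a small made-up frame (the column names `a`, `b`, `c` and the values are hypothetical, not taken from the original report). On the pandas versions discussed in this thread the multi-key call raised "Function does not reduce"; on recent versions both calls are expected to return lists.

```python
import pandas as pd

# Hypothetical data, chosen only to give two groups with two rows each.
df = pd.DataFrame({
    "a": ["x", "x", "y", "y"],
    "b": [1, 1, 2, 2],
    "c": [10, 20, 30, 40],
})

# Single-key grouping: each group's values are collected into a list.
single = df.groupby("a")["c"].agg(pd.Series.tolist)

# Multi-key grouping: the same reducer, but the result has a MultiIndex.
multi = df.groupby(["a", "b"])["c"].agg(pd.Series.tolist)

print(single)
print(multi)
```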
The example by @frinx seems to now work fine, and @bobhaffner said that #18354 "might" close this. So assuming this is solved. @nbateshaus please feel free to provide a reproducible example if this is still an issue.
When I use pd.Series.tolist as a reducer with a single column groupby, it works.
When I do the same with multiindex, it does not.
It seems the "fast" cython groupby function, which has no quarrel with reducing into lists, throws an exception if the index is "complex", which seems to mean a multiindex. When that exception is caught, the groupby function falls back to the "pure_python" groupby, which throws a new exception if the reducing function returns a list.
Is this a bug or is there some logic to this which is not apparent to me?
Reproduce: