
"Function does not reduce" for multiindex, but works fine for single index. #4293


Closed
frlnx opened this issue Jul 19, 2013 · 17 comments
Labels
Dtype Conversions (Unexpected or buggy dtype conversions) · Groupby · MultiIndex

Comments

@frlnx

frlnx commented Jul 19, 2013

When I use pd.Series.tolist as a reducer with a single-column groupby, it works.
When I do the same with a multi-column groupby (yielding a MultiIndex), it does not.

It seems the "fast" Cython groupby function, which has no quarrel with reducing into lists, throws an exception if the index is "complex", which seems to mean a MultiIndex. When that exception is caught, the groupby falls back to the "pure python" path, which throws a new exception if the reducing function returns a list.

Is this a bug or is there some logic to this which is not apparent to me?

Reproduce:

import pandas as pd
from numpy.random import randn

s1 = pd.Series(randn(5), index=['a', 'b', 'c', 'd', 'e'])
df = pd.DataFrame([s1], columns=['a', 'b', 'c', 'd', 'e'])
for i in range(10):
    s1 = pd.Series(randn(5), index=['a', 'b', 'c', 'd', 'e'])
    df2 = pd.DataFrame([s1], columns=['a', 'b', 'c', 'd', 'e'])
    df = pd.concat([df, df2])
df['gk'] = 'foo'
df['gk2'] = 'bar'

# This works.
df.groupby(['gk']).agg(pd.Series.tolist)

# This does not; it raises ValueError: Function does not reduce.
df.groupby(['gk', 'gk2']).agg(pd.Series.tolist)
@jreback
Contributor

jreback commented Jul 19, 2013

what exactly are you trying to accomplish?

the first groupby is very odd to do

as is passing pd.Series.tolist as a reduction function, which is technically not a reduction function at all

@jtratner
Contributor

not sure the first case ought to work either...

@frlnx
Author

frlnx commented Jul 22, 2013

A simplified example would be this:
I use it to compare the KPIs of single rows to their peers, the peers being determined by the group key. I compare each row's KPIs to the means of the KPIs of its group. It's an efficient way of doing it, since I do not want to keep the original dataset in memory. This example only has two levels: the superset and the entry being compared. I actually do this on four levels, which makes it a whole lot messier, and the tolist helps strip the complexity down.
If I could not use tolist as an aggregator (which in my experience is quite common practice; lots of SQL variants support it), I would have to keep both the original dataset and the grouped dataset and then access the original by index.
Needless to say, that would eat my memory in no time, and my intuition tells me it would be slower. I could perform some tests and publish them if necessary.

@jreback
Contributor

jreback commented Jul 22, 2013

how does using tolist save you any data? it's the same data, just in a list, and comparisons are then hard

I think you can do one of these:

  • df.groupby(keys).apply(lambda x: x._get_numeric_data().abs().sum()) or another function that effectively hashes a row together
  • df.groupby(['gk','gk2']).agg(lambda x: tuple(x.tolist())) will do what you want with a multi-index (or single index); as a tuple it is inferred as a reduction (see the sketch below)
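A minimal runnable sketch of the tuple workaround (the frame construction mirrors the repro above; only the groupby/agg behavior is being demonstrated):

import pandas as pd
from numpy.random import randn

df = pd.DataFrame(randn(10, 5), columns=['a', 'b', 'c', 'd', 'e'])
df['gk'] = 'foo'
df['gk2'] = 'bar'

# Fails with "ValueError: Function does not reduce" on affected versions:
# df.groupby(['gk', 'gk2']).agg(pd.Series.tolist)

# Works: a tuple is treated as a scalar result per group, so pandas
# accepts the function as a reduction.
out = df.groupby(['gk', 'gk2']).agg(lambda x: tuple(x.tolist()))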

@frlnx
Author

frlnx commented Jul 22, 2013

Well, if wrapping the list in a tuple is an acceptable solution, removing the check for the list from the source should also be, since it really does not add any functionality or help.

@jreback
Contributor

jreback commented Jul 22, 2013

ok....will think about it....

but still I am curious how tolist actually reduces; it doesn't change the amount of data you have at all, so your argument about not keeping data in memory is fallacious

here's another thought: why don't you hash the result? you want to compare it to other items, right? (to see if they are the same)
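One minimal sketch of the hashing idea (column 'a' is taken from the repro above; any hashable values work):

# Collapse each group's values for column 'a' into a single hashable
# scalar, so groups can be compared for equality without keeping lists.
hashes = df.groupby(['gk', 'gk2'])['a'].agg(lambda x: hash(tuple(x)))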

@frlnx
Author

frlnx commented Jul 22, 2013

I want to compare, yes, but I need to know whether the median KPIs of the group are greater or smaller than the same KPIs of each entry that makes up the group. It is not a question of equal or not.

The original dataframe contains a lot more than just the columns I keep with the tolist "reduction". I could delete the other columns, but the columns coming out of our API keep changing independently of my work, so a whitelist approach is really the only way. I can of course build a whitelist in other ways, but this is a very simple way of getting what I need.
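A sketch of that row-versus-peer-median comparison using transform (not the reporter's actual code; the KPI columns 'a'..'e' come from the repro above):

# transform broadcasts each group's median back onto every row, so each
# entry can be compared with its peers without materializing any lists.
kpis = ['a', 'b', 'c', 'd', 'e']
group_medians = df.groupby(['gk', 'gk2'])[kpis].transform('median')
above_peers = df[kpis] > group_medians   # boolean frame, row vs. its group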

@jreback
Contributor

jreback commented Jul 22, 2013

ok...if it works for you

why can't you do the actual comparison in the agg/applied function itself?

import numpy as np

def f(x):
    # other_data is a placeholder for whatever you are comparing against
    if (x.median() >= other_data).all():
        return x.median()
    return np.nan

df.groupby(keys).apply(f)

@frlnx
Author

frlnx commented Jul 22, 2013

That may be a nice way of doing it.
I still need the lists output to different files for validation, though.

@jreback
Contributor

jreback commented Jul 22, 2013

np... just trying to help... will keep this open in any event... thanks for the report

@frlnx
Author

frlnx commented Jul 22, 2013

I will be able to proceed with a mix of the workarounds you provided, thanks!

@jtratner
Contributor

@frlnx you'd think that removing the check for list would be just fine, but there are some quirks with how groupby is handled in pandas, so it's not straightforward to just remove it. :-/ You can see #3805 for a start on this.

@florianverhein

Just ran into this too. Rather than being forced to collect the unique values in a column into a string, I would have liked them to be collected into a list (e.g. for use later).

reduce in Python is really a fold left, and there's absolutely nothing wrong with a fold left returning a collection: e.g. the identity reduction for a list s is reduce(lambda x, y: x + [y], s, []), which fails in pd.agg.

'reducing' the quantity of data is not a requirement of reduce, functionally speaking (that argument also fails considering that str works in pd.agg).
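The fold-left point is easy to verify in plain Python, independent of pandas:

from functools import reduce

s = [1, 2, 3]
# The identity "reduction": folds the list into a new list. A perfectly
# valid fold left, yet the same function fails inside pandas' agg.
identity = reduce(lambda x, y: x + [y], s, [])   # [1, 2, 3]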

@nbateshaus

I run into this working with shredded (EAV) data. Often, there are multiple values for a single column. Take data representing authors on articles:

recordId,attributeName,value
1,title,"Map-Reduce Extensions and Recursive Queries"
1,author,"Foto N. Afrati"
1,author,"Vinayak Borkar"
1,author,"Michael Carey"
1,author,"Neoklis Polyzotis"
1,author,"Jeffrey D. Ullman"

This data is coming from a 3rd party, and I'd like to fix it so I can work with it. I want to create a 3-level index of ['recordId', 'attributeName', 'instance'] with a single column for the values. The way to create 'instance' is to:

  1. eav.set_index(['recordId', 'attributeName'], inplace=True) to promote recordId and attributeName to indexes; then
  2. s=eav.groupby(level=['recordId', 'attributeName']).agg(pd.Series.tolist) to aggregate into a Series of lists of values; then
  3. use t=pd.DataFrame(s.tolist(), s.index) to split into columns with numeric labels; then
  4. use t.stack() to create the 3rd index level (sketched below).
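A sketch of the full recipe on the sample data above (on pandas versions affected by this issue, step 2 is where "Function does not reduce" is raised):

import io
import pandas as pd

eav = pd.read_csv(io.StringIO("""recordId,attributeName,value
1,title,"Map-Reduce Extensions and Recursive Queries"
1,author,"Foto N. Afrati"
1,author,"Vinayak Borkar"
"""))

eav.set_index(['recordId', 'attributeName'], inplace=True)                  # step 1
s = eav.groupby(level=['recordId', 'attributeName']).agg(pd.Series.tolist)  # step 2
t = pd.DataFrame(s['value'].tolist(), index=s.index)                        # step 3
result = t.stack()                                                          # step 4: adds the 'instance' level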

Sadly, this doesn't work because _aggregate_series_pure_python(self, obj, func) has an explicit exclusion for the case when the aggregator returns a list for the first group.

Many other data processing platforms have recognized the utility of aggregating into lists: PostgreSQL has array_agg; Spark has collect_list(); MySQL has group_concat; etc.

The exclusion makes even less sense when you consider that many methods, such as Series.str.split(), create columns of lists, yet the exclusion in _aggregate_series_pure_python(self, obj, func) prevents creation of list values when grouping.
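For instance, Series.str.split happily produces list values outside of groupby:

import pandas as pd

authors = pd.Series(["Afrati;Borkar", "Ullman"])
split = authors.str.split(';')
# 0    [Afrati, Borkar]
# 1            [Ullman]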

@TomAugspurger
Contributor

@nbateshaus what's your desired output there? Is it

In [12]: df['instance'] = df.groupby(['recordId', 'attributeName']).value.cumcount()

In [13]: df.set_index(['recordId', 'attributeName', 'instance'])
Out[13]:
                                                                       value
recordId attributeName instance
1        title         0         Map-Reduce Extensions and Recursive Queries
         author        0                                      Foto N. Afrati
                       1                                      Vinayak Borkar
                       2                                       Michael Carey
                       3                                   Neoklis Polyzotis
                       4                                   Jeffrey D. Ullman

I know that nested data is important, but the building-blocks aren't in place for pandas to support it well at the moment.

@nbateshaus

Yes, that's the output I want. Assuming I sort by ['e','a'] first, this is probably faster, too. It is still very convenient to be able to create list values.

Looking at the history of this restriction, it looks like it was accidentally introduced by a transcription error in f3c0a08 - at the time, it was an assertion, and it went from

assert(not (isinstance(res, list) and len(res) == len(self.dummy)))

where, as far as I can tell, dummy is uninitialized, to

assert(not isinstance(res, list))

The original assertion was added without comment in 71e9046, in response to #612 "Pure python multi-key groupby can't handle non-numeric results". Which reveals another oddity: groupby().agg(pd.Series.tolist) works fine for single-key groupings; it only fails for multi-key groupings.

>>> eav.groupby(['attributeName']).agg(pd.Series.tolist)
                      recordId  \
attributeName                    
author         [1, 1, 1, 1, 1]   
title                      [1]   

                                                           value  
attributeName                                                     
author         [Foto N. Afrati, Vinayak Borkar, Michael Carey...  
title              [Map-Reduce Extensions and Recursive Queries]  
>>> eav.groupby(['recordId', 'attributeName']).agg(pd.Series.tolist)
Traceback (most recent call last):
  File "/Users/nik/anaconda/envs/python3/lib/python3.5/site-packages/pandas/core/groupby.py", line 1863, in agg_series
    return self._aggregate_series_fast(obj, func)
  File "/Users/nik/anaconda/envs/python3/lib/python3.5/site-packages/pandas/core/groupby.py", line 1868, in _aggregate_series_fast
    func = self._is_builtin_func(func)
AttributeError: 'BaseGrouper' object has no attribute '_is_builtin_func'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/nik/anaconda/envs/python3/lib/python3.5/site-packages/pandas/core/groupby.py", line 3597, in aggregate
    return super(DataFrameGroupBy, self).aggregate(arg, *args, **kwargs)
  File "/Users/nik/anaconda/envs/python3/lib/python3.5/site-packages/pandas/core/groupby.py", line 3122, in aggregate
    return self._python_agg_general(arg, *args, **kwargs)
  File "/Users/nik/anaconda/envs/python3/lib/python3.5/site-packages/pandas/core/groupby.py", line 777, in _python_agg_general
    result, counts = self.grouper.agg_series(obj, f)
  File "/Users/nik/anaconda/envs/python3/lib/python3.5/site-packages/pandas/core/groupby.py", line 1865, in agg_series
    return self._aggregate_series_pure_python(obj, func)
  File "/Users/nik/anaconda/envs/python3/lib/python3.5/site-packages/pandas/core/groupby.py", line 1899, in _aggregate_series_pure_python
    raise ValueError('Function does not reduce')
ValueError: Function does not reduce

@toobaz
Member

toobaz commented May 18, 2018

The example by @frlnx now seems to work fine, and @bobhaffner said that #18354 "might" close this. So I'm assuming this is solved. @nbateshaus please feel free to provide a reproducible example if this is still an issue.

@toobaz toobaz closed this as completed May 18, 2018