BUG: ngroups and len(groups) do not equal when grouping with a list of Grouper and column label #26326

Closed
lixuwen1991 opened this issue May 9, 2019 · 5 comments · Fixed by #26374

@lixuwen1991

Code

import numpy as np
import pandas as pd

m_index = pd.MultiIndex.from_product([['Mona', 'Dennis'], ['good', 'great']], names=['name', 'desc'])
df = pd.DataFrame({'foo': [1, 2, 1, 2], 'bar': np.random.randn(4)}, index=m_index)
# Group by an index level (via pd.Grouper) together with a column label.
n1 = df.groupby([pd.Grouper(level='name'), 'foo']).ngroups
n2 = len(df.groupby([pd.Grouper(level='name'), 'foo']))
n3 = len(df.groupby([pd.Grouper(level='name'), 'foo']).groups)
print(n1, n2, n3)
# 4 2 2

Problem description

I grouped a DataFrame by a list containing a pandas.Grouper object and a column label. The number of groups (or group keys) should be 4, and ngroups does report 4 (n1), but the length of the GroupBy object (n2) and the length of its groups (n3) are both 2. Why? Thanks in advance for your answer.

commit: None
python: 3.7.3.final.0
python-bits: 64
OS: Darwin
OS-release: 18.5.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: zh_CN.UTF-8
LOCALE: zh_CN.UTF-8

@shantanu-gontia
Contributor

shantanu-gontia commented May 9, 2019

len(<groupByObject>) returns the number of groups, as does taking the length of the groups property (as in n3).

ngroups, on the other hand,

def ngroups(self):
    return len(self.result_index)

returns the length of result_index, which is a MultiIndex object. For the above example, result_index is the following MultiIndex:

MultiIndex(levels=[['Dennis', 'Mona'], [1, 2]],
           codes=[[0, 0, 1, 1], [0, 1, 0, 1]],
           names=['name', 'foo'])

As of pandas 0.24.2, the MultiIndex __len__ method is implemented as

def __len__(self):
    return len(self.codes[0])

so it returns the length of the first array of codes in the MultiIndex. Hence, ngroups outputs 4.
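As a small check of the point above, the following sketch rebuilds a MultiIndex with the same levels and names as the result_index shown earlier and compares its length with the length of its first codes array. The variable names are illustrative, and it assumes a pandas version that exposes MultiIndex.codes (0.24 or later).

```python
import pandas as pd

# Same levels and names as the result_index printed above.
result_index = pd.MultiIndex.from_product(
    [["Dennis", "Mona"], [1, 2]], names=["name", "foo"]
)

n_entries = len(result_index)         # one entry per (name, foo) combination
n_codes = len(result_index.codes[0])  # length of the first codes array
print(n_entries, n_codes)
```

Both numbers count the entries of the index, which is why ngroups reports 4 here.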


The groups that are formed in the example's groupby operation are

{('Dennis', 1): MultiIndex(levels=[['Dennis', 'Mona'], ['good', 'great']],
            codes=[[1], [0]],
            names=['name', 'desc']),
 ('Mona', 2): MultiIndex(levels=[['Dennis', 'Mona'], ['good', 'great']],
            codes=[[1], [1]],
            names=['name', 'desc'])}

It appears that the groups are not being formed correctly when a pandas.Grouper object is passed.


Not sure if this is appropriate behaviour though.

@WillAyd
Member

WillAyd commented May 10, 2019

This is pretty tricky. I'm surprised to see a difference between the below as well:

>>> df.groupby([pd.Grouper(level=0), 'foo']).grouper.groups.keys()
dict_keys([('Dennis', 1), ('Mona', 2)])

>>> df.reset_index(level=1).groupby(['name', 'foo']).grouper.groups.keys()
dict_keys([('Dennis', 1), ('Dennis', 2), ('Mona', 1), ('Mona', 2)])

>>> df.groupby([pd.Grouper(['Mona', 'Mona', 'Dennis', 'Dennis']), 'foo']).grouper.groups.keys()
dict_keys([('Dennis', 1), ('Dennis', 2), ('Mona', 1), ('Mona', 2)])

I think the problem starts here:

to_groupby = zip(*(ping.grouper for ping in self.groupings))

Where the groups are getting constructed from the individual groupings (here the first grouping is the pd.Grouper(level='name') and the second grouping is the Series containing the values for foo).

The problem is that, for the former, iteration only goes over the unique values (here 'Dennis' and 'Mona'), which is why your length calculations are getting truncated to 2:

def __iter__(self):
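The truncation can be illustrated in plain Python, without pandas internals. In this sketch the two lists are stand-ins (assumptions for illustration): the first for the unique values that iterating the Grouper-based grouping yields, the second for the per-row values of foo.

```python
# Stand-in for iterating the first grouping: only the unique values.
unique_names = ["Dennis", "Mona"]
# Stand-in for the second grouping: one value per row of the frame.
foo_values = [1, 2, 1, 2]

# zip stops at the shortest input, so only two key tuples are produced.
to_groupby = list(zip(unique_names, foo_values))
print(to_groupby)  # [('Dennis', 1), ('Mona', 2)]
```

The two tuples match the keys shown in the first groupby output above.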

I'm not sure of the best resolution off the top of my head, but if you'd like to investigate further and try your hand at a PR, it would certainly be welcome!

@WillAyd WillAyd added this to the Contributions Welcome milestone May 10, 2019
@shantanu-gontia
Contributor

shantanu-gontia commented May 10, 2019

When a Grouping is created from a Grouper object, the resultant Grouping holds the BaseGrouper corresponding to the Grouper object that was passed. In OP's example, the first key is a Grouper object created with the level='name' parameter and the second is simply the column label foo. The DataFrameGroupBy that is finally constructed has a BaseGrouper object with the following Grouping objects:

>>> dfGroupBy = df.groupby([pd.Grouper(level='name'), 'foo'])
>>> dfGroupBy.grouper.groupings
[Grouping(name), Grouping(foo)]

The Grouping(name) object is the one generated from the Grouper object. The grouper attribute of this Grouping(name) object is itself a BaseGrouper:

>>> dfGroupBy.grouper.groupings[0].grouper
<pandas.core.groupby.groupby.BaseGrouper>

Meanwhile, if the Grouping is constructed simply from a column label like foo

>>> dfGroupBy.grouper.groupings[1].grouper
array([1, 2, 1, 2], dtype=int64)

The grouper is a simple array.

(Is this behavior intentional?)


Also as @WillAyd mentioned,

to_groupby = lzip(*(ping.grouper for ping in self.groupings))

When one of the groupers is a BaseGrouper, this yields only its indices rather than an array of the appropriate per-row values.


So, a possible fix might be either to handle the special case of a BaseGrouper while constructing the variable to_groupby, or to update the _get_grouper() code so that a Grouping never has a BaseGrouper object as a member.
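The first option can be sketched with a toy model. Everything here is hypothetical and is not pandas code: ToyNestedGrouper stands in for a grouper that iterates over unique keys only, and per_row_values is an invented helper that special-cases it before zipping.

```python
class ToyNestedGrouper:
    """Hypothetical stand-in: iterating yields unique keys, not per-row values."""

    def __init__(self, row_values):
        self.row_values = list(row_values)

    def __iter__(self):
        return iter(sorted(set(self.row_values)))


def per_row_values(grouper):
    # Special case: unwrap the nested grouper to its per-row values.
    if isinstance(grouper, ToyNestedGrouper):
        return grouper.row_values
    return list(grouper)


groupers = [ToyNestedGrouper(["Mona", "Mona", "Dennis", "Dennis"]), [1, 2, 1, 2]]

naive = list(zip(*groupers))                               # truncated to 2 tuples
fixed = list(zip(*(per_row_values(g) for g in groupers)))  # one tuple per row
keys = sorted(set(fixed))
print(naive)
print(keys)
```

With the special case in place, the toy model recovers all four (name, foo) keys; whether the real fix should take this shape is a separate question.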

@shantanu-gontia
Contributor

shantanu-gontia commented May 10, 2019

Adding to the above discussion,

>>> dfGroupBy.grouper.groupings[0].grouper.groupings[0].grouper
Index(['Mona', 'Mona', 'Dennis', 'Dennis'], dtype='object', name='name')

@lixuwen1991
Author

Thanks for confirming the bug. I will group by a combination of labels instead.
As a beginner with pandas and even Python, I am trying to understand the code and logic you mentioned above.😂
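A minimal sketch of that workaround, assuming the same frame as in the original report: resetting the index turns name into an ordinary column, so both keys are plain labels, as in the reset_index example earlier in the thread. The variable names gb and counts are illustrative.

```python
import numpy as np
import pandas as pd

m_index = pd.MultiIndex.from_product(
    [["Mona", "Dennis"], ["good", "great"]], names=["name", "desc"]
)
df = pd.DataFrame({"foo": [1, 2, 1, 2], "bar": np.random.randn(4)}, index=m_index)

# Move the index levels into columns, then group by plain column labels.
gb = df.reset_index().groupby(["name", "foo"])

# The three ways of counting groups should now agree.
counts = (gb.ngroups, len(gb), len(gb.groups))
print(counts)
```

The trade-off is that the original MultiIndex is no longer the index of the grouped frame.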
