BUG: ngroups and len(groups) do not equal when grouping with a list of Grouper and column label #26326

lixuwen1991 · 2019-05-09T10:56:16Z

Code

import numpy as np
import pandas as pd

m_index = pd.MultiIndex.from_product([['Mona', 'Dennis'], ['good', 'great']], names=['name', 'desc'])
df = pd.DataFrame({'foo': [1, 2, 1, 2], 'bar': np.random.randn(4)}, index=m_index)
n1 = df.groupby([pd.Grouper(level='name'), 'foo']).ngroups
n2 = len(df.groupby([pd.Grouper(level='name'), 'foo']))
n3 = len(df.groupby([pd.Grouper(level='name'), 'foo']).groups)
print(n1, n2, n3)
# 4 2 2

Problem description

I grouped a DataFrame by a list containing a pandas.Grouper object and a column label. The number of groups (group keys) should be 4, which is what ngroups reports, but the length of the GroupBy object (n2) and the length of its groups (n3) are both 2. Why? Thanks in advance for your answer.

commit: None
python: 3.7.3.final.0
python-bits: 64
OS: Darwin
OS-release: 18.5.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: zh_CN.UTF-8
LOCALE: zh_CN.UTF-8

shantanu-gontia · 2019-05-09T20:49:58Z

len(<groupByObject>) returns the number of groups, as does calling len() on the groups property (as in n3).

ngroups, on the other hand,

def ngroups(self):
    return len(self.result_index)

returns the length of result_index, which is a MultiIndex object. For the above example, result_index is the following MultiIndex:

MultiIndex(levels=[['Dennis', 'Mona'], [1, 2]],
           codes=[[0, 0, 1, 1], [0, 1, 0, 1]],
           names=['name', 'foo'])

As of pandas 0.24.2, the MultiIndex __len__ method is implemented as

def __len__(self):
    return len(self.codes[0])

so it returns the length of the first array of codes in the MultiIndex, that is, the number of entries. Hence, ngroups outputs 4.
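To make the point about __len__ concrete, here is a minimal sketch, assuming pandas 0.24 or later (where MultiIndex.codes exists). It rebuilds an index equivalent to the result_index shown above; the variable names are only illustrative.

```python
import pandas as pd

# Rebuild a MultiIndex equivalent to the result_index printed above.
result_index = pd.MultiIndex.from_product(
    [["Dennis", "Mona"], [1, 2]], names=["name", "foo"]
)

# len() counts entries (one per code), not distinct values of a level.
n_entries = len(result_index)
n_codes = len(result_index.codes[0])
n_level_values = len(result_index.levels[0])

print(n_entries, n_codes, n_level_values)
```

The first two numbers agree with each other and with ngroups in the report, while the first level holds only two distinct names.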

The groups that are formed in the example's groupby operation are

{('Dennis', 1): MultiIndex(levels=[['Dennis', 'Mona'], ['good', 'great']],
            codes=[[1], [0]],
            names=['name', 'desc']),
 ('Mona', 2): MultiIndex(levels=[['Dennis', 'Mona'], ['good', 'great']],
            codes=[[1], [1]],
            names=['name', 'desc'])}

It appears that the groups are not being formed correctly when a pandas.Grouper object is passed.

Not sure if this is the intended behaviour, though.

WillAyd · 2019-05-10T03:32:09Z

This is pretty tricky. I'm surprised to see a difference between the below as well:

>>> df.groupby([pd.Grouper(level=0), 'foo']).grouper.groups.keys()
dict_keys([('Dennis', 1), ('Mona', 2)])

>>> df.reset_index(level=1).groupby(['name', 'foo']).grouper.groups.keys()
dict_keys([('Dennis', 1), ('Dennis', 2), ('Mona', 1), ('Mona', 2)])

>>> df.groupby([pd.Grouper(['Mona', 'Mona', 'Dennis', 'Dennis']), 'foo']).grouper.groups.keys()
dict_keys([('Dennis', 1), ('Dennis', 2), ('Mona', 1), ('Mona', 2)])

I think the problem starts here:

pandas/pandas/core/groupby/ops.py

Line 261 in ee6b131

to_groupby = zip(*(ping.grouper for ping in self.groupings))

Where the groups are getting constructed from the individual groupings (here the first grouping is the pd.Grouper(level='name') and the second grouping is the Series containing the values for foo).

The problem is that, for the former, iteration only goes over the unique values (here 'Dennis' and 'Mona'), which is why your length calculations are getting truncated to 2:

pandas/pandas/core/groupby/grouper.py

Line 348 in ee6b131

def __iter__(self):

Not sure of the best resolution off the top of my head, but if you'd like to investigate further and try your hand at a PR, that would certainly be welcome!
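The truncation described in this comment can be reproduced in plain Python, since zip stops at its shortest input. This is only a sketch: the two lists are stand-ins for what iterating each grouping yields, not the actual pandas objects.

```python
# Stand-in for iterating the first grouping: only the unique names come out.
unique_names = ["Dennis", "Mona"]
# Stand-in for the second grouping: one value of foo per row.
foo_values = [1, 2, 1, 2]

# zip stops as soon as the shorter input is exhausted.
to_groupby = list(zip(unique_names, foo_values))

print(to_groupby)
```

The two resulting tuples match the two keys reported for the pd.Grouper(level=0) case above.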

shantanu-gontia · 2019-05-10T07:35:44Z

When a Grouping is created using a Grouper object, the resultant Grouping consists of a list containing one object: the BaseGrouper corresponding to the Grouper object initially passed. In OP's example, the first key is a Grouper object passed with the level='name' parameter and the second is simply the column label foo. The DataFrameGroupBy finally constructed has a BaseGrouper object with the following Grouping objects:

>>> dfGroupBy = df.groupby([pd.Grouper(level='name'), 'foo'])
>>> dfGroupBy.grouper.groupings
[Grouping(name), Grouping(foo)]

The Grouping(name) object is the one generated from the Grouper object. The grouper attribute of this Grouping(name) object is itself a BaseGrouper:

>>> dfGroupBy.grouper.groupings[0].grouper
<pandas.core.groupby.groupby.BaseGrouper>

Meanwhile, if the Grouping is constructed simply from a column label like foo

>>> dfGroupBy.grouper.groupings[1].grouper
array([1, 2, 1, 2], dtype=int64)

The grouper is a simple array.

(Is this behavior intentional?)

Also as @WillAyd mentioned,

to_groupby = lzip(*(ping.grouper for ping in self.groupings))

when working with a BaseGrouper, this yields only the group indices rather than an array of the appropriate per-row values.

So a possible fix might be either to handle the special case of a BaseGrouper while constructing the variable to_groupby, or to update the _get_grouper() code so that a Grouping never has a BaseGrouper object as a member.
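A rough, pure-Python sketch of the first option (special-casing a nested grouper while building to_groupby). NestedGrouper and per_row_values are hypothetical stand-ins invented for illustration, not pandas internals, and the change that actually went into pandas may look quite different.

```python
class NestedGrouper:
    """Hypothetical stand-in for a BaseGrouper held inside a Grouping."""

    def __init__(self, row_labels):
        self.row_labels = list(row_labels)

    def __iter__(self):
        # Mimics the behaviour discussed above: only unique keys are yielded.
        return iter(sorted(set(self.row_labels)))


def per_row_values(grouper):
    """Return one value per row, unwrapping a nested grouper if needed."""
    if isinstance(grouper, NestedGrouper):
        return grouper.row_labels
    return list(grouper)


groupers = [NestedGrouper(["Mona", "Mona", "Dennis", "Dennis"]), [1, 2, 1, 2]]

# Expand every grouper to per-row values before zipping.
fixed = list(zip(*(per_row_values(g) for g in groupers)))
fixed_keys = sorted(set(fixed))

print(fixed_keys)
```

With the per-row expansion in place, all four (name, foo) combinations survive the zip instead of only two.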

shantanu-gontia · 2019-05-10T09:54:24Z

Adding to the above discussion,

>>> dfGroupBy.grouper.groupings[0].grouper.groupings[0].grouper
Index(['Mona', 'Mona', 'Dennis', 'Dennis'], dtype='object', name='name')

lixuwen1991 · 2019-05-10T14:20:46Z

Thanks for confirming the bug. I will group by a combination of labels instead.
As a beginner with pandas and even Python, I am trying to understand the code and logic you mentioned above.😂
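For readers on an affected version, the label-based workaround could look roughly like the sketch below. It assumes it is acceptable to move the index levels into columns with reset_index first; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

m_index = pd.MultiIndex.from_product(
    [["Mona", "Dennis"], ["good", "great"]], names=["name", "desc"]
)
df = pd.DataFrame({"foo": [1, 2, 1, 2], "bar": np.random.randn(4)}, index=m_index)

# Turn the index levels into ordinary columns, then group by labels only.
flat = df.reset_index()
grouped = flat.groupby(["name", "foo"])

n_groups = grouped.ngroups
n_len = len(grouped)
n_keys = len(grouped.groups)
keys = sorted(grouped.groups.keys())

print(n_groups, n_len, n_keys)
```

Here all three counts agree, because every grouping key is a plain per-row column.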

WillAyd added Bug Groupby labels May 10, 2019

WillAyd added this to the Contributions Welcome milestone May 10, 2019

shantanu-gontia mentioned this issue May 13, 2019

BUG: ngroups and len(groups) do not equal when grouping with a list of Grouper and column label (GH26326) #26374

Merged

4 tasks

jreback modified the milestones: Contributions Welcome, 0.25.0 May 19, 2019

jreback closed this as completed in #26374 May 20, 2019

falcaopetri mentioned this issue Mar 29, 2020

BUG: wrong df.groupby().groups when grouping with [Grouper(freq=), ...] #33132

Closed
