Slice by column then by index fails if columns/rows are repeated. #6121

dbew · 2014-01-27T12:10:08Z

We've found a problem where repeating a row and a column in a DataFrame fails with a "Cannot create BlockManager._ref_locs" assertion error.

The dataframe is very simple:

import numpy as np
import pandas as pd
df = pd.DataFrame(np.arange(25.).reshape(5, 5),
                  index=['a', 'b', 'c', 'd', 'e'],
                  columns=['a', 'b', 'c', 'd', 'e'])

And we pull the data out like this:

z = df[['a', 'c', 'a']]
z.ix[['a', 'c', 'a']]
Traceback (most recent call last):
  File "/users/is/dbew/pyenvs/timeseries/lib/python2.7/site-packages/ipython-1.1.0_1_ahl1-py2.7.egg/IPython/core/interactiveshell.py", line 2830, in run_code
    exec code_obj in self.user_global_ns, self.user_ns
  File "<ipython-input-87-3bdc0aacc4b5>", line 1, in <module>
    z.ix[['a', 'c', 'a']]
  File "/users/is/dbew/pyenvs/timeseries/lib/python2.7/site-packages/pandas-0.13.0_292_g4dcecb0-py2.7-linux-x86_64.egg/pandas/core/indexing.py", line 56, in __getitem__
    return self._getitem_axis(key, axis=0)
  File "/users/is/dbew/pyenvs/timeseries/lib/python2.7/site-packages/pandas-0.13.0_292_g4dcecb0-py2.7-linux-x86_64.egg/pandas/core/indexing.py", line 744, in _getitem_axis
    return self._getitem_iterable(key, axis=axis)
  File "/users/is/dbew/pyenvs/timeseries/lib/python2.7/site-packages/pandas-0.13.0_292_g4dcecb0-py2.7-linux-x86_64.egg/pandas/core/indexing.py", line 816, in _getitem_iterable
    convert=False)
  File "/users/is/dbew/pyenvs/timeseries/lib/python2.7/site-packages/pandas-0.13.0_292_g4dcecb0-py2.7-linux-x86_64.egg/pandas/core/generic.py", line 1164, in take
    new_data = self._data.take(indices, axis=baxis)
  File "/users/is/dbew/pyenvs/timeseries/lib/python2.7/site-packages/pandas-0.13.0_292_g4dcecb0-py2.7-linux-x86_64.egg/pandas/core/internals.py", line 3366, in take
    ref_items=new_axes[0], axis=axis)
  File "/users/is/dbew/pyenvs/timeseries/lib/python2.7/site-packages/pandas-0.13.0_292_g4dcecb0-py2.7-linux-x86_64.egg/pandas/core/internals.py", line 2337, in apply
    do_integrity_check=do_integrity_check)
  File "/users/is/dbew/pyenvs/timeseries/lib/python2.7/site-packages/pandas-0.13.0_292_g4dcecb0-py2.7-linux-x86_64.egg/pandas/core/internals.py", line 1990, in __init__
    self._set_ref_locs(do_refs=True)
  File "/users/is/dbew/pyenvs/timeseries/lib/python2.7/site-packages/pandas-0.13.0_292_g4dcecb0-py2.7-linux-x86_64.egg/pandas/core/internals.py", line 2130, in _set_ref_locs
    'have _ref_locs set' % (block, labels))
AssertionError: Cannot create BlockManager._ref_locs because block [FloatBlock: [a], 1 x 3, dtype: float64] with duplicate items [Index([u'a', u'c', u'a'], dtype='object')] does not have _ref_locs set

If instead we take a copy of the intermediate step, then it works:

z = df[['a', 'c', 'a']].copy()
z.ix[['a', 'c', 'a']]
Out[89]: 
    a   c   a
a   0   2   0
c  10  12  10
a   0   2   0

[3 rows x 3 columns]

This means that if you have several functions that each do part of the data processing, you need to know an object's history to know whether an operation on it will work. I think .ix should always succeed on a DataFrame or Series, regardless of how it was constructed.

(I've read the discussion at #6056 about chained operations, but they can't be avoided if you have a pipeline of small steps instead of one big step.)
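The expectation above (the result should not depend on whether the intermediate frame was copied) can be written as a small check. This is only a sketch, not the test from the pandas suite: it assumes a recent pandas where `.ix` no longer exists (it was removed in pandas 1.0), so it uses `.loc` as the label-based equivalent, and the names `labels`, `chained` and `copied` are made up for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(25.).reshape(5, 5),
                  index=['a', 'b', 'c', 'd', 'e'],
                  columns=['a', 'b', 'c', 'd', 'e'])

# Hypothetical name: the repeated labels used for both columns and rows.
labels = ['a', 'c', 'a']

# Path 1: select repeated columns, then repeated rows, no intermediate copy.
chained = df[labels].loc[labels]

# Path 2: same selection, but copy the intermediate frame first
# (the workaround described in this report).
copied = df[labels].copy().loc[labels]

# Both paths should produce the same 3 x 3 frame.
pd.testing.assert_frame_equal(chained, copied)
print(chained)
```

If the two paths ever diverge, `assert_frame_equal` raises, which is the kind of history-dependence this report is about.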

This wasn't an issue in 0.11.0 but is failing in 0.13.0 and the latest master. Here's the output of installed versions when running on the master:

commit: None
python: 2.7.3.final.0
python-bits: 64
OS: Linux
OS-release: 2.6.18-308.el5
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_GB

pandas: 0.13.0-292-g4dcecb0
Cython: 0.16
numpy: 1.7.1
scipy: 0.9.0
statsmodels: None
patsy: None
scikits.timeseries: None
dateutil: 1.5
pytz: None
bottleneck: 0.6.0
tables: 2.3.1-1
numexpr: 2.0.1
matplotlib: 1.1.1
openpyxl: None
xlrd: 0.8.0
xlwt: None
xlsxwriter: None
sqlalchemy: None
lxml: 2.3.6
bs4: None
html5lib: None
bq: None
apiclient: None

jreback · 2014-01-27T13:56:38Z

Was just a missing case in core.internals.Block.take

FYI, your example is clearer if you don't give the index and the columns the same names (it's just easier to see)

In [1]: df = DataFrame(np.arange(25.).reshape(5,5),
   ...: index=['a', 'b', 'c', 'd', 'e'],
   ...: columns=['A', 'B', 'C', 'D', 'E'])

In [2]: z = df[['A', 'C', 'A']]

In [3]: z.ix[['a', 'c', 'a']]
Out[3]: 
    A   C   A
a   0   2   0
c  10  12  10
a   0   2   0

[3 rows x 3 columns]
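For anyone rerunning this on a current pandas, here is a rough equivalent of the session above. It is a sketch under the assumption that `.loc` stands in for the removed `.ix`; the variable name `result` is not from the thread. It also shows that both axes of the result carry duplicate labels, which is the situation that triggered the original assertion.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(25.).reshape(5, 5),
                  index=['a', 'b', 'c', 'd', 'e'],
                  columns=['A', 'B', 'C', 'D', 'E'])

z = df[['A', 'C', 'A']]           # column label 'A' is selected twice
result = z.loc[['a', 'c', 'a']]   # row label 'a' is selected twice

# Both axes of the result now contain duplicate labels.
print(result.columns.is_unique)   # False
print(result.index.is_unique)     # False

# Selecting a duplicated column label returns a DataFrame, not a Series.
print(type(result['A']).__name__)  # DataFrame
```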

jreback · 2014-01-27T14:18:30Z

@dbew give it a try with master and let me know asap if you find anything else. I'm trying to release 0.13.1, but if I can get fixes in I will.

dbew · 2014-01-28T10:56:57Z

I've checked on master and it's working. I'm trying to report stuff I find as fast as possible, thanks for getting fixes out so quickly.

jreback · 2014-01-28T11:00:46Z

Great.
We are going to release 0.13.1, which is almost all bug/perf fixes, next week,

so if you have anything else, that would be great.

jreback mentioned this issue Jan 27, 2014

BUG: Bug in propogating _ref_locs during construction of a DataFrame with dups index/columns (GH6121) #6123

Merged

jreback closed this as completed in #6123 Jan 27, 2014
