Slice by column then by index fails if columns/rows are repeated. #6121

Closed
dbew opened this issue Jan 27, 2014 · 4 comments · Fixed by #6123
Labels
Bug Indexing Related to indexing on series/frames, not to indexes themselves
Milestone

Comments

@dbew
Contributor

dbew commented Jan 27, 2014

We've found a problem where selecting repeated columns from a DataFrame and then selecting repeated rows from the result fails with a "Cannot create BlockManager._ref_locs" AssertionError.

The dataframe is very simple:

import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(25.).reshape(5, 5),
                  index=['a', 'b', 'c', 'd', 'e'],
                  columns=['a', 'b', 'c', 'd', 'e'])

And we pull the data out like this:

z = df[['a', 'c', 'a']]
z.ix[['a', 'c', 'a']]
Traceback (most recent call last):
  File "/users/is/dbew/pyenvs/timeseries/lib/python2.7/site-packages/ipython-1.1.0_1_ahl1-py2.7.egg/IPython/core/interactiveshell.py", line 2830, in run_code
    exec code_obj in self.user_global_ns, self.user_ns
  File "<ipython-input-87-3bdc0aacc4b5>", line 1, in <module>
    z.ix[['a', 'c', 'a']]
  File "/users/is/dbew/pyenvs/timeseries/lib/python2.7/site-packages/pandas-0.13.0_292_g4dcecb0-py2.7-linux-x86_64.egg/pandas/core/indexing.py", line 56, in __getitem__
    return self._getitem_axis(key, axis=0)
  File "/users/is/dbew/pyenvs/timeseries/lib/python2.7/site-packages/pandas-0.13.0_292_g4dcecb0-py2.7-linux-x86_64.egg/pandas/core/indexing.py", line 744, in _getitem_axis
    return self._getitem_iterable(key, axis=axis)
  File "/users/is/dbew/pyenvs/timeseries/lib/python2.7/site-packages/pandas-0.13.0_292_g4dcecb0-py2.7-linux-x86_64.egg/pandas/core/indexing.py", line 816, in _getitem_iterable
    convert=False)
  File "/users/is/dbew/pyenvs/timeseries/lib/python2.7/site-packages/pandas-0.13.0_292_g4dcecb0-py2.7-linux-x86_64.egg/pandas/core/generic.py", line 1164, in take
    new_data = self._data.take(indices, axis=baxis)
  File "/users/is/dbew/pyenvs/timeseries/lib/python2.7/site-packages/pandas-0.13.0_292_g4dcecb0-py2.7-linux-x86_64.egg/pandas/core/internals.py", line 3366, in take
    ref_items=new_axes[0], axis=axis)
  File "/users/is/dbew/pyenvs/timeseries/lib/python2.7/site-packages/pandas-0.13.0_292_g4dcecb0-py2.7-linux-x86_64.egg/pandas/core/internals.py", line 2337, in apply
    do_integrity_check=do_integrity_check)
  File "/users/is/dbew/pyenvs/timeseries/lib/python2.7/site-packages/pandas-0.13.0_292_g4dcecb0-py2.7-linux-x86_64.egg/pandas/core/internals.py", line 1990, in __init__
    self._set_ref_locs(do_refs=True)
  File "/users/is/dbew/pyenvs/timeseries/lib/python2.7/site-packages/pandas-0.13.0_292_g4dcecb0-py2.7-linux-x86_64.egg/pandas/core/internals.py", line 2130, in _set_ref_locs
    'have _ref_locs set' % (block, labels))
AssertionError: Cannot create BlockManager._ref_locs because block [FloatBlock: [a], 1 x 3, dtype: float64] with duplicate items [Index([u'a', u'c', u'a'], dtype='object')] does not have _ref_locs set

If instead we take a copy of the intermediate step, then it works:

z = df[['a', 'c', 'a']].copy()
z.ix[['a', 'c', 'a']]
Out[89]: 
    a   c   a
a   0   2   0
c  10  12  10
a   0   2   0

[3 rows x 3 columns]

This means that if you have several functions which each do a part of the data processing, you need to know the history of an object to know whether what you're doing will work. I think .ix should always succeed on a DataFrame or Series, regardless of how it was constructed.

(I've read the discussion at #6056 about chained operations - but it's not something you can avoid if you have a pipeline of small steps instead of one big step).
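To make the pipeline point concrete, here is a minimal sketch of two small steps chained together. The helper names `pick_columns` and `pick_rows` are hypothetical and not from any real codebase, and `.loc` stands in for `.ix` (which was later removed from pandas) so that the sketch runs on current versions. The expectation is that the second step behaves the same whether or not the first step's result was copied.

```python
import numpy as np
import pandas as pd


def pick_columns(frame, cols):
    # Hypothetical pipeline step: select (possibly repeated) columns.
    return frame[cols]


def pick_rows(frame, rows):
    # Hypothetical pipeline step: select (possibly repeated) rows by label.
    # .loc is used here in place of the .ix call from the report.
    return frame.loc[rows]


df = pd.DataFrame(np.arange(25.).reshape(5, 5),
                  index=['a', 'b', 'c', 'd', 'e'],
                  columns=['a', 'b', 'c', 'd', 'e'])

# The second step should not need to know whether the first step copied.
direct = pick_rows(pick_columns(df, ['a', 'c', 'a']), ['a', 'c', 'a'])
copied = pick_rows(pick_columns(df, ['a', 'c', 'a']).copy(), ['a', 'c', 'a'])
```

On a version with the fix, `direct` and `copied` should hold the same values.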

This wasn't an issue in 0.11.0 but is failing in 0.13.0 and the latest master. Here's the output of installed versions when running on the master:

commit: None
python: 2.7.3.final.0
python-bits: 64
OS: Linux
OS-release: 2.6.18-308.el5
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_GB

pandas: 0.13.0-292-g4dcecb0
Cython: 0.16
numpy: 1.7.1
scipy: 0.9.0
statsmodels: None
patsy: None
scikits.timeseries: None
dateutil: 1.5
pytz: None
bottleneck: 0.6.0
tables: 2.3.1-1
numexpr: 2.0.1
matplotlib: 1.1.1
openpyxl: None
xlrd: 0.8.0
xlwt: None
xlsxwriter: None
sqlalchemy: None
lxml: 2.3.6
bs4: None
html5lib: None
bq: None
apiclient: None

@jreback
Contributor

jreback commented Jan 27, 2014

Was just a missing case in core.internals.Block.take

FYI, your example is more clear if you don't name the index/columns the same (just easier to see)

In [1]: df = DataFrame(np.arange(25.).reshape(5,5),
   ...: index=['a', 'b', 'c', 'd', 'e'],
   ...: columns=['A', 'B', 'C', 'D', 'E'])

In [2]: z = df[['A', 'C', 'A']]

In [3]: z.ix[['a', 'c', 'a']]
Out[3]: 
    A   C   A
a   0   2   0
c  10  12  10
a   0   2   0

[3 rows x 3 columns]
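For anyone wanting to re-check this behaviour on a current pandas, the session above could be turned into a small regression-style check along these lines. This is only a sketch: it uses `.loc` rather than `.ix` (an assumption made so that it runs on versions where `.ix` no longer exists), and the `expected` frame is written out by hand from the output shown above.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(25.).reshape(5, 5),
                  index=['a', 'b', 'c', 'd', 'e'],
                  columns=['A', 'B', 'C', 'D', 'E'])

# Repeated columns first, then repeated rows, with no intermediate copy.
z = df[['A', 'C', 'A']]
result = z.loc[['a', 'c', 'a']]

# Expected frame built by hand from the output in the comment above.
expected = pd.DataFrame([[0., 2., 0.],
                         [10., 12., 10.],
                         [0., 2., 0.]],
                        index=['a', 'c', 'a'],
                        columns=['A', 'C', 'A'])

# Raises AssertionError if the two frames differ.
pd.testing.assert_frame_equal(result, expected)
```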

@jreback
Contributor

jreback commented Jan 27, 2014

@dbew give it a try with master and let me know anything else you find ASAP. I'm trying to release 0.13.1, but if I can get fixes in I will.

@dbew
Contributor Author

dbew commented Jan 28, 2014

I've checked on master and it's working. I'm trying to report stuff I find as fast as possible, thanks for getting fixes out so quickly.

@jreback
Contributor

jreback commented Jan 28, 2014

Great.
We are going to release 0.13.1, which is almost all bug/perf fixes, next week.

So if you have anything else, that would be great.
