API/ERR: allow iterators in df.set_index & improve errors #24984

Merged
merged 36 commits into from
Feb 24, 2019
3e01681
API: re-enable custom label types in set_index
h-vetinari Jan 28, 2019
1b71e68
Fix doc pattern?
h-vetinari Jan 28, 2019
caeb125
Review (TomAugspurger)
h-vetinari Jan 29, 2019
8bd5340
Review (jreback)
h-vetinari Jan 29, 2019
3c8b69a
Merge remote-tracking branch 'upstream/master' into set_index_custom
h-vetinari Jan 29, 2019
cdfd86a
Merge remote-tracking branch 'upstream/master' into set_index_custom
h-vetinari Jan 30, 2019
d76ecfb
Review (jreback & jorisvandenbossche)
h-vetinari Jan 30, 2019
d2ffb81
Revert last two commits
h-vetinari Jan 30, 2019
5863678
Review (jorisvandenbossche)
h-vetinari Jan 31, 2019
0a7d783
Fix hashable listlikes (review jorisvandenbossche)
h-vetinari Jan 31, 2019
087d4f1
Merge remote-tracking branch 'upstream/master' into set_index_custom
h-vetinari Jan 31, 2019
794f61d
Stabilize repr of frozenset
h-vetinari Feb 1, 2019
0761633
Merge remote-tracking branch 'upstream/master' into set_index_custom
h-vetinari Feb 1, 2019
c58e8b6
Review (WillAyd)
h-vetinari Feb 1, 2019
29fa8c0
Unambiguous KeyError message
h-vetinari Feb 1, 2019
b5c8fa8
Merge remote-tracking branch 'upstream/master' into set_index_custom
h-vetinari Feb 3, 2019
5590433
Remove redundant whatsnew
h-vetinari Feb 3, 2019
7767ff7
Merge remote-tracking branch 'upstream/master' into set_index_custom
h-vetinari Feb 7, 2019
37c12d0
Merge remote-tracking branch 'upstream/master' into set_index_custom
h-vetinari Feb 9, 2019
2c4eaea
Review (jorisvandenbossche)
h-vetinari Feb 9, 2019
6c78816
Merge remote-tracking branch 'upstream/master' into set_index_custom
h-vetinari Feb 10, 2019
2ccd9a9
Merge remote-tracking branch 'upstream/master' into set_index_custom
h-vetinari Feb 14, 2019
b03c43b
Merge remote-tracking branch 'upstream/master' into set_index_custom
h-vetinari Feb 17, 2019
ea10359
Review (jreback)
h-vetinari Feb 17, 2019
ca17895
Retrigger after connectivity issues
h-vetinari Feb 17, 2019
a401eea
Merge remote-tracking branch 'upstream/master' into set_index_custom
h-vetinari Feb 17, 2019
f4deacc
Review (jorisvandenbossche)
h-vetinari Feb 18, 2019
125b0ca
Merge remote-tracking branch 'upstream/master' into set_index_custom
h-vetinari Feb 18, 2019
9bfcfde
move test for easier diff
h-vetinari Feb 18, 2019
6838613
Merge remote-tracking branch 'upstream/master' into set_index_custom
h-vetinari Feb 19, 2019
ca2ac60
Review (jreback)
h-vetinari Feb 19, 2019
87bd0a6
Add 'conda list' for azure/posix after activate 'pandas-dev'
h-vetinari Feb 19, 2019
759b369
Revert "Add 'conda list' for azure/posix after activate 'pandas-dev'"
h-vetinari Feb 20, 2019
40f1aaa
Merge remote-tracking branch 'upstream/master' into set_index_custom
h-vetinari Feb 20, 2019
ecc7d03
Reflect change in docstring
h-vetinari Feb 21, 2019
5f99b15
Merge remote-tracking branch 'upstream/master' into set_index_custom
h-vetinari Feb 24, 2019
Review (jorisvandenbossche)
h-vetinari committed Jan 31, 2019
commit 5863678045ce09f8f3ddca9321420a2080bbd8bd
17 changes: 13 additions & 4 deletions pandas/core/frame.py
@@ -4151,14 +4151,23 @@ def set_index(self, keys, drop=True, append=False, inplace=False,
                 # arrays are fine as long as they are one-dimensional
                 if getattr(col, 'ndim', 1) > 1:
Contributor

!= 1, is a 0-dim numpy scalar valid?
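
As a rough illustration of the difference between the two checks, the sketch below uses `types.SimpleNamespace` stand-ins that only carry an `ndim` attribute. These are an assumption for illustration, not real NumPy arrays, and the helper names `rejected_by_gt` and `rejected_by_ne` are hypothetical.

```python
from types import SimpleNamespace

# Stand-ins that only expose ``ndim`` (assumption for illustration; a real
# 0-dim NumPy array would report ndim == 0 in the same way).
zero_dim = SimpleNamespace(ndim=0)
one_dim = SimpleNamespace(ndim=1)
two_dim = SimpleNamespace(ndim=2)
plain_list = [1, 2, 3]  # no ``ndim`` attribute, so getattr falls back to 1


def rejected_by_gt(col):
    """Check as written in the diff: only ndim > 1 is rejected."""
    return getattr(col, 'ndim', 1) > 1


def rejected_by_ne(col):
    """Check suggested in the review: anything that is not 1-D is rejected."""
    return getattr(col, 'ndim', 1) != 1


cols = [zero_dim, one_dim, two_dim, plain_list]
gt_results = [rejected_by_gt(c) for c in cols]
ne_results = [rejected_by_ne(c) for c in cols]
print(gt_results)  # [False, False, True, False]
print(ne_results)  # [True, False, True, False]
```

With `> 1` the 0-dim stand-in slips through, while `!= 1` rejects it along with the 2-D case.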

Contributor Author

Done

                     raise ValueError(err_msg)
+            elif is_list_like(col, allow_sets=False):
+                # various iterators/generators are hashable, but should not
+                # raise a KeyError
+                tipo = type(col)
Contributor

just put type(col) directly rather than adding another line here

Contributor Author

Done.

+                raise ValueError(err_msg + ' Received column of '
+                                 'type {}'.format(tipo))
Contributor Author

@h-vetinari h-vetinari Jan 31, 2019

I added this extra branch to give a more sensible error for the iterator/generator cases. It's maybe worth noting that it would be easy to re-add the capability to consume list-likes (excluding tuples) here, because tuples now always enter the first branch. In any case, that would be something for 0.25, and not the regression fix here.
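
A minimal sketch of why iterators and generators need their own branch: they are hashable, so a plain membership test succeeds without a `TypeError` and simply reports the key as missing. The dict `columns` is a stand-in for the frame's column index (an assumption, not pandas' actual lookup machinery).

```python
from collections.abc import Hashable

# A plain dict stands in for the frame's column index (assumption).
columns = {'A': 0, 'B': 1}

iterator = iter(['A', 'B'])
generator = (name for name in ['A', 'B'])

# Both objects are hashable (by identity) ...
iterator_hashable = isinstance(iterator, Hashable)
generator_hashable = isinstance(generator, Hashable)

# ... so the membership test does not raise; it just says "not found",
# which would end up as a KeyError in a key-based code path.
iterator_found = iterator in columns
generator_found = generator in columns

print(iterator_hashable, generator_hashable)  # True True
print(iterator_found, generator_found)        # False False
```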

Member

I am not fully sure we should use is_list_like here either. You can have custom objects that are hashable, but also iterable (like a tuple is also the combination). And once it is iterable, is_list_like will give True.
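
To make this concern concrete, here is a small sketch with a hypothetical `Point` label type that is both hashable and iterable. `looks_list_like` is only a rough approximation of the idea behind `is_list_like` (iterable, but not a string), not pandas' implementation.

```python
from collections.abc import Hashable, Iterable


def looks_list_like(obj):
    """Rough approximation (NOT pandas' code): iterable, but not str/bytes."""
    return isinstance(obj, Iterable) and not isinstance(obj, (str, bytes))


class Point:
    """Hypothetical label type that is hashable *and* iterable."""

    def __init__(self, x, y):
        self.x, self.y = x, y

    def __iter__(self):
        return iter((self.x, self.y))

    def __hash__(self):
        return hash((self.x, self.y))

    def __eq__(self, other):
        return isinstance(other, Point) and tuple(self) == tuple(other)


p = Point(1, 2)
is_hashable = isinstance(p, Hashable)
works_as_dict_key = {p: 'value'}[Point(1, 2)] == 'value'
is_list_like_by_sketch = looks_list_like(p)

print(is_hashable, works_as_dict_key, is_list_like_by_sketch)  # True True True
```

Under an iterable-based check, such an object would be treated as list-like even though it is a perfectly usable key.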

Member

I know this is a PITA .. (and thanks for the updates!)

Contributor Author

@h-vetinari h-vetinari Jan 31, 2019

> custom objects that are hashable, but also iterable

There's a line to draw somewhere, right? Iterables should not be keys (excluding strings, obviously), hashable or not. Have a look at this franken-example and tell me this is something we should explicitly support.

I think the current set-up already goes a long way towards making it really clear to the user what's happening or what's wrong. But hashable and iterable? I don't know how to sort that out with reasonable complexity (while keeping more important errors clean and clear), and I already spent way too much time that I don't have today on the last commit, to help with getting out 0.24.1

Member

And the line we draw is on hashable (at least, if it is not hashable, the indexing machinery simply doesn't work), not on iterable (tuples are iterable, strings are iterable)

I am not saying I would use it myself, but there are certainly reasonable examples of things you can put in an object dtype array that are iterable. One example is shapely geometry objects, which are iterable (iterating through the coordinate points; strictly speaking, they are currently not hashable, but that is something they plan to fix).

Given 0.24.1, I think we might want to release today. Are you fine with us adding some small changes here to get us to merge it during the day? (keeping the broad rationale of course)
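
The "line on hashable" argument can be sketched as follows, assuming a plain dict as a stand-in for `df.columns`; the helper `check_key` is hypothetical and only mirrors the `try` / `except TypeError` / `else` shape of the diff. Strings and tuples are iterable but hashable, so they work as keys, while an unhashable object fails at the lookup itself.

```python
# A plain dict stands in for df.columns (assumption for illustration).
columns = {'A': 0, ('x', 'y'): 1}


def check_key(col):
    """Return whether ``col`` is a known key; raise TypeError if unhashable."""
    try:
        found = col in columns
    except TypeError:
        raise TypeError('unhashable key of type {}'.format(type(col)))
    else:
        return found


results = {
    'string': check_key('A'),          # iterable, hashable, present
    'tuple': check_key(('x', 'y')),    # iterable, hashable, present
    'missing': check_key('C'),         # hashable, but absent
}

set_error = None
try:
    check_key({'A'})                   # a set is iterable but not hashable
except TypeError as exc:
    set_error = str(exc)

print(results)    # {'string': True, 'tuple': True, 'missing': False}
print(set_error)  # unhashable key of type <class 'set'>
```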

Contributor Author

@h-vetinari h-vetinari Jan 31, 2019

@jorisvandenbossche
Feel free to make any desired changes to this PR; my strong preference is no KeyError for iterables, though.

> there are certainly reasonable examples of things you can put in an object dtype array that are iterable

Why should an object dtype array (which is mutable) be hashable?

My suggestion would be that you try to come up with a hashable/iterable example that would also be list-like (from the POV of is_list_like, which would likely exclude various custom types), before you start changing things around, but feel free.

Edit: didn't see the shapely example, but this case is simply at odds with the iterator/generator case, that - I'd argue - is more widespread.

Member

> Why should an object dtype array (which is mutable) be hashable?

Sorry, I just generally meant object dtype array-like, like an Index or Series. So that it can make sense to put hashable but iterable objects in an Index or Series

Member

> My suggestion would be that you try to come up with a hashable/iterable example that would also be list-like

As I said, e.g. a shapely geometry.
(Anything that is iterable is regarded as 'list-like' in the eyes of is_list_like.)

Contributor Author

Didn't see your comments pop up, and edited my answer independently about the shapely case.

In any case, I can't work on this any more today. Failures seem to be due to dict-views being hashable on PY2. That case could simply be removed from the parametrization, it was an extra mile I tried to go.
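
For reference, a short sketch of how the four `box` variants used in the tests behave under `hash()` on Python 3 (it makes no claim about Python 2). The helper `is_hashable` is hypothetical.

```python
def is_hashable(obj):
    """Return True if hash(obj) succeeds, False if it raises TypeError."""
    try:
        hash(obj)
    except TypeError:
        return False
    return True


values = ['a', 'b', 'c']
boxes = {
    'set': set(values),
    'dict-view': dict(zip(values, values)).keys(),
    'iter': iter(values),
    'generator': (v for v in values),
}
hashability = {name: is_hashable(obj) for name, obj in boxes.items()}

# On Python 3:
# {'set': False, 'dict-view': False, 'iter': True, 'generator': True}
print(hashability)
```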

             else:
                 # everything else gets tried as a key; see GH 24969
                 try:
-                    self[col]
-                except KeyError:
+                    found = col in self.columns
+                except TypeError:
                     tipo = type(col)
Contributor

same as above

-                    raise ValueError(err_msg,
-                                     'Received column of type {}'.format(tipo))
+                    raise TypeError(err_msg + ' Received column of '
+                                    'type {}'.format(tipo))
+                else:
+                    if not found:
+                        missing.append(col)

         if missing:
             raise KeyError('{}'.format(missing))
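
One detail of the change above is that the message is now concatenated into a single string instead of being passed as two separate exception arguments. A minimal sketch of the difference, using a shortened placeholder for `err_msg` (the real message is longer):

```python
# Shortened placeholder; the real err_msg in frame.py is longer (assumption).
err_msg = 'The parameter "keys" may be a column key'
tipo = set

# Two positional arguments: str() of the exception shows the args tuple.
two_args = ValueError(err_msg, 'Received column of type {}'.format(tipo))

# One concatenated string: str() of the exception is a readable message.
one_arg = TypeError(err_msg + ' Received column of '
                    'type {}'.format(tipo))

two_args_text = str(two_args)
one_arg_text = str(one_arg)

print(len(two_args.args), len(one_arg.args))  # 2 1
print(one_arg_text)
# The parameter "keys" may be a column key Received column of type <class 'set'>
```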
70 changes: 65 additions & 5 deletions pandas/tests/frame/test_alter_axes.py
@@ -255,21 +255,40 @@ def test_set_index_raise_keys(self, frame_of_index_cols, drop, append):

     @pytest.mark.parametrize('append', [True, False])
     @pytest.mark.parametrize('drop', [True, False])
-    @pytest.mark.parametrize('box', [set, iter])
-    def test_set_index_raise_on_type(self, frame_of_index_cols, box,
-                                     drop, append):
+    @pytest.mark.parametrize('box', [iter, lambda x: (y for y in x)],
+                             ids=['iter', 'generator'])
+    def test_set_index_raise_on_type_iter(self, frame_of_index_cols, box,
+                                          drop, append):
         df = frame_of_index_cols

         msg = 'The parameter "keys" may be a column key, .*'
-        # forbidden type, e.g. set/tuple/iter
+        # forbidden type, e.g. iter/generator
         with pytest.raises(ValueError, match=msg):
             df.set_index(box(df['A']), drop=drop, append=append)

-        # forbidden type in list, e.g. set/tuple/iter
+        # forbidden type in list, e.g. iter/generator
         with pytest.raises(ValueError, match=msg):
             df.set_index(['A', df['A'], box(df['A'])],
                          drop=drop, append=append)

+    @pytest.mark.parametrize('append', [True, False])
+    @pytest.mark.parametrize('drop', [True, False])
+    @pytest.mark.parametrize('box', [set, lambda x: dict(zip(x, x)).keys()],
+                             ids=['set', 'dict-view'])
+    def test_set_index_raise_on_type_unhashable(self, frame_of_index_cols, box,
+                                                drop, append):
+        df = frame_of_index_cols
+
+        msg = 'The parameter "keys" may be a column key, .*'
+        # forbidden type that is unhashable, e.g. set/dict-view
+        with pytest.raises(TypeError, match=msg):
+            df.set_index(box(df['A']), drop=drop, append=append)
+
+        # forbidden type in list that is unhashable, e.g. set/dict-view
+        with pytest.raises(TypeError, match=msg):
+            df.set_index(['A', df['A'], box(df['A'])],
+                         drop=drop, append=append)
+
     def test_set_index_custom_label_type(self):
         # GH 24969

@@ -281,6 +300,10 @@ def __init__(self, name, color):
             def __str__(self):
                 return "<Thing %r>" % (self.name,)

+            def __repr__(self):
+                # necessary for pretty KeyError
+                return self.__str__()
+
         thing1 = Thing('One', 'red')
         thing2 = Thing('Two', 'blue')
         df = DataFrame({thing1: [0, 1], thing2: [2, 3]})
@@ -295,6 +318,43 @@ def __str__(self):
         result = df.set_index([thing2])
         tm.assert_frame_equal(result, expected)

+        # missing key
+        thing3 = Thing('Three', 'pink')
+        msg = "<Thing 'Three'>"
+        with pytest.raises(KeyError, match=msg):
+            # missing label directly
+            df.set_index(thing3)
+
+        with pytest.raises(KeyError, match=msg):
+            # missing label in list
+            df.set_index([thing3])
+
+    def test_set_index_custom_label_type_raises(self):
+        # GH 24969
+
+        # purposefully inherit from something unhashable
+        class Thing(set):
+            def __init__(self, name, color):
+                self.name = name
+                self.color = color
+
+            def __str__(self):
+                return "<Thing %r>" % (self.name,)
+
+        thing1 = Thing('One', 'red')
+        thing2 = Thing('Two', 'blue')
+        df = DataFrame([[0, 2], [1, 3]], columns=[thing1, thing2])
+
+        msg = 'The parameter "keys" may be a column key, .*'
+
+        with pytest.raises(TypeError, match=msg):
+            # use custom label directly
+            df.set_index(thing2)
+
+        with pytest.raises(TypeError, match=msg):
+            # custom label wrapped in list
+            df.set_index([thing2])
+
     def test_construction_with_categorical_index(self):
         ci = tm.makeCategoricalIndex(10)
         ci.name = 'B'
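
The `__repr__` added to the test's `Thing` class matters because the missing keys are formatted as a list (`'{}'.format(missing)` in the frame.py change), and formatting a list uses the `repr` of its elements rather than their `str`. A small sketch with two simplified stand-in classes (the names `PlainThing` and `PrettyThing` are hypothetical):

```python
class PlainThing:
    """Stand-in with only __str__, as in the test before this commit."""

    def __init__(self, name):
        self.name = name

    def __str__(self):
        return "<Thing %r>" % (self.name,)


class PrettyThing(PlainThing):
    """Stand-in that also defines __repr__, as added in this commit."""

    def __repr__(self):
        return self.__str__()


# Formatting a list calls repr() on each element.
plain_message = '{}'.format([PlainThing('Three')])
pretty_message = '{}'.format([PrettyThing('Three')])

print(pretty_message)                      # [<Thing 'Three'>]
print("<Thing 'Three'>" in plain_message)  # False
```

Without `__repr__`, the list shows the default object representation, so the test's `match="<Thing 'Three'>"` pattern would not be found in the message.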