[PERF] Get rid of MultiIndex conversion in IntervalIndex.intersection #26225

Merged
merged 33 commits into from
Jun 6, 2019
Commits
a5a1272
Gid rid of MultiIndex conversion in IntervalIndex.intersection
makbigc Apr 21, 2019
3cd095a
Add benchmark for IntervalIndex.intersection
makbigc Apr 27, 2019
0486a4e
clear code
makbigc Apr 27, 2019
09c89f1
Add whatsnew note
makbigc Apr 27, 2019
841a0b7
Modity the case for duplicate index
makbigc May 1, 2019
8b22623
Combine the set operation to find indexer into one
makbigc May 1, 2019
32d4005
Move setops tests to test_setops.py and add two tests
makbigc May 1, 2019
d502fcb
Remove relundant line
makbigc May 1, 2019
8ec6366
Remove duplicate line in whatsnew note
makbigc May 1, 2019
6000904
Isort interval/test_setops.py
makbigc May 1, 2019
7cb7d2c
Split the intersection into two sub-functions
makbigc May 1, 2019
bcf36bb
Functionalize some indexes
makbigc May 5, 2019
745c0bb
Remove relundant lines in whatsnew
makbigc May 5, 2019
ff8bb97
Fixturize the sort parameter
makbigc May 6, 2019
17d775f
Factor out the check and decorate the setops
makbigc May 7, 2019
03a989a
Add docstring to two subfunction
makbigc May 8, 2019
b36cbc8
Add intersection into _index_shared_docs
makbigc May 8, 2019
1cdb170
Isort and change the decorator's name
makbigc May 10, 2019
18c2d37
Remove object inheritance
makbigc May 11, 2019
d229677
merge master
makbigc May 14, 2019
35594b0
Add docstring to setop_check
makbigc May 16, 2019
0834206
Merge master again
makbigc May 16, 2019
3cf5be8
merge again
makbigc May 23, 2019
9cf9b7e
complete merge
makbigc May 23, 2019
ab67edd
2nd approach
makbigc May 25, 2019
402b09c
Add a new benchmark
makbigc May 25, 2019
b4f130d
Fix linting issue
makbigc May 25, 2019
3ff4c64
Change the decorator name to SetopCheck
makbigc May 26, 2019
3db3130
Amend and add test for a more corner case
makbigc May 28, 2019
1f25adb
Merge commit to resolve conflict
makbigc May 28, 2019
4a9cd29
merge master
makbigc May 29, 2019
1467e94
merge
makbigc Jun 4, 2019
ea2550a
merge again
makbigc Jun 6, 2019
Amend and add test for a more corner case
makbigc committed May 28, 2019
commit 3db3130bf2dece5394aaff5c919f18de4e342912
4 changes: 4 additions & 0 deletions pandas/core/indexes/interval.py
@@ -1165,6 +1165,10 @@ def _intersection_non_unique(self, other):
        """
        mask = np.zeros(len(self), dtype=bool)
Member

There might be an issue with this approach when dupes are present in self and other. For other index types, such a scenario can result in more dupes being present in the intersection than in self. This behavior looks a bit buggy and inconsistent though, so I'm not sure if we actually want IntervalIndex to be consistent with it.

Some examples of the buggy and inconsistent behavior with Index:

In [2]: idx2 = pd.Index(list('aa')) 
   ...: idx3 = pd.Index(list('aaa')) 
   ...: idx3b = pd.Index(list('baaa'))

In [3]: idx2.intersection(idx3)
Out[3]: Index(['a', 'a', 'a', 'a'], dtype='object')

In [4]: idx3.intersection(idx3)
Out[4]: Index(['a', 'a', 'a'], dtype='object')

In [5]: idx2.intersection(idx3)
Out[5]: Index(['a', 'a', 'a', 'a'], dtype='object')

In [6]: idx2.intersection(idx3b)
Out[6]: Index(['a', 'a', 'a'], dtype='object')

It seems strange that [3] has more dupes present than in either original index but [4] does not. Similarly, it seems like [5] and [6] should be identical, as the presence of a non-intersecting element shouldn't impact the number of dupes returned.

@jreback : Do you know what the expected behavior for intersection with dupes should be? Or if there are any dependencies on the behavior of intersection that would dictate this?

If we treat indexes like multisets, then the intersection should contain the minimum multiplicity of dupes, e.g. idx2.intersection(idx3) and idx3.intersection(idx2) should both have length 2, so you maintain the property of the intersection being a subset of the original indexes.
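The multiset reading can be sketched with the standard library alone. This is only an illustration of the proposed semantics, not of what pandas does: `multiset_intersection` is a hypothetical helper, and plain lists stand in for Index objects.

```python
from collections import Counter


def multiset_intersection(left, right):
    """Keep each shared label min(count in left, count in right) times.

    Hypothetical helper for illustration; not part of pandas.
    Order follows the appearance of each label in ``left``.
    """
    # Counter & Counter keeps the minimum count of every common key.
    remaining = Counter(left) & Counter(right)
    result = []
    for label in left:
        if remaining[label] > 0:
            result.append(label)
            remaining[label] -= 1
    return result


idx2 = list("aa")
idx3 = list("aaa")
idx3b = list("baaa")

print(multiset_intersection(idx2, idx3))   # ['a', 'a']
print(multiset_intersection(idx3, idx2))   # ['a', 'a']
print(multiset_intersection(idx2, idx3b))  # ['a', 'a']
```

Under this rule the result never holds more copies of a label than either input, and an unrelated label such as 'b' does not change the count.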

Contributor

yeah this is weird as these are set ops

what happens (meaning how much breakage) if

  • raise if left and right are not unique
  • uniquify left and right

prob need to do this for all set ops
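The two options above can be sketched as follows. This is a rough illustration under stated assumptions, not the change that was tried in indexes/base.py: both function names are hypothetical, and plain lists stand in for Index objects.

```python
def intersection_strict(left, right):
    """Option 1: refuse non-unique inputs (hypothetical helper)."""
    for name, values in (("left", left), ("right", right)):
        if len(set(values)) != len(values):
            raise ValueError(f"{name} contains duplicates")
    right_set = set(right)
    return [v for v in left if v in right_set]


def intersection_uniquified(left, right):
    """Option 2: drop duplicates first, keeping first-seen order (hypothetical helper)."""
    right_set = set(right)
    # dict.fromkeys preserves insertion order and removes repeats.
    return [v for v in dict.fromkeys(left) if v in right_set]


print(intersection_uniquified(list("aa"), list("baaa")))  # ['a']
```

Either option makes the result independent of how many duplicates each side carries; the open question in the thread is how much existing code relies on the current behavior.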

Member

Haven't had time to extensively test this out, but I made the two changes you suggested in indexes/base.py for intersection and both resulted in some breakage. Aside from some breakage in the index set ops tests, there was also some breakage in tests/reshape/test_merge.py.


        if self.hasnans and other.hasnans:
            first_nan_loc = np.arange(len(self))[self.isna()][0]
            mask[first_nan_loc] = True

        lmiss = other.left.get_indexer_non_unique(self.left)[1]
        lmatch = np.setdiff1d(np.arange(len(self)), lmiss)
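
The NaN step in the hunk above marks only the first missing position in self, and only when both sides contain a missing value, so repeated NaNs collapse to a single entry in the result. A minimal pure-Python sketch of that idea, assuming plain lists of floats in place of the interval array and a hypothetical function name `first_nan_mask`:

```python
import math


def first_nan_mask(self_values, other_values):
    """Return a boolean mask that flags at most one NaN position.

    Hypothetical stand-in for the mask logic; lists replace NumPy arrays.
    """
    mask = [False] * len(self_values)
    nan_positions = [i for i, v in enumerate(self_values) if math.isnan(v)]
    other_has_nan = any(math.isnan(v) for v in other_values)
    if nan_positions and other_has_nan:
        # Only the first NaN is kept, mirroring first_nan_loc in the diff.
        mask[nan_positions[0]] = True
    return mask


nan = float("nan")
print(first_nan_mask([nan, nan], [nan]))  # [True, False]
print(first_nan_mask([nan, nan], [1.0]))  # [False, False]
```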

7 changes: 7 additions & 0 deletions pandas/tests/indexes/interval/test_setops.py
@@ -103,6 +103,13 @@ def test_intersection(self, closed, sort):
        result = index.intersection(other)
        tm.assert_index_equal(result, expected)

        # GH 26225: duplicate nan element
        index = IntervalIndex([np.nan, np.nan])
        other = IntervalIndex([np.nan])
        expected = IntervalIndex([np.nan])
        result = index.intersection(other)
        tm.assert_index_equal(result, expected)

    def test_difference(self, closed, sort):
        index = IntervalIndex.from_arrays([1, 0, 3, 2],
                                          [1, 2, 3, 4],