duplicated() (and drop_duplicates()) incorrectly identifying unique/different rows as duplicates #26762

Closed
drewhouston opened this issue Jun 10, 2019 · 2 comments


drewhouston commented Jun 10, 2019

Code Sample, a copy-pastable example if possible

import pandas as pd

# Load the attached key->int64 pairs (pd.Series.from_csv is deprecated
# in 0.24.2 but still works there).
s = pd.Series.from_csv('duplicated_int64-bug.csv')

# Pure-Python pass: collect keys whose repeated entries carry different values.
d = {}
non_dupe_python = set()
for k, val in s.iteritems():
    if k not in d:
        d[k] = val
    elif val != d[k]:
        non_dupe_python.add(k)

# pandas pass: keys with two entries that duplicated() nevertheless marks unique.
dup = s.duplicated(keep=False)
has_2_entries = dup.groupby(level=0).count()
has_2_entries = has_2_entries[has_2_entries > 1].index

non_dupe_pd = set(dup[~dup].index & has_2_entries)
print("Non-dupes (Python, %d): %r" % (len(non_dupe_python), non_dupe_python))
print("Non-dupes (Series.duplicated, %d): %r" % (len(non_dupe_pd), non_dupe_pd))
print("False duplicates: %r" % (non_dupe_python - non_dupe_pd))

print("\nTwo false duplicates (problem here)")

print(s.loc[1793])
print(dup.loc[1793])

print(s.loc[1795])
print(dup.loc[1795])

print("\nA correctly-identified non-duplicate")
print(s.loc[6080])
print(dup.loc[6080])

print("\nA correctly-identified duplicate")
print(s.loc[1])
print(dup.loc[1])

duplicated_int64-bug.csv.zip

My output:

Non-dupes (Python, 9): set([6080, 1793, 1795, 6084, 1797, 1798, 5968, 5975, 6086])
Non-dupes (Series.duplicated, 7): set([6080, 6084, 1797, 1798, 5968, 5975, 6086])
False duplicates: set([1793, 1795])

Two false duplicates (problem here)
1793    1334122803464982
1793    1334122794298518
dtype: int64
1793    True
1793    True
dtype: bool
1795    1334122803464982
1795    1334122794298518
dtype: int64
1795    True
1795    True
dtype: bool

A correctly-identified non-duplicate
6080    91475737356800
6080    91475737482944
dtype: int64
6080    False
6080    False
dtype: bool

A correctly-identified duplicate
1    93251501583580
1    93251501583580
dtype: int64
1    True
1    True
dtype: bool

Problem description

Using 0.24.2, I was comparing two ~6,000-row DataFrames before and after some modifications, looking for modified rows by using concat and then drop_duplicates (with keep=False, although IIRC the issue also happens with other values of keep), and found that it was reporting false duplicates (i.e. missing rows that had in fact been modified).

Stepping into drop_duplicates(), then duplicated(), and then inspecting the output of get_group_index() (a series of key->int64 pairs that looks like a total ordering of the DataFrame, which I dumped into the attached csv) points to a problem most likely in duplicated_int64, which appears to be a Cython or native module. Unfortunately I don't have the cycles or toolchain to dig into it, but perhaps there's a false hash collision?

The repro and csv above include ~12,000 int64->int64 pairs with a variety of duplicated and non-duplicated pairs, plus some keys that appear only once.

I couldn't find a similar report, although this appears to be similar in principle to #11864 (a problem in duplicated_int64) and wasn't fixed by that change. The repro above should be able to reproduce and isolate the problem.
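For context, the comparison workflow described above can be sketched on toy data (the frame contents here are made up, not the real data):

```python
import pandas as pd

# Toy "before" and "after" frames; "after" modifies one row.
before = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})
after = pd.DataFrame({"a": [1, 2, 3], "b": [10, 99, 30]})

# Rows appearing only once across both frames are the modified ones:
# keep=False drops every member of each duplicate group.
changed = pd.concat([before, after]).drop_duplicates(keep=False)
print(changed)
```

With the data above, only the old and new versions of the modified row survive drop_duplicates.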

Expected Output

As shown above, the two actual values for index 1793 are different and should not be reported as duplicate rows (i.e. dup.loc[1793] should return False).

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 2.7.10.final.0
python-bits: 64
OS: Darwin
OS-release: 17.7.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: None.None

pandas: 0.24.2
pytest: None
pip: 18.0
setuptools: 38.4.0
Cython: None
numpy: 1.15.4
scipy: 1.2.0
pyarrow: None
xarray: None
IPython: 5.8.0
sphinx: None
patsy: None
dateutil: 2.7.3
pytz: 2013.7
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 1.3.1
openpyxl: 2.5.2
xlrd: 1.0.0
xlwt: 1.2.0
xlsxwriter: None
lxml.etree: None
bs4: 4.6.0
html5lib: 1.0.1
sqlalchemy: 1.3.3
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
gcsfs: None

@drewhouston drewhouston changed the title duplicated() incorrectly identifying unique/different rows as duplicates duplicated() (and drop_duplicates()) incorrectly identifying unique/different rows as duplicates Jun 10, 2019
qwhelan (Contributor) commented Jun 10, 2019

It appears the values are indeed duplicates, and everything is working as expected:

s[s == 1334122803464982]
1793    1334122803464982
1795    1334122803464982
dtype: int64

s[s == 1334122794298518]
1793    1334122794298518
1795    1334122794298518
dtype: int64

As duplicated() doesn't consider the index, these are correctly reported as dupes.
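A toy illustration of this point (values chosen to mirror the shape of the report, not the real data):

```python
import pandas as pd

# Two index labels, each carrying the same pair of values.
s = pd.Series([100, 200, 100, 200], index=[1793, 1793, 1795, 1795])

# duplicated() compares values only; the index labels take no part,
# so every entry belongs to a value-level duplicate group.
dup = s.duplicated(keep=False)
print(dup.tolist())  # [True, True, True, True]
```

Even though each index label pairs with a distinct value, every value occurs twice somewhere in the Series, so all entries are flagged.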

drewhouston (Author)
duplicated() not considering the index makes sense and likely gave rise to my issue. For anyone else hitting something similar: after concat, calling reset_index so the index becomes a column appears to fix it. Thanks for looking into it, and sorry for the false alarm.
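A minimal sketch of that workaround on toy data (the column name val and the values are made up):

```python
import pandas as pd

# "after" swaps the values of the two rows, so both rows changed.
before = pd.Series([5, 6], index=[1793, 1795], name="val")
after = pd.Series([6, 5], index=[1793, 1795], name="val")

# Value-only comparison sees each value twice, so keep=False drops
# everything and no row is reported as changed:
naive = pd.concat([before, after]).to_frame().duplicated(keep=False)

# reset_index turns the label into a column, so (index, value) pairs
# are compared and all four rows are correctly reported as changed.
changed = pd.concat(
    [before.reset_index(), after.reset_index()]
).drop_duplicates(keep=False)
print(changed)
```

The naive pass flags every row as a duplicate, while the reset_index version keeps all four (index, value) pairs, since each pair is distinct.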
