duplicated() (and drop_duplicates()) incorrectly identifying unique/different rows as duplicates #26762

Closed
drewhouston opened this issue Jun 10, 2019 · 2 comments


drewhouston commented Jun 10, 2019

Code Sample, a copy-pastable example if possible

import pandas as pd

# Load the attached key->int64 pairs (pd.Series.from_csv is deprecated
# in 0.24.2 but still works there).
s = pd.Series.from_csv('duplicated_int64-bug.csv')

# Pure-Python pass: collect keys whose repeated entries carry different values.
d = {}
non_dupe_python = set()
for k, val in s.iteritems():
    if k not in d:
        d[k] = val
    elif val != d[k]:
        non_dupe_python.add(k)

# pandas pass: keys with two entries that duplicated() nevertheless marks unique.
dup = s.duplicated(keep=False)
has_2_entries = dup.groupby(level=0).count()
has_2_entries = has_2_entries[has_2_entries > 1].index

non_dupe_pd = set(dup[~dup].index & has_2_entries)
print("Non-dupes (Python, %d): %r" % (len(non_dupe_python), non_dupe_python))
print("Non-dupes (Series.duplicated, %d): %r" % (len(non_dupe_pd), non_dupe_pd))
print("False duplicates: %r" % (non_dupe_python - non_dupe_pd))

print("\nTwo false duplicates (problem here)")

print(s.loc[1793])
print(dup.loc[1793])

print(s.loc[1795])
print(dup.loc[1795])

print("\nA correctly-identified non-duplicate")
print(s.loc[6080])
print(dup.loc[6080])

print("\nA correctly-identified duplicate")
print(s.loc[1])
print(dup.loc[1])

duplicated_int64-bug.csv.zip

My output:

Non-dupes (Python, 9): set([6080, 1793, 1795, 6084, 1797, 1798, 5968, 5975, 6086])
Non-dupes (Series.duplicated, 7): set([6080, 6084, 1797, 1798, 5968, 5975, 6086])
False duplicates: set([1793, 1795])

Two false duplicates (problem here)
1793    1334122803464982
1793    1334122794298518
dtype: int64
1793    True
1793    True
dtype: bool
1795    1334122803464982
1795    1334122794298518
dtype: int64
1795    True
1795    True
dtype: bool

A correctly-identified non-duplicate
6080    91475737356800
6080    91475737482944
dtype: int64
6080    False
6080    False
dtype: bool

A correctly-identified duplicate
1    93251501583580
1    93251501583580
dtype: int64
1    True
1    True
dtype: bool

Problem description

Using 0.24.2, I was comparing two ~6,000-row DataFrames before and after some modifications, looking for modified rows by using concat and then drop_duplicates (with keep=False, although IIRC the issue also happens with other values of keep), and found that it was reporting false duplicates (i.e. missing rows that had in fact been modified).

Stepping into drop_duplicates(), then duplicated(), and then inspecting the output of get_group_index() (a series of key->int64 pairs that looks like a total ordering of the DataFrame, which I dumped into the attached csv) points to a problem most likely in duplicated_int64, which appears to be a Cython or native module. Unfortunately I don't have the cycles or toolchain to dig into it, but perhaps there's a false hash collision?

The repro and csv above include ~12,000 int64->int64 pairs with a variety of duplicated and non-duplicated pairs, plus some keys that appear only once.

I couldn't find a similar report, although this appears to be similar in principle to #11864 (a problem in duplicated_int64) and wasn't fixed by that change. The repro above should be able to reproduce and isolate the problem.
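For context, the comparison workflow described above can be sketched on toy data (the frame contents here are made up, not the real data):

```python
import pandas as pd

# Toy "before" and "after" frames; "after" modifies one row.
before = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})
after = pd.DataFrame({"a": [1, 2, 3], "b": [10, 99, 30]})

# Rows appearing only once across both frames are the modified ones:
# keep=False drops every member of each duplicate group.
changed = pd.concat([before, after]).drop_duplicates(keep=False)
print(changed)
```

With the data above, only the old and new versions of the modified row survive drop_duplicates.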

Expected Output

As shown above, the two actual values for index 1793 are different and should not be reported as duplicate rows (i.e. dup.loc[1793] should return False).

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 2.7.10.final.0
python-bits: 64
OS: Darwin
OS-release: 17.7.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: None.None

pandas: 0.24.2
pytest: None
pip: 18.0
setuptools: 38.4.0
Cython: None
numpy: 1.15.4
scipy: 1.2.0
pyarrow: None
xarray: None
IPython: 5.8.0
sphinx: None
patsy: None
dateutil: 2.7.3
pytz: 2013.7
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 1.3.1
openpyxl: 2.5.2
xlrd: 1.0.0
xlwt: 1.2.0
xlsxwriter: None
lxml.etree: None
bs4: 4.6.0
html5lib: 1.0.1
sqlalchemy: 1.3.3
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
gcsfs: None

@drewhouston drewhouston changed the title duplicated() incorrectly identifying unique/different rows as duplicates duplicated() (and drop_duplicates()) incorrectly identifying unique/different rows as duplicates Jun 10, 2019
qwhelan (Contributor) commented Jun 10, 2019

It appears the values are indeed duplicates, and everything is working as expected:

s[s == 1334122803464982]
1793    1334122803464982
1795    1334122803464982
dtype: int64

s[s == 1334122794298518]
1793    1334122794298518
1795    1334122794298518
dtype: int64

As duplicated() doesn't consider the index, these are correctly reported as dupes.
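A toy illustration of this point (values chosen to mirror the shape of the report, not the real data):

```python
import pandas as pd

# Two index labels, each carrying the same pair of values.
s = pd.Series([100, 200, 100, 200], index=[1793, 1793, 1795, 1795])

# duplicated() compares values only; the index labels take no part,
# so every entry belongs to a value-level duplicate group.
dup = s.duplicated(keep=False)
print(dup.tolist())  # [True, True, True, True]
```

Even though each index label pairs with a distinct value, every value occurs twice somewhere in the Series, so all entries are flagged.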

drewhouston (Author)
duplicated() not considering the index makes sense and likely gave rise to my issue. For anyone else hitting something similar: after concat, calling reset_index so the index becomes a column appears to fix it. Thanks for looking into it, and sorry for the false alarm.
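A minimal sketch of that workaround on toy data (the column name val and the values are made up):

```python
import pandas as pd

# "after" swaps the values of the two rows, so both rows changed.
before = pd.Series([5, 6], index=[1793, 1795], name="val")
after = pd.Series([6, 5], index=[1793, 1795], name="val")

# Value-only comparison sees each value twice, so keep=False drops
# everything and no row is reported as changed:
naive = pd.concat([before, after]).to_frame().duplicated(keep=False)

# reset_index turns the label into a column, so (index, value) pairs
# are compared and all four rows are correctly reported as changed.
changed = pd.concat(
    [before.reset_index(), after.reset_index()]
).drop_duplicates(keep=False)
print(changed)
```

The naive pass flags every row as a duplicate, while the reset_index version keeps all four (index, value) pairs, since each pair is distinct.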
