drewhouston changed the title from "duplicated() incorrectly identifying unique/different rows as duplicates" to "duplicated() (and drop_duplicates()) incorrectly identifying unique/different rows as duplicates" on Jun 10, 2019.
duplicated() not considering the index makes sense and likely gave rise to my issue. For anyone else hitting this: doing concat and then reset_index, so the index becomes a column, appears to fix it. Thanks for looking into it & sorry for the false alarm.
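A minimal sketch of that workaround, with illustrative frames (not the original data): resetting the index after concat turns it into a regular column, so it participates in the duplicate check.

```python
import pandas as pd

# Two illustrative frames sharing the same index; only the second row changed.
before = pd.DataFrame({"val": [1, 2]}, index=[0, 1])
after = pd.DataFrame({"val": [1, 3]}, index=[0, 1])

# reset_index turns the index into a column, so rows with the same index but
# different values are compared on the index as well.
combined = pd.concat([before, after]).reset_index()
changed = combined.drop_duplicates(keep=False)  # only the modified rows remain
print(changed)
```

Here changed keeps both versions of the modified row (val 2 and val 3), one from each frame, while the unmodified row is dropped entirely.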
Code Sample, a copy-pastable example if possible
duplicated_int64-bug.csv.zip
My output:
Problem description
Using 0.24.2, I was comparing two ~6,000-row DataFrames before and after some modifications, looking for modified rows by using concat and then drop_duplicates (with keep=False, although IIRC the issue also happens with other values of keep), and found that it was reporting false duplicates (i.e. missing rows that were in fact modified).

Stepping into drop_duplicates(), then duplicated(), then looking at the output of get_group_index() (a series of key->int64 pairs which looks like a total ordering of the DataFrame, and which I dumped into the attached csv) points to a problem most likely in duplicated_int64, which looks like a Cython or native module (which unfortunately I don't have the cycles or toolchain to look into, but maybe there's a false hash collision?). The repro and csv above include ~12,000 int64->int64 pairs with a variety of duplicated and non-duplicated pairs and some keys that appear only once.
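For context, a minimal sketch of the comparison workflow described above (frames are illustrative, not the original 6,000-row data): concatenating the before/after frames and dropping all duplicated rows with keep=False should leave only the modified rows.

```python
import pandas as pd

# Illustrative before/after frames; only id 2 was modified.
df_before = pd.DataFrame({"id": [1, 2, 3], "val": [10, 20, 30]})
df_after = pd.DataFrame({"id": [1, 2, 3], "val": [10, 25, 30]})

# Rows present in both frames appear twice and are removed entirely by
# keep=False; only rows unique to one side (the modified ones) survive.
combined = pd.concat([df_before, df_after], ignore_index=True)
changed = combined.drop_duplicates(keep=False)
print(changed)
```

The bug reported here is that some genuinely modified rows (unique in the concatenated frame) were nevertheless dropped as duplicates.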
I couldn't find a similar report, although this appears to be similar in principle to #11864 (a problem in duplicated_int64) and wasn't fixed by that. The repro above should be able to reproduce & isolate the problem.

Expected Output
As shown above, the two actual values for index 1793 are different and should not be reported as duplicate rows (i.e. dup.loc[1793] should return False).

Output of pd.show_versions()

INSTALLED VERSIONS
commit: None
python: 2.7.10.final.0
python-bits: 64
OS: Darwin
OS-release: 17.7.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: None.None
pandas: 0.24.2
pytest: None
pip: 18.0
setuptools: 38.4.0
Cython: None
numpy: 1.15.4
scipy: 1.2.0
pyarrow: None
xarray: None
IPython: 5.8.0
sphinx: None
patsy: None
dateutil: 2.7.3
pytz: 2013.7
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 1.3.1
openpyxl: 2.5.2
xlrd: 1.0.0
xlwt: 1.2.0
xlsxwriter: None
lxml.etree: None
bs4: 4.6.0
html5lib: 1.0.1
sqlalchemy: 1.3.3
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
gcsfs: None
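To make the expected behavior concrete, a minimal sketch with illustrative data (not the original frames): two rows that share a key but differ in another column must not be flagged by duplicated().

```python
import pandas as pd

# Two rows share key 1793 but carry different values, so neither row is a
# full-row duplicate; duplicated() compares all columns by default.
df = pd.DataFrame({"key": [1793, 1793], "val": [1, 2]})
dup = df.duplicated(keep=False)
print(dup.tolist())  # [False, False]
```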