Inconsistency, NaT included in result of groupby method first but not NaN #10590

Closed
larvian opened this issue Jul 15, 2015 · 5 comments · Fixed by #10625
Labels
Bug Datetime Datetime data dtype Groupby Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate

@larvian
Contributor

larvian commented Jul 15, 2015

NaT is included in the result of the groupby method first, while NaN is not. I expect first to skip both NaN and NaT and return the first value for which pandas.isnull is False.
Demonstration of the inconsistency (note that both the NaT and the NaN in the data frame are produced by np.nan; the difference is that the d_t column contains date values):

import numpy as np
import pandas as pd
from datetime import datetime as dt

testFrame = pd.DataFrame({'IX': ['A', 'A'], 'num': [np.nan, 100], 'd_t': [np.nan, dt.now()]})

Resulting data frame:

  IX                     d_t  num
0  A                     NaT  NaN
1  A 2015-07-15 22:47:10.635  100

Grouping this data frame on the IX column and executing the first method results in this data frame which shows the inconsistency between the d_t and num columns.

testFrame.groupby('IX').first()

Resulting dataframe:

        d_t  num
IX              
A       NaT  100
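
The expected behavior can be written down as a check. This is a minimal sketch, assuming a pandas release that already contains the fix linked to this issue (#10625). The fixed timestamp stands in for dt.now() so the result is reproducible, the missing date is written as pd.NaT explicitly rather than np.nan, and the names frame, first and expected_ts are illustrative only.

```python
import numpy as np
import pandas as pd

# Fixed timestamp used instead of dt.now() so the outcome is deterministic.
expected_ts = pd.Timestamp("2015-07-15 22:47:10.635")

frame = pd.DataFrame({
    "IX": ["A", "A"],
    "num": [np.nan, 100.0],        # missing float followed by a real value
    "d_t": [pd.NaT, expected_ts],  # missing date followed by a real value
})

# first() is expected to skip nulls in every column, dates included.
first = frame.groupby("IX").first()
print(first)
```

On a fixed version, both columns of the single row for group A should hold the non-null value from the second input row.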
INSTALLED VERSIONS
------------------
commit: None
python: 2.7.9.final.0
python-bits: 64
OS: Windows
OS-release: 7
machine: AMD64
processor: Intel64 Family 6 Model 58 Stepping 9, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None

pandas: 0.16.2
nose: 1.3.4
Cython: 0.22
numpy: 1.9.2
scipy: 0.15.1
statsmodels: 0.6.1
IPython: 3.0.0
sphinx: 1.2.3
patsy: 0.3.0
dateutil: 2.4.2
pytz: 2015.4
bottleneck: None
tables: 3.1.1
numexpr: 2.3.1
matplotlib: 1.4.3
openpyxl: 1.8.5
xlrd: 0.9.3
xlwt: 0.7.5
xlsxwriter: 0.6.7
lxml: 3.4.2
bs4: 4.3.2
html5lib: 0.999
httplib2: None
apiclient: None
sqlalchemy: 0.9.9
pymysql: None
psycopg2: None
@jreback
Contributor

jreback commented Jul 15, 2015

hmm, @sinhrks I thought this was fixed?

@jreback jreback added Bug Datetime Datetime data dtype Groupby Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate labels Jul 15, 2015
@jreback jreback added this to the 0.17.0 milestone Jul 15, 2015
@sinhrks
Member

sinhrks commented Jul 15, 2015

I remember some issues when NaT is in the group key, but I wasn't aware of this aggregation issue. .min might be affected also.

@jreback
Contributor

jreback commented Jul 15, 2015

prob needs some adjustment for comparison vs iNaT in the first/last (though i thought it was there)

@sinhrks
Member

sinhrks commented Jul 16, 2015

Confirmed, .min is also affected. @larvian a PR is appreciated :)

testFrame.groupby('IX').min()
#    d_t  num
# IX         
# A  NaT  100
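
For comparison, the null-skipping result that .min is expected to give can be spelled out by dropping nulls per group before taking the minimum. This is only a sketch of the intended semantics, not the fix discussed in this thread; the helper min_skipping_nulls and the fixed timestamp are hypothetical names and values introduced for illustration.

```python
import numpy as np
import pandas as pd

ts = pd.Timestamp("2015-07-15 22:47:10.635")
frame = pd.DataFrame({
    "IX": ["A", "A"],
    "num": [np.nan, 100.0],
    "d_t": [pd.NaT, ts],
})


def min_skipping_nulls(column):
    """Return the smallest non-null value, or a null if every entry is null."""
    return column.dropna().min()


# Apply the explicit null-skipping minimum to the date column of each group.
explicit_min = frame.groupby("IX")["d_t"].agg(min_skipping_nulls)
print(explicit_min)
```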

@larvian
Contributor Author

larvian commented Jul 18, 2015

@sinhrks :) I see. Well, I have started to look into GitHub and the pandas source code a bit, but unfortunately I don't have much time available right now, so waiting for me could require patience.
If I get to the point where I feel confident that I can contribute to the solution, I will open a PR. If someone else does it, I am OK with that too :)

larvian added a commit to larvian/pandas that referenced this issue Jul 26, 2015
For groupby, timestamps are converted to integers, and NaT becomes the
integer value tslib.iNaT, which is -9223372036854775808. The aggregation
is then done on this value, which gives an incorrect result. The solution
proposed here is to replace it with np.nan for datetime or timedelta data.
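
Why the sentinel distorts the aggregation can be seen with NumPy alone. The sketch below is illustrative and is not the code from the commit: it assumes that the integer view of a datetime64 NaT equals the iNaT value quoted above, and it masks the sentinel out with a boolean index instead of replacing it with np.nan, since float64 cannot hold nanosecond timestamps exactly. The names INAT, stamps and masked_min are made up for this example.

```python
import numpy as np

# Smallest int64, the value quoted above for tslib.iNaT.
INAT = np.iinfo(np.int64).min

stamps = np.array(["NaT", "2015-07-15T22:47:10.635"], dtype="datetime64[ns]")
as_int = stamps.view("i8")  # integer nanoseconds; NaT shows up as INAT

# A plain integer minimum picks the sentinel, i.e. the "missing" entry wins.
naive_min = as_int.min()

# Excluding the sentinel first gives the minimum of the real timestamps.
masked_min = as_int[as_int != INAT].min()
recovered = np.datetime64(int(masked_min), "ns")

print(naive_min == INAT, recovered)
```

Because the sentinel is the smallest possible int64, any minimum taken over the raw integers returns it whenever a NaT is present.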
@jreback jreback modified the milestones: Next Major Release, 0.17.0 Sep 1, 2015
larvian added a commit to larvian/pandas that referenced this issue Sep 2, 2015
@jreback jreback modified the milestones: 0.17.0, Next Major Release Sep 2, 2015
nickeubank pushed a commit to nickeubank/pandas that referenced this issue Sep 29, 2015

3 participants