Type inference code coerces float column into datetime #4601

ghost · 2013-08-18T16:19:58Z

In [10]: ix = [-352.737091, 183.575577]
    ...: df=pd.DataFrame([0,1],index=ix)
    ...: df.to_csv("/tmp/1.csv")
    ...: df2=pd.DataFrame.from_csv("/tmp/1.csv",parse_dates=True)
    ...: print df
    ...: print df2
             0
-352.737091  0
 183.575577  1
                            0
2105-11-21 22:43:41.128654  0
1936-11-21 22:43:41.128654  1

Moved from #3171.

danbirken · 2013-09-16T23:55:50Z

The real root of this problem is that dateutil.parse() is really liberal about parsing weird values into datetimes. So I think the two best strategies are:

1. Stop using dateutil for datetime parsing, because it is too liberal. Have more constrained rules for what is acceptable as a datetime and change pandas to only parse those.
or
2. Add additional filtering before dateutil to hopefully provide more sane behavior

This change is an attempt at (2). The benefits of (2) are that it is much less of a backward-compatibility (BC) break, and hopefully it is more user-friendly [though this isn't a given: the current dateutil behavior can output some surprising results, like this bug]. I think (1) has a lot of merits as well, but my guess is (2) is more in the spirit of pandas.

So this change basically just moves the pre-filtering for dateutil into a new module, pandas.utils.datetime_parsing, and calls it before using dateutil. Performance should not be much of a concern: dateutil.parse() is much slower than the filter, and the filter only runs on code paths that would eventually reach dateutil anyway.

I think this is also nice in that it moves the dateutil pre-filtering into one spot with testing, so if we wanted to go more down this path of pre-filtering dateutil, we have a good place for it. Previously there was already a slight amount of dateutil pre-filtering in the cython.

…#4601

Currently dateutil will parse almost any string into a datetime. This change adds a filter in front of dateutil that will prevent it from parsing certain strings that don't look like datetimes:

1) Strings that parse to float values that are less than 1000
2) Certain special one character strings (this was already in there, this just moves that code)

Additionally, this filters out datetimes that are out of range for the datetime64[ns] type. Currently any out-of-range datetimes will just overflow and be mapped to some random time within the bounds of datetime64[ns].
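For readers who want to see the shape of such a pre-filter, here is a minimal sketch using only the standard library. It is not the code from the merged change: the function names `should_try_dateutil` and `in_ns_range` are hypothetical, and the set of one-character strings is an assumption, since the description above only says "certain special one character strings". The range bounds are derived from the fact that datetime64[ns] stores a signed 64-bit count of nanoseconds since the Unix epoch.

```python
from datetime import datetime, timedelta

# Assumption: an illustrative set of one-character strings to reject.
# The commit message does not list which characters are filtered.
NOT_DATELIKE_SINGLE_CHARS = frozenset("aAmMpPtT")

# datetime64[ns] is a signed 64-bit nanosecond offset from 1970-01-01.
# Python datetimes have microsecond resolution, so the bounds below are
# truncated inward to whole microseconds.
_EPOCH = datetime(1970, 1, 1)
_MAX_MICROSECONDS = (2**63 - 1) // 1000
NS_MAX = _EPOCH + timedelta(microseconds=_MAX_MICROSECONDS)
NS_MIN = _EPOCH - timedelta(microseconds=_MAX_MICROSECONDS)


def should_try_dateutil(text):
    """Return False for strings that should never reach the date parser."""
    text = text.strip()
    # Rule 2: special one-character strings.
    if len(text) == 1 and text in NOT_DATELIKE_SINGLE_CHARS:
        return False
    # Rule 1: strings that parse to a float value below 1000.
    try:
        value = float(text)
    except ValueError:
        return True  # not a plain number, so let the date parser decide
    if value < 1000:
        return False
    return True


def in_ns_range(dt):
    """Return True if a naive datetime fits in datetime64[ns] without overflow."""
    return NS_MIN <= dt <= NS_MAX
```

With this sketch, both index values from the report ("-352.737091" and "183.575577") are rejected before any date parsing happens, while a string such as "2013-08-18" or a bare year such as "2013" is still passed through. A parsed result outside `NS_MIN`..`NS_MAX` would be discarded instead of being allowed to wrap around.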

ghost mentioned this issue Aug 18, 2013

API: Can't override type sniffing in df.from_csv()? #3171

Closed

danbirken mentioned this issue Sep 17, 2013

BUG: Constrain date parsing from strings a little bit more #4601 #4863

Merged

jreback closed this as completed in #4863 Sep 20, 2013
