Type inference code coerces float column into datetime #4601

Closed
ghost opened this issue Aug 18, 2013 · 1 comment · Fixed by #4863

Comments

@ghost

ghost commented Aug 18, 2013

In [10]: ix = [-352.737091, 183.575577]
    ...: df=pd.DataFrame([0,1],index=ix)
    ...: df.to_csv("/tmp/1.csv")
    ...: df2=pd.DataFrame.from_csv("/tmp/1.csv",parse_dates=True)
    ...: print df
    ...: print df2
             0
-352.737091  0
 183.575577  1
                            0
2105-11-21 22:43:41.128654  0
1936-11-21 22:43:41.128654  1

Moved from #3171.

@danbirken
Contributor

The real root of this problem is that dateutil.parser.parse() is very liberal about parsing odd values into datetimes. So I think the two best strategies are:

  1. Stop using dateutil for datetime parsing, because it is too liberal. Have more constrained rules for what is acceptable as a datetime and change pandas to only parse those.
    or
  2. Add additional filtering before dateutil to hopefully provide more sane behavior

This change is an attempt at (2). The benefits of (2) are that it is much less of a backward-compatibility break, and hopefully it is more user-friendly [though this isn't a given: the current dateutil behavior can produce some surprising results, like this bug]. I think (1) has a lot of merit as well, but my guess is that (2) is more in the spirit of pandas.

So this change moves the pre-filtering for dateutil into a new module, pandas.utils.datetime_parsing, and calls it before falling back to dateutil. There is little performance concern: the filter is much cheaper than dateutil.parser.parse(), and it only runs on code paths that would end up calling dateutil anyway.
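To make the pre-filtering idea concrete, here is a minimal sketch and not the code from the actual change. It applies only the "float below 1000" rule described in the commit message further down. The names `looks_like_small_float`, `parse_datetime_filtered`, and `iso_parser` are hypothetical, and `iso_parser` is a stand-in for dateutil so the example runs with the standard library alone.

```python
from datetime import datetime


def looks_like_small_float(text, threshold=1000.0):
    """Return True if `text` parses as a float below `threshold`.

    Hypothetical helper; the threshold follows the rule quoted in the
    commit message, the exact comparison used there is an assumption.
    """
    try:
        return float(text) < threshold
    except ValueError:
        return False


def parse_datetime_filtered(text, fallback_parser):
    """Run the cheap filter first; call the liberal parser only if it passes."""
    if looks_like_small_float(text):
        return None  # signal "not a datetime" so the caller keeps the raw value
    return fallback_parser(text)


def iso_parser(text):
    """Stand-in for dateutil's parser (assumption: ISO dates only)."""
    return datetime.strptime(text, "%Y-%m-%d")


# The two index values from the report, plus one genuine date string.
index_values = ["-352.737091", "183.575577", "2013-08-18"]
parsed = [parse_datetime_filtered(v, iso_parser) for v in index_values]
print(parsed)
```

With this filter in front, the two float-like index values from the report are rejected before the parser ever sees them, while the genuine date string is still parsed.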

I think this is also nice in that it puts the dateutil pre-filtering in one spot, with tests, so if we want to go further down the path of pre-filtering dateutil, we have a good place for it. Previously there was already a small amount of dateutil pre-filtering in the Cython code.

danbirken added a commit to danbirken/pandas that referenced this issue Sep 20, 2013
…#4601

Currently dateutil will parse almost any string into a datetime.  This
change adds a filter in front of dateutil that will prevent it from
parsing certain strings that don't look like datetimes:

1) Strings that parse to float values that are less than 1000
2) Certain special one character strings (this was already in there,
   this just moves that code)

Additionally, this filters out datetimes that are out of range for the
datetime64[ns] type.  Currently any out-of-range datetimes will just
overflow and be mapped to some random time within the bounds of
datetime64[ns].
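The out-of-range check can be illustrated with a small standard-library sketch. This is an assumption about how such a check might look, not the code in the commit. It derives the representable span from the signed 64-bit nanosecond counter behind datetime64[ns], rounds the bounds inward to microseconds (the resolution of `datetime`), and handles naive datetimes only. `fits_datetime64_ns` and the constants are hypothetical names.

```python
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)
INT64_MAX_NS = 2**63 - 1  # largest nanosecond offset a signed 64-bit value holds

# datetime has microsecond resolution, so truncate the offset toward zero.
# This keeps both bounds safely inside the representable range.
_MAX_OFFSET = timedelta(microseconds=INT64_MAX_NS // 1000)
NS_MIN = EPOCH - _MAX_OFFSET
NS_MAX = EPOCH + _MAX_OFFSET


def fits_datetime64_ns(dt):
    """Return True if a naive datetime can be stored without overflowing."""
    return NS_MIN <= dt <= NS_MAX


print(NS_MIN, NS_MAX)
print(fits_datetime64_ns(datetime(2013, 8, 18)))
print(fits_datetime64_ns(datetime(2500, 1, 1)))
```

A caller could run this check on each parsed value and reject or flag anything outside the bounds instead of letting it wrap around.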