Infer datetime format #6021
what happens if there are mixed formats in a list of strings? does it guess based on the 1st? (e.g. how do u sample what to guess) I suppose that some formats, if they are mixed, will cause the reader to barf - so maybe have some tests, and prob just need to document this
Looking good: unintrusive, covers the major use case and has a good bang/buck ratio. Notes:
As far as generality goes, I think this hits a balance nearly perfectly. Marking it for 0.14. A future iteration could tackle the case of multiple date formats in a column:
Those are just for fun though, this already delivers what we really care about. 👍
two more:
One more:
Alright, updated version:
Oddities:
Other things:
instead of adding infer_datetime_format to read_csv, can u just intercept parse_dates='infer'? and if it's a dict maybe could for individual fields (if parse_dates is a list might be trickier). I am normally against overloading too much
oh. oh nice. yeah. you're against overloading. sure. hehe :)
closet overloader :)
I was thinking "filthy liar" but close enough :).
On risk/defaults, I suggest we ship this as off by default in 0.14, give it a prominent place I'd like to avoid making non-battle-tested code the default for everyone's production
👍 on 0.14.0 (though could put in 0.13.1 if it's turned off by default)
You're totally right. @danbirken if you can sort out the remaining issues let's just ship it next
why don't you just have an import flag in and set it if the function exists? we do this when we want to know at run-time if a module exists, but no reason why you can't do this. then your guesser can decide what to do based on the availability of the dateutil function?
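A minimal sketch of the import-flag idea suggested above, using only the standard library. The helper `optional_attr`, the flag `HAS_DATEUTIL_PARSE`, and the choice of `dateutil.parser.parse` as the function being probed are all assumptions for illustration; the thread does not show which dateutil function the guesser actually depends on.

```python
import importlib


def optional_attr(module_name, attr_name):
    """Return module.attr if both exist, else None (never raises ImportError)."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return None
    return getattr(module, attr_name, None)


# Hypothetical flag, set once at import time. `dateutil.parser.parse` stands in
# for whichever function the guesser really needs.
_dateutil_parse = optional_attr("dateutil.parser", "parse")
HAS_DATEUTIL_PARSE = _dateutil_parse is not None


def guess_strategy():
    """Pick a code path based on the flag computed at import time."""
    return "dateutil-based guess" if HAS_DATEUTIL_PARSE else "no guessing"
```

The check runs once when the module is loaded, so the guesser itself only has to branch on a boolean.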
However, the situation isn't perfect. It will still mess up cases a human wouldn't:
But sentinel values don't actually improve this case; this is just a problem with the current guessing method. However, this is a pretty rare edge case, as pretty much every standard datetime format puts the Y-m-d information first, which is what the guesser expects. So in conclusion, I think the sentinel values of 0 are actually perfectly good and I can't think of any case where they cause the guesser to do the wrong thing.
New questions: Assuming everybody is content with adding the
you can add to the separate issue: if you use the separate option (ok by me then :)) in theory it should be able to take a list of columns, because you might want to only infer on certain columns (annoying but prob needed)
@danbirken also will need an example in v0.13.1.txt (and you can use the same example) in io.rst (which you can prob use from a test you have anyhow)
I'm convinced by your points on the sentinels idea.
…das-dev#5490 Given an array of strings that represent datetimes, infer_format=True will attempt to guess the format of the datetimes, and if it can infer the format, it will use a faster function to convert/import the datetimes. In cases where this speed-up can be used, the function should be about 10x faster.
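The guess-then-fast-path behaviour described in this commit message, together with the earlier question about mixed formats, can be sketched in plain Python as follows. This is not the pandas implementation: the candidate format list and the names `parse_one`, `guess_from_candidates` and `parse_all` are assumptions made up for the sketch, and the standard library's `datetime.strptime` stands in for both the fast and the slow parser.

```python
from datetime import datetime

# Assumed candidate formats, for this sketch only.
CANDIDATE_FORMATS = ("%Y-%m-%d %H:%M:%S", "%Y-%m-%d", "%m/%d/%Y", "%Y%m%d")


def parse_one(text):
    """Slow path: try every candidate format for a single string."""
    for fmt in CANDIDATE_FORMATS:
        try:
            return datetime.strptime(text, fmt)
        except ValueError:
            continue
    raise ValueError(f"unrecognized datetime string: {text!r}")


def guess_from_candidates(text):
    """Return the first candidate format that parses `text`, or None."""
    for fmt in CANDIDATE_FORMATS:
        try:
            datetime.strptime(text, fmt)
        except ValueError:
            continue
        return fmt
    return None


def parse_all(strings):
    """Guess from the first string; fall back to the slow path on a mismatch."""
    fmt = guess_from_candidates(strings[0]) if strings else None
    if fmt is not None:
        try:
            # Fast path: one known format applied to every element.
            return [datetime.strptime(s, fmt) for s in strings]
        except ValueError:
            pass  # mixed formats: the guess did not hold for every element
    return [parse_one(s) for s in strings]
```

In this sketch a column with mixed formats does not fail; it simply loses the speed-up because the guess taken from the first element does not hold for the rest.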
Added vbench, fixed kwarg ordering, made keyword consistent everywhere (
Here is a link just to the changes from this round (the "official" pull request has all of these smashed together, but that makes it harder to review. Hopefully this helps): danbirken/pandas@731120f...6ed08d1
I admit I am saying this partially out of laziness, but making
@danbirken agree with your point about infer_datetime_format just being a boolean. looks good. would add the YYYYMMDD format to the common parsed format section. can u post a run of the new vbenches? otherwise looks good!
This allows read_csv() to attempt to infer the datetime format for any columns where parse_dates is enabled. In cases where the datetime format can be inferred, this should speed up processing datetimes by ~10x. Additionally add documentation and benchmarks for read_csv().
Updated docs. Here are the vbench outputs from my machine: With
When I manually set them to False (I don't know another way
The iso8601 one is a little slower, which makes sense because the extra time of guessing the format has to be taken into account, and you don't actually get any speedup. I don't know what these time units are, but in an absolute real-world sense I don't think the difference is a big deal. Additionally, the bigger your input, the less this difference will be felt.
.. ipython:: python

   # Try to infer the format for the index column
   df = pd.read_csv('foo.csv', index_col=0, parse_dates=True,
foo.csv has been removed before (under 'Specifying date columns'), so you will have to move that remove below this.
Let's call this after the doc tweaks, it's plenty good enough to merge.
@y-p merge away
this is going to close #5490
@danbirken thanks for this! awesome!!!! pls check docs and behavior....if need a small followup can easily put in 0.13.1!
I'm packing for a flight right now, but I'll get to those doc fixes when I'm on the plane in a few hours. I misunderstood the formatting and that was a case of copy/pasting gone wrong! And yes, this should close that issue.
@danbirken gr8! doc updates can take anytime.... thanks again!
I'll push a release note thanking @lexual and @danbirken
sure
closes #5490
Basically this attempts to figure out the datetime format if given a list of datetime strings. If the format can be guessed successfully, then to_datetime() attempts to use a faster way of processing the datetime strings.
Here is a gist that shows some timings of this function: https://gist.github.com/danbirken/8533199
If the format can be guessed, this generally speeds up the processing by 10x. In the case where the format string is guessed to be an iso8601 string or subset of that, then the format is not used, because the iso8601 fast-path is much faster than this (about 30-40x faster).
I just did a pretty basic way of guessing the format using only dateutil.parse, a simple regex and some simple logic. This doesn't use any private methods of dateutil. It certainly can be improved, but I wanted to get something started first before I went crazy having to handle more advanced cases.
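To give a feel for what "a simple regex and some simple logic" could look like, here is a rough sketch of a format guesser. It deliberately swaps in a different technique from the PR: instead of dateutil, it splits the string into digit runs and separators with a regex, assumes the fields appear in year, month, day, hour, minute, second order, and then checks the guess by round-tripping through `strptime`/`strftime`. The name `guess_format` and the fixed field order are assumptions of this sketch, not the PR's code.

```python
import re
from datetime import datetime

# Assumption for this sketch: fields always appear in this fixed order.
FIELD_ORDER = ("%Y", "%m", "%d", "%H", "%M", "%S")
# Expected number of digits for each field above.
FIELD_WIDTHS = (4, 2, 2, 2, 2, 2)


def guess_format(text):
    """Return a strptime format for `text`, or None if it cannot be guessed."""
    tokens = re.findall(r"\d+|\D+", text)  # digit runs and separators
    parts = []
    field = 0
    for token in tokens:
        if token.isdigit():
            if field >= len(FIELD_ORDER) or len(token) != FIELD_WIDTHS[field]:
                return None
            parts.append(FIELD_ORDER[field])
            field += 1
        else:
            parts.append(token.replace("%", "%%"))  # keep separators literal
    if field < 3:
        return None  # need at least year, month and day
    fmt = "".join(parts)
    try:
        round_trip = datetime.strptime(text, fmt).strftime(fmt)
    except ValueError:
        return None
    # Only accept the guess if formatting the parsed value reproduces the input.
    return fmt if round_trip == text else None
```

A guesser like this gives up (returns `None`) on anything it does not recognise, such as month-first dates or undelimited `YYYYMMDD` strings, so the caller can fall back to the slower general parser.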