set_index(DatetimeIndex) unexpectedly shifts tz-aware datetime #12358

Closed
wavexx opened this issue Feb 16, 2016 · 9 comments
Labels
Bug Timezones Timezone data dtype

@wavexx

wavexx commented Feb 16, 2016

This is another issue I've found in code that used to work:

import pandas as pd
tm = pd.DatetimeIndex(pd.to_datetime(["2014-01-01 10:10:10"]), tz='UTC').tz_convert('Europe/Rome')
df = pd.DataFrame({'tm': tm})
df.set_index(df.tm, inplace=True)
print(df.tm[0].hour)
print(df.index[0].hour)

writes:

11
10

It's unclear to me why the time is shifted. If we take a pd.DatetimeIndex which is not directly contained in the df, it works as it should:

tm = pd.DatetimeIndex(pd.to_datetime(["2014-01-01 10:10:10"]), tz='UTC').tz_convert('Europe/Rome')
df = pd.DataFrame({'tm': tm})
df.set_index(tm, inplace=True)
print(df.tm[0].hour)
print(df.index[0].hour)

writes:

11
11
@jreback jreback added Usage Question Timezones Timezone data dtype labels Feb 16, 2016
@jreback
Contributor

jreback commented Feb 16, 2016

A couple of things:

  1. Your syntax is incorrect (yes, this did work, but it is completely misleading), as it's not clear that you actually mean to localize.

So construct the index like this. In other words, you have to say "hey, this is a local UTC time", THEN convert it.

tm = pd.DatetimeIndex(pd.to_datetime(["2014-01-01 10:10:10"])).tz_localize('UTC').tz_convert('Europe/Rome')

In [4]: tm
Out[4]: DatetimeIndex(['2014-01-01 11:10:10+01:00'], dtype='datetime64[ns, Europe/Rome]', freq=None)
  2. df.set_index(df.tm, inplace=True)

This is a nonsensical operation, what do you think this should do?

you probably mean
df.index = df.tm

You are effectively setting the index with a 'key' from the array; this technically works as you only have 1 element (otherwise it would raise), but as I said, it doesn't make any sense.

In [30]: df.set_index?
Signature: df.set_index(keys, drop=True, append=False, inplace=False, verify_integrity=False)
Docstring:
Set the DataFrame index (row labels) using one or more existing
columns. By default yields a new object.

Parameters
----------
keys : column label or list of column labels / arrays
drop : boolean, default True
    Delete columns to be used as the new index
append : boolean, default False
    Whether to append columns to existing index
inplace : boolean, default False
    Modify the DataFrame in place (do not create a new object)
verify_integrity : boolean, default False
    Check the new index for duplicates. Otherwise defer the check until
    necessary. Setting to False will improve the performance of this
    method

Examples
--------
>>> indexed_df = df.set_index(['A', 'B'])
>>> indexed_df2 = df.set_index(['A', [0, 1, 2, 0, 1, 2]])
>>> indexed_df3 = df.set_index([[0, 1, 2, 0, 1, 2]])

Returns
-------
dataframe : DataFrame
File:      ~/pandas/pandas/core/frame.py
Type:      instancemethod
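
For reference, the two usual spellings of "use the tm column as the index" can be sketched as below. This is a minimal sketch assuming a recent pandas, with the frame rebuilt the way point 1 suggests; the names by_label and by_assign are only illustrative.

```python
import pandas as pd

# Build the tz-aware column as suggested in point 1: localize to UTC, then convert.
tm = (
    pd.DatetimeIndex(pd.to_datetime(["2014-01-01 10:10:10"]))
    .tz_localize("UTC")
    .tz_convert("Europe/Rome")
)
df = pd.DataFrame({"tm": tm})

# Option 1: pass the column label; set_index looks the column up itself.
by_label = df.set_index("tm")

# Option 2: assign the column to the index directly (the column is kept).
by_assign = df.copy()
by_assign.index = by_assign["tm"]

print(by_label.index[0].hour, by_assign.index[0].hour)
```

Both spellings avoid handing set_index a Series as the key, so on a recent pandas the two hours should agree with df.tm.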

@jreback jreback closed this as completed Feb 16, 2016
@wavexx
Author

wavexx commented Feb 17, 2016

It seems clear enough to me that if I know the tz of the series, there's no point in "localizing" it later.
In fact, I always start from UTC. pd.to_datetime has a utc keyword which I would have expected to make the DatetimeIndex UTC and tz-aware, which would be what I need 99% of the time, but it doesn't (the point of this argument is still unclear to me!?).

As for setting the index, yes, it's dodgy. It's a reduced test-case from some convoluted code.
However, why does it shift time? I see no reason why in this explicit case it should.

@jreback
Contributor

jreback commented Feb 17, 2016

Passing it rather than explicitly localizing leads to a lot of ambiguity. What should I be doing here?

In [1]: DatetimeIndex(['2014-01-01 11:10:10+01:00'],tz='UTC')
Out[1]: DatetimeIndex(['2014-01-01 10:10:10+00:00'], dtype='datetime64[ns, UTC]', freq=None)

As to your second point, it is converted to a numpy array, thus the tz is lost. The first arg only accepts a list or np.array, NOT a Series, exactly for this reason.
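
The "tz is lost" effect can be reproduced outside of set_index. The following is a minimal sketch, assuming a recent pandas, that pushes a tz-aware Series through a plain numpy datetime64 array; it illustrates the mechanism described above and is not the actual set_index code path.

```python
import pandas as pd

s = pd.Series(
    pd.to_datetime(["2014-01-01 10:10:10"]).tz_localize("UTC").tz_convert("Europe/Rome")
)

# A native datetime64 array cannot carry a timezone: values are converted
# to UTC and the tz information is dropped.
raw = s.to_numpy(dtype="datetime64[ns]")

# Rebuilding an index from the raw array yields naive timestamps on the UTC clock.
naive = pd.DatetimeIndex(raw)

print(s.dt.hour.iloc[0])  # hour as seen in Europe/Rome
print(naive[0].hour)      # hour of the underlying UTC value
print(naive.tz)           # no timezone attached any more
```

The one-hour difference between the first two printed values is the same shift as in the original report.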

@wavexx
Author

wavexx commented Feb 17, 2016

tz_localize() converts the timezone, I explicitly don't want it to do any conversion as my dates do not contain any.

In fact, if I could bug you one more time about this, what's the most efficient way to start from a unix timestamp (obviously in UTC) and get to a localized series?

@jreback
Contributor

jreback commented Feb 17, 2016

NO, tz_localize SETS the timezone; tz_convert converts it!

Here are some examples.

You CAN use the utc=True flag on pd.to_datetime; this WILL return it localized to UTC (just don't do this directly with DatetimeIndex). All will be well if you use pd.to_datetime for all conversion needs, then operate on the resulting objects.

In [2]: v = Timestamp('20130101').value

In [3]: v
Out[3]: 1356998400000000000

In [4]: pd.to_datetime(v,unit='ns')
Out[4]: Timestamp('2013-01-01 00:00:00')

In [5]: pd.to_datetime(v/1000000,unit='ms')
Out[5]: Timestamp('2013-01-01 00:00:00')

In [6]: pd.to_datetime(v/1000000,unit='ms').tz_localize('UTC')
Out[6]: Timestamp('2013-01-01 00:00:00+0000', tz='UTC')

In [7]: pd.to_datetime(v/1000000,unit='ms',utc=True)
Out[7]: Timestamp('2013-01-01 00:00:00+0000', tz='UTC')

In [8]: pd.to_datetime(v/1000000,unit='ms').tz_localize('UTC')
Out[8]: Timestamp('2013-01-01 00:00:00+0000', tz='UTC')

In [9]: Series(pd.to_datetime(v/1000000,unit='ms').tz_localize('UTC'))
Out[9]: 
0   2013-01-01 00:00:00+00:00
dtype: datetime64[ns, UTC]

In [10]: Series(pd.to_datetime(v/1000000,unit='ms')).dt.tz_localize('UTC').dt.tz_convert('US/Eastern')
Out[10]: 
0   2012-12-31 19:00:00-05:00
dtype: datetime64[ns, US/Eastern]
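
Applied to the question about Unix timestamps, the same steps might look like the sketch below. It assumes a recent pandas and timestamps in whole seconds; unix_seconds is a made-up sample value, and no claim is made that this is the fastest route.

```python
import pandas as pd

# Hypothetical input: Unix timestamps in whole seconds (UTC by definition).
unix_seconds = [1388571010]  # 2014-01-01 10:10:10 UTC

# utc=True makes the result tz-aware (UTC) in one step.
utc_index = pd.to_datetime(unix_seconds, unit="s", utc=True)

# tz_convert keeps the instant and changes only the zone it is displayed in.
rome = pd.Series(utc_index).dt.tz_convert("Europe/Rome")

print(rome.iloc[0])
```

If the timestamps are in milliseconds or nanoseconds instead, only the unit argument changes.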

@wavexx
Author

wavexx commented Feb 17, 2016

On Wed, Feb 17 2016, Jeff Reback wrote:

You CAN use the utc=True flag on pd.to_datetime; this WILL return it localized
to UTC. (just don't do this directly with DatetimeIndex. All will be well if
you use pd.to_datetime for all conversion needs, then operate on the resulting
objects

Ok, this made things a little bit clearer regarding the tz.
Point understood.

I'm still not super-happy about the set_index behavior. I've given it
some extra thought, but I don't see where and why the tz would be lost.

Where exactly does this conversion happen?

import pandas as pd
tm = pd.DatetimeIndex(pd.to_datetime(["2014-01-01 10:10:10"]), tz='UTC').tz_convert('Europe/Rome')
df = pd.DataFrame({'tm': tm})
print(df.set_index(tm).index[0].hour)
print(pd.DatetimeIndex(pd.Series(df.tm))[0].hour)
print(df.set_index(df.tm).index[0].hour)

=> 11 11 10

Ignore the fact that I could assign to index for a moment.

I'm supplying a type to set_index that should be equivalent to the
first or second print statement.

@jreback
Contributor

jreback commented Feb 17, 2016

looks like a bug after all!

fixed by #12365
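
To check whether an installed pandas still shows the shift, the three-way comparison from above can be turned into a small script. This is only a sketch, assuming a recent pandas and using explicit tz_localize for construction; on an unaffected version all three hours should agree.

```python
import pandas as pd

tm = (
    pd.to_datetime(["2014-01-01 10:10:10"])
    .tz_localize("UTC")
    .tz_convert("Europe/Rome")
)
df = pd.DataFrame({"tm": tm})

hours = [
    df.set_index(tm).index[0].hour,              # external DatetimeIndex as key
    pd.DatetimeIndex(pd.Series(df.tm))[0].hour,  # round trip through a Series
    df.set_index(df.tm).index[0].hour,           # the column itself as key
]
print(hours)
```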

@wavexx
Author

wavexx commented Feb 17, 2016

On Wed, Feb 17 2016, Jeff Reback wrote:

looks like a bug after all!

fixed by #12365

Sorry for being pedantic!

@jreback
Contributor

jreback commented Feb 17, 2016

no, persistence is good! you got me to actually step thru and see what was happening. always better to test.
