
Performance degradation in pandas 0.13/0.14 #7493


Closed
davaco opened this issue Jun 18, 2014 · 10 comments
Labels
Indexing Related to indexing on series/frames, not to indexes themselves Internals Related to non-user accessible pandas implementation Performance Memory or execution speed performance

Comments

davaco commented Jun 18, 2014

I reported previously on issue #7208. It was noted that .ix was slower as of pandas 0.13, but that this should only be noticeable in non-vectorized code. However, I sometimes have trouble vectorizing everything. Please consider the following code, in which I have a rather large correlation matrix with multi-indexed columns and would like to set the diagonal elements equal to 1.

import string

import numpy as np
import pandas as pd

def diag(cor, assets):
    for asset in assets:
        cor.ix[:, (asset, asset)] = 1

# create a multi-indexed column axis like ('A', 'A'), ('A', 'B'), ...
assets = list(string.ascii_uppercase)
columns = pd.MultiIndex.from_tuples([(a, b) for a in assets for b in assets])

# create the correlation matrix: 10000 rows x 26 * 26 = 676 columns
cor = pd.DataFrame(np.random.rand(10000, 676),
                   index=pd.date_range('1977/1/1', periods=10000, freq='D'),
                   columns=columns)

%time diag(cor, assets)

On my machine this takes approximately 9 ms in pandas 0.12, but 5.3 seconds in pandas 0.14 (!).
Perhaps the code above could be vectorized; if so, I would be curious how. My concern, however, is that I cannot always vectorize, and in those cases pandas seems to get slower with each new version from 0.11 onwards. Any comments or help will be greatly appreciated!


immerrr commented Jun 18, 2014

One possible reason is that the assignment changes dtype: np.random.rand returns floats, and you are assigning integers to whole columns, which causes reallocations in the DataFrame's storage. All subsequent runs of %time diag(cor, assets) are back at the millisecond level. So the real question here is whether pandas should coerce columns during whole-column assignment, and if it should, why it doesn't.
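
The dtype change described above is easy to reproduce in isolation (a minimal sketch with a made-up frame; plain `df[col] = scalar` assignment replaces the whole column, dtype and all):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(5, 2), columns=['a', 'b'])
print(df.dtypes['a'])   # float64

# whole-column assignment with an int replaces the column,
# changing its dtype and forcing a new block allocation
df['a'] = 1
print(df.dtypes['a'])   # int64

# assigning a float keeps the dtype unchanged
df['b'] = 1.0
print(df.dtypes['b'])   # float64
```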

Also, vectorization will probably not help here, diag_vec as defined below runs in approximately the same time:

def diag_vec(cor, assets):
    cor.loc[:, [(a, a) for a in assets]] = 1

For hardcore optimization, you could exploit the fact that your DataFrame holds a single dtype: in that case cor.values is a NumPy view into the frame's values, and you can get away with something like this:

def diag_monotype(cor, assets):
    assert (cor.dtypes[1:] == cor.dtypes[0]).all()
    col_indices = cor.columns.get_indexer([(a,a) for a in assets])
    cor.values[:, col_indices] = 1.

NumPy doesn't do implicit type conversion here, so the assignment works as expected and is blazingly fast. Beware, though:

  • if cor contains different dtypes, then cor.values is a temporary rather than a view and such assignments will not propagate back into cor
  • date/time handling in pandas is far better than in numpy and working directly with numpy datetime/timedelta containers will require even more hacks on top of that.
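
The first caveat can be checked directly (a sketch with a made-up frame; with mixed dtypes, .values has to materialize a single consolidated array, so writes to it are lost):

```python
import pandas as pd

# mixed dtypes: .values must consolidate into one object array,
# so it returns a temporary copy rather than a view
df = pd.DataFrame({'x': [1.0, 2.0], 'y': ['a', 'b']})

vals = df.values
vals[0, 0] = 99.0

# the write went into the temporary copy, not the frame
print(df.loc[0, 'x'])   # still 1.0
```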


immerrr commented Jun 18, 2014

Oh, and it may not be obvious from my answer that the immediate fix for your code is to change the constant 1 (an int) to 1.0 (a float).
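
With that one-character change the loop no longer triggers a dtype change (a sketch of the fixed function; written with .loc rather than the original .ix, since .ix has since been removed from pandas):

```python
def diag(cor, assets):
    for asset in assets:
        # 1.0 matches the frame's float64 dtype, so the assignment
        # writes in place instead of reallocating the column
        cor.loc[:, (asset, asset)] = 1.0
```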


jreback commented Jun 18, 2014

I only see a very minor degradation (6 ms -> 12 ms);
not sure what you are timing.

In [41]: pd.__version__
Out[41]: '0.14.0'

In [42]: %timeit diag(cor,assets)
100 loops, best of 3: 17.9 ms per loop

Here's a better way

In [43]: %timeit cor.ix[:,[ (asset,asset) for asset in assets ]] = 1
100 loops, best of 3: 12 ms per loop

Edit:

yes mine includes the amortized effect of the type conversion


jreback commented Jun 18, 2014

This is included in all timings

In [88]: %timeit cor.copy()
10 loops, best of 3: 65.6 ms per loop
In [78]: def f1(x):
   ....:     cor = x.copy()
   ....:     cor.ix[:,[ (asset,asset) for asset in assets ]] = 1
   ....:     

In [79]: def f2(x):
   ....:     cor = x.copy()
   ....:     cor.ix[:,[ (asset,asset) for asset in assets ]] = 1.0
   ....:     

In [80]: def f3(x):
   ....:     cor = x.copy()
   ....:     for asset in assets:
   ....:         cor.ix[:,(asset,asset)] = 1
   ....:         

In [81]: def f4(x):
   ....:     cor = x.copy()
   ....:     cor[[ (asset,asset) for asset in assets ]] = 1.0
   ....:     
In [86]: def f5(x):
   ....:     cor = x.copy()
   ....:     cor[[ (asset,asset) for asset in assets ]] = 1
   ....:     

In [82]: %timeit f1(cor)
1 loops, best of 3: 2.42 s per loop

In [83]: %timeit f2(cor)
10 loops, best of 3: 76.7 ms per loop

In [84]: %timeit f3(cor)
1 loops, best of 3: 2.5 s per loop

In [85]: %timeit f4(cor)
10 loops, best of 3: 76.7 ms per loop

In [87]: %timeit f5(cor)
1 loops, best of 3: 2.45 s per loop

Assigning via .ix[:, columns] = is de facto the same as [columns] =,
so dtypes are by definition coerced (this makes assignment consistent). Note that a partial assignment (.ix[rows, columns] =) will NOT result in coercion (even if it's possible).
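
The same distinction is still visible in modern pandas (a sketch using .loc, since .ix was later removed; the exact coercion rules have varied across versions):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(4, 2), columns=['a', 'b'])

# full-column assignment replaces the column, coercing its dtype
df['a'] = 1
print(df['a'].dtype)    # int64

# partial assignment writes into the existing float block,
# so the int is cast to float and the dtype is preserved
df.loc[df.index[:2], 'b'] = 1
print(df['b'].dtype)    # float64
```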


jreback commented Jun 18, 2014

@davaco so the bottom line is that as a user, you MUST pay attention to dtype coercion and assignment, ESPECIALLY when using MultiIndexes.

It's not always obvious what the most performant path is. A straight loop is almost never the most efficient, especially with assignment: those are separate operations, and pandas does NOT see them all at once. A list-like, however, DOES let pandas see them all at once, so it is often much more performant.

I would encourage you to look at %prun if you find that performance is not what you expect.
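
Outside IPython, the same information is available from the standard-library profiler (a sketch with a made-up workload; %prun is essentially an IPython wrapper around cProfile):

```python
import cProfile
import io
import pstats

import numpy as np
import pandas as pd

def slow_assign(df, cols):
    for c in cols:
        df[c] = 1  # int into a float frame: dtype coercion each time

df = pd.DataFrame(np.random.rand(1000, 10), columns=list('abcdefghij'))

pr = cProfile.Profile()
pr.enable()
slow_assign(df, df.columns)
pr.disable()

# show the 5 most expensive calls by cumulative time
s = io.StringIO()
pstats.Stats(pr, stream=s).sort_stats('cumulative').print_stats(5)
print(s.getvalue())
```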

Pandas gains many new features with each release, and it is possible for certain cases to lose performance (though we have an extensive performance test suite for exactly this reason). That said, there are MANY paths through the indexing code, so it is quite tricky.


immerrr commented Jun 18, 2014

A list-like, however, DOES let pandas see them all at once, so it is often much more performant.

As a side note: it appears there is room for improvement here, if I read the debugger trace right. BlockManager.set(...) is called once per item even for list-like assignment, which causes N reallocations instead of just one when the dtypes don't match. I'm pretty sure the code was written anticipating vectorized assignment, so the optimization may be rather simple. I'll add this to my backlog.


davaco commented Jun 18, 2014

Guys, thanks bigtime for your prompt feedback. Very useful input.


jreback commented Jun 18, 2014

@immerrr will mark as a perf issue then

@jreback jreback added this to the 0.15.0 milestone Jun 18, 2014

immerrr commented Jul 1, 2014

Ok, I had a look at it a while ago. It's doable, but not simple.


jreback commented Sep 20, 2015

closing as stale

@jreback jreback closed this as completed Sep 20, 2015