
Performance degradation in pandas 0.13/0.14 #7493


Closed
davaco opened this issue Jun 18, 2014 · 10 comments
Labels
Indexing Related to indexing on series/frames, not to indexes themselves Internals Related to non-user accessible pandas implementation Performance Memory or execution speed performance

Comments

davaco commented Jun 18, 2014

I reported previously on issue #7208. It was noted that .ix was slower as of pandas 0.13, but that this should only be noticeable in non-vectorized code. However, I sometimes have trouble vectorizing everything. Please consider the following code, in which I have a rather large correlation matrix with multi-indexed columns and would like to set the diagonal elements equal to 1.

import string

import numpy as np
import pandas as pd

def diag(cor, assets):
    for asset in assets:
        cor.ix[:, (asset, asset)] = 1

# create a multi-indexed column axis like ('A', 'A'), ('A', 'B'), ...
assets = list(string.ascii_uppercase)
columns = pd.MultiIndex.from_tuples([(a, b) for a in assets for b in assets])

# create the correlation matrix: 10000 rows x 26 * 26 = 676 columns
cor = pd.DataFrame(np.random.rand(10000, 676),
                   index=pd.date_range('1977/1/1', periods=10000, freq='D'),
                   columns=columns)

%time diag(cor, assets)

On my machine this takes approximately 9 ms in pandas 0.12, but 5.3 seconds in pandas 0.14 (!).
Perhaps the code above could be vectorized; if so, I would be curious how. My concern, however, is that I cannot always vectorize, and in those cases pandas seems to get slower with each new version from 0.11 onwards. Any comments or help will be greatly appreciated!


immerrr commented Jun 18, 2014

One possible reason is that the assignment changes dtype: np.random.rand returns floats, and you are assigning integers to whole columns, which causes reallocations in the DataFrame's storage. All subsequent runs of %time diag(cor, assets) are back at the millisecond level. So the real question here is whether pandas should coerce columns during whole-column assignment, and if it should, why it doesn't.
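
The dtype change described above is easy to reproduce in isolation (a minimal sketch with a made-up frame; plain `df[col] = scalar` assignment replaces the whole column, dtype and all):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(5, 2), columns=['a', 'b'])
print(df.dtypes['a'])   # float64

# whole-column assignment with an int replaces the column,
# changing its dtype and forcing a new block allocation
df['a'] = 1
print(df.dtypes['a'])   # int64

# assigning a float keeps the dtype unchanged
df['b'] = 1.0
print(df.dtypes['b'])   # float64
```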

Also, vectorization will probably not help here, diag_vec as defined below runs in approximately the same time:

def diag_vec(cor, assets):
    cor.loc[:, [(a, a) for a in assets]] = 1

For hardcore optimization, you could exploit the fact that your DataFrame holds a single dtype: in that case cor.values is a NumPy view into the frame's values, and you can get away with something like this:

def diag_monotype(cor, assets):
    assert (cor.dtypes[1:] == cor.dtypes[0]).all()
    col_indices = cor.columns.get_indexer([(a,a) for a in assets])
    cor.values[:, col_indices] = 1.

NumPy doesn't do implicit type conversion here, so the assignment works as expected and is blazingly fast. Beware, though:

  • if cor contains different dtypes, then cor.values is a temporary rather than a view and such assignments will not propagate back into cor
  • date/time handling in pandas is far better than in numpy and working directly with numpy datetime/timedelta containers will require even more hacks on top of that.
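
The first caveat can be checked directly (a sketch with a made-up frame; with mixed dtypes, .values has to materialize a single consolidated array, so writes to it are lost):

```python
import pandas as pd

# mixed dtypes: .values must consolidate into one object array,
# so it returns a temporary copy rather than a view
df = pd.DataFrame({'x': [1.0, 2.0], 'y': ['a', 'b']})

vals = df.values
vals[0, 0] = 99.0

# the write went into the temporary copy, not the frame
print(df.loc[0, 'x'])   # still 1.0
```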


immerrr commented Jun 18, 2014

Oh, and it may not be obvious from my answer that the immediate fix for your code is to change the constant 1 (an int) to 1.0 (a float).
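
With that one-character change the loop no longer triggers a dtype change (a sketch of the fixed function; written with .loc rather than the original .ix, since .ix has since been removed from pandas):

```python
def diag(cor, assets):
    for asset in assets:
        # 1.0 matches the frame's float64 dtype, so the assignment
        # writes in place instead of reallocating the column
        cor.loc[:, (asset, asset)] = 1.0
```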


jreback commented Jun 18, 2014

I only see a very minor degradation (6 ms -> 12 ms);
not sure what you are timing.

In [41]: pd.__version__
Out[41]: '0.14.0'

In [42]: %timeit diag(cor,assets)
100 loops, best of 3: 17.9 ms per loop

Here's a better way

In [43]: %timeit cor.ix[:,[ (asset,asset) for asset in assets ]] = 1
100 loops, best of 3: 12 ms per loop

Edit:

yes mine includes the amortized effect of the type conversion


jreback commented Jun 18, 2014

This is included in all timings

In [88]: %timeit cor.copy()
10 loops, best of 3: 65.6 ms per loop
In [78]: def f1(x):
   ....:     cor = x.copy()
   ....:     cor.ix[:,[ (asset,asset) for asset in assets ]] = 1
   ....:     

In [79]: def f2(x):
   ....:     cor = x.copy()
   ....:     cor.ix[:,[ (asset,asset) for asset in assets ]] = 1.0
   ....:     

In [80]: def f3(x):
   ....:     cor = x.copy()
   ....:     for asset in assets:
   ....:         cor.ix[:,(asset,asset)] = 1
   ....:         

In [81]: def f4(x):
   ....:     cor = x.copy()
   ....:     cor[[ (asset,asset) for asset in assets ]] = 1.0
   ....:     
In [86]: def f5(x):
   ....:     cor = x.copy()
   ....:     cor[[ (asset,asset) for asset in assets ]] = 1
   ....:     

In [82]: %timeit f1(cor)
1 loops, best of 3: 2.42 s per loop

In [83]: %timeit f2(cor)
10 loops, best of 3: 76.7 ms per loop

In [84]: %timeit f3(cor)
1 loops, best of 3: 2.5 s per loop

In [85]: %timeit f4(cor)
10 loops, best of 3: 76.7 ms per loop

In [87]: %timeit f5(cor)
1 loops, best of 3: 2.45 s per loop

Assigning via .ix[:, columns] = is de facto the same as [columns] =,
so dtypes are by definition coerced (this makes assignment consistent). Note that a partial assignment (.ix[rows, columns] =) will NOT result in coercion (even if it's possible).
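
The same distinction is still visible in modern pandas (a sketch using .loc, since .ix was later removed; the exact coercion rules have varied across versions):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(4, 2), columns=['a', 'b'])

# full-column assignment replaces the column, coercing its dtype
df['a'] = 1
print(df['a'].dtype)    # int64

# partial assignment writes into the existing float block,
# so the int is cast to float and the dtype is preserved
df.loc[df.index[:2], 'b'] = 1
print(df['b'].dtype)    # float64
```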


jreback commented Jun 18, 2014

@davaco so the bottom line is that as a user, you MUST pay attention to dtype coercion and assignment, ESPECIALLY when using MultiIndexes.

It's not always obvious what the most performant path is. A straight loop is almost never the most efficient, especially with assignment: those are separate operations, and pandas does NOT see them all at once. A list-like, however, DOES let pandas see them all at once, so it is often much more performant.

I would encourage you to look at %prun if you find that performance is not what you expect.
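
Outside IPython, the same information is available from the standard-library profiler (a sketch with a made-up workload; %prun is essentially an IPython wrapper around cProfile):

```python
import cProfile
import io
import pstats

import numpy as np
import pandas as pd

def slow_assign(df, cols):
    for c in cols:
        df[c] = 1  # int into a float frame: dtype coercion each time

df = pd.DataFrame(np.random.rand(1000, 10), columns=list('abcdefghij'))

pr = cProfile.Profile()
pr.enable()
slow_assign(df, df.columns)
pr.disable()

# show the 5 most expensive calls by cumulative time
s = io.StringIO()
pstats.Stats(pr, stream=s).sort_stats('cumulative').print_stats(5)
print(s.getvalue())
```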

Pandas gains many new features with each release, and it is possible for certain cases to lose performance (though we have an extensive performance test suite for exactly this reason). That said, there are MANY paths through the indexing code, so it is quite tricky.


immerrr commented Jun 18, 2014

A list-like, however, DOES let pandas see them all at once, so it is often much more performant.

As a side note: it appears there is room for improvement here, if I read the debugger trace right. BlockManager.set(...) is called once per item even for list-like assignment, which causes N reallocations instead of just one when the dtypes don't match. I'm pretty sure the code was written anticipating vectorized assignment, so the optimization may be rather simple. I'll add this to my backlog.


davaco commented Jun 18, 2014

Guys, thanks bigtime for your prompt feedback. Very useful input.


jreback commented Jun 18, 2014

@immerrr will mark as a perf issue then

@jreback jreback added this to the 0.15.0 milestone Jun 18, 2014

immerrr commented Jul 1, 2014

Ok, I had a look at it a while ago. It's doable, but not simple.


jreback commented Sep 20, 2015

closing as stale

@jreback jreback closed this as completed Sep 20, 2015