Performance degradation in pandas 0.13/0.14 #7493
One possible reason is that the assignment changes dtype. Also, vectorization will probably not help here:

```python
def diag_vec(cor, assets):
    cor.loc[:, [(a, a) for a in assets]] = 1
```

For hardcore optimizations, you could probably use the fact that your dataframe is monotype to your advantage, in this case:

```python
def diag_monotype(cor, assets):
    assert (cor.dtypes[1:] == cor.dtypes[0]).all()
    col_indices = cor.columns.get_indexer([(a, a) for a in assets])
    cor.values[:, col_indices] = 1.
```

NumPy doesn't do implicit type conversions, so it'll work as expected and do it blazingly fast. Beware though:
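To make the monotype trick concrete, here is a small self-contained sketch (asset names and sizes are invented for illustration). It writes through a handle on the underlying NumPy array, which is what `cor.values` resolves to for a single-dtype frame on classic pandas:

```python
import numpy as np
import pandas as pd

# Toy stand-in for the correlation frame: MultiIndex columns over
# (asset, asset) pairs, every column float64 (monotype).
assets = ["A", "B", "C"]
cols = pd.MultiIndex.from_product([assets, assets])
arr = np.zeros((4, len(cols)))
cor = pd.DataFrame(arr, columns=cols)

# get_indexer maps the diagonal labels (a, a) to positional indices once,
# instead of resolving each label on every assignment.
col_indices = cor.columns.get_indexer([(a, a) for a in assets])

# Positional write in NumPy: no label lookups, and the float literal 1.
# already matches the existing dtype, so no conversion is triggered.
arr[:, col_indices] = 1.
```

For a single-dtype frame, classic pandas returns `.values` as a view of this same block array, which is why the thread's `cor.values[:, col_indices] = 1.` works; with mixed dtypes `.values` materializes a copy and the write would be silently lost, hence the dtype `assert` in the snippet above.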
Oh, and it may not be obvious from my answer that the immediate fix for your code would be to change the constant.
I only see a very minor (6 -> 12 ms) degradation.
Here's a better way
Edit: yes, mine includes the amortized effect of the type conversion.
This is included in all timings
Assigning via
@davaco so bottom line is, as a user you MUST pay attention to dtype coercion and assignment, ESPECIALLY when using multi-indexes. It's not always obvious what the most performant path is. Almost always a straight loop is NOT the most efficient, especially with assignment. These are separate operations and pandas DOES NOT see them all at once. A list-like, however, DOES result in pandas seeing them all at once, so oftentimes it is much more performant. I would encourage you to look at

Pandas has many upgrades in features for new versions; it is possible that certain cases can lose performance (though we have an extensive performance test suite for exactly this reason). That said, there are MANY paths through the indexing code, so it is quite tricky.
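The loop-vs-list-like point can be illustrated with a toy frame (sizes invented; on the 0.13/0.14 frames in this thread the gap was dramatic, here only the mechanism is shown). Both forms produce the same result, but the list-like form hands pandas all target columns in one indexing call:

```python
import numpy as np
import pandas as pd

assets = ["A", "B", "C"]
cols = pd.MultiIndex.from_product([assets, assets])

# One assignment per asset: pandas resolves labels, checks dtypes and
# touches its internal blocks once per iteration.
loop_df = pd.DataFrame(np.zeros((4, len(cols))), columns=cols)
for a in assets:
    loop_df.loc[:, (a, a)] = 1.

# One list-like assignment: pandas sees every target column at once.
batch_df = pd.DataFrame(np.zeros((4, len(cols))), columns=cols)
batch_df.loc[:, [(a, a) for a in assets]] = 1.

assert loop_df.equals(batch_df)
```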
As a sidenote: it appears that there's potential for improvement, if I read the debugger trace right.
Guys, thanks bigtime for your prompt feedback. Very useful input.
@immerrr will mark as a perf issue then
Ok, I had a look at it a while ago. It's doable, but not simple. |
closing as stale |
I reported previously on issue #7208. It was noted there that .ix was slower as of pandas 0.13, but that this should only be noticeable in non-vectorized code. Sometimes, however, I have trouble vectorizing everything. Please consider the following code, in which I have a rather big correlation matrix using multi-indexed columns. I would like to set the diagonal elements equal to 1.
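The snippet itself did not survive in this copy of the thread; from the description (MultiIndex columns, a non-vectorized per-asset assignment via .ix), the pattern was presumably something like the following sketch, written with `.loc` since `.ix` has since been removed from pandas (names and sizes are invented):

```python
import numpy as np
import pandas as pd

# Invented stand-in for the reported setup: a frame with MultiIndex
# columns over (asset, asset) pairs.
assets = ["A", "B", "C", "D"]
cols = pd.MultiIndex.from_product([assets, assets])
cor = pd.DataFrame(np.zeros((10, len(cols))), columns=cols)

def diag_loop(cor, assets):
    # One labelled assignment per asset -- the non-vectorized pattern
    # whose per-call overhead grew between pandas 0.12 and 0.14.
    for a in assets:
        cor.loc[:, (a, a)] = 1

diag_loop(cor, assets)
```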
On my machine this takes approx. 9 ms in pandas 0.12, but 5.3 seconds in pandas 0.14 (!).
Maybe the above code could be vectorized; if so, I would be curious how. However, my concern is that I cannot always vectorize, in which case pandas seems to lose performance each time I move to the next version, from 0.11 onwards. Any comments / help will be greatly appreciated!