Styler extremely slow #19917

Closed
N2ITN opened this issue Feb 26, 2018 · 16 comments · Fixed by jihwans/pandas#1 or #34863
Labels
IO HTML (read_html, to_html, Styler.apply, Styler.applymap), Performance (memory or execution speed performance), Styler (conditional formatting using DataFrame.style)
Milestone

Comments

@N2ITN
N2ITN commented Feb 26, 2018

Code Sample

def highlighter(col):
    '''
    Highlights cells in the columns named by the `highlight_map` keys
    wherever the matching flag column in `final` (the `highlight_map` value) == 1
    '''

    highlight_map = {
        'Service Type': 'ServiceType_Added',
        'Store #': 'Store_Added',
    }
    if col.name in highlight_map:
        return ['background-color: yellow' if final[highlight_map[col.name]][v] else '' for v in col.index.tolist()]
    else:
        return [''] * len(col)
    
styled = final.style.apply(highlighter)
styled.to_excel('highlights.xlsx', engine='openpyxl') 

Problem description

Here I have some conditional highlighting on a df with 18k rows. The issue is that despite preceding complex operations on the df (such as conditional merges and df.apply by row) taking ~300ms at the most, the Styler.apply part takes over two minutes. I realize this feature is in development but I am wondering if there is a way to make it faster or if this is a known issue.

@TomAugspurger
Contributor

Is it Styler.apply that's slow, or your function, or the writer? Have you profiled it to get a sense for where the time being spent?

Note that Styler builds up a list of tasks lazily. To actually run it you'll have to call _compute, so you'll want to profile something like

styled = final.style.apply(highlighter)._compute()

TomAugspurger added the Needs Info label (clarification about behavior needed to assess issue) Feb 26, 2018
@N2ITN
Author

N2ITN commented Feb 26, 2018

Good point.
Here are some times:

Function scanning the df and writing new values: final.apply(highlighter) takes 1.42s

The Styler by itself: final.style.apply(highlighter)._compute() takes 1m 28.3s

The Styler and openpyxl writer: final.style.apply(highlighter).to_excel('highlights.xlsx', engine='openpyxl') takes 2m 31s

@TomAugspurger
Contributor

final.style.apply(highlighter)._compute()

If you want, you can break that down into how much time is spent in pandas' Styler._apply vs. your function. I suspect the majority will be in your function, in which case there's nothing pandas can do to help ;)

For the to_excel part, I'd be curious to see a breakdown between pandas and whatever engine is doing the writing. I'm not familiar with that bit of the code though I'd guess most of the time will be spent in the writer engine.

@N2ITN
Author

N2ITN commented Feb 26, 2018

As stated above, unless I misunderstand, calling the function independently of the styler using df.apply(highlight) takes 1.42s as opposed to with the styler df.style.apply(highlight) which takes 1m 28s.

Both calls have the same amount of iteration over the original dataframe, the only difference being the latter creates a Style object.
Therefore I don't believe it is my function that is slowing the process.

@TomAugspurger
Contributor

Therefore I don't believe it is my function that is slowing the process.

Could you do some line profiling then? Or post a reproducible example? _update_ctx could be to blame, but it's hard to say for sure.

@TomAugspurger
Contributor

@N2ITN find any time to profile this?

@TomAugspurger
Contributor

Closing, let us know if you're able to profile things.

@itssimon

I'm experiencing the same issue and did some profiling.

I have a very small DataFrame:

In [4]: df.shape
Out[4]: (78, 4)

Just adding a text-align: left property takes almost 1 second:

In [5]: %%time
    ...: s = df.style.set_properties(**{'text-align': 'left'})
    ...: s.render();
    ...: 
CPU times: user 844 ms, sys: 68 ms, total: 912 ms
Wall time: 833 ms

I ran the line profiler and pasted the output here. As suspected most time is spent in _update_ctx().

@TomAugspurger Could you please reopen the issue? I'm happy to provide more info, just let me know what is helpful!

@TomAugspurger
Contributor

TomAugspurger commented Mar 30, 2019 via email

@TomAugspurger
Contributor

TomAugspurger commented Mar 30, 2019 via email

@itssimon

Yes, it seems that way. Unless you specify the subset parameter.

@TomAugspurger
Contributor

TomAugspurger commented Mar 30, 2019 via email

@mjh7

mjh7 commented Apr 19, 2019

Hi there,

I am still relatively new to this toolset (I'm loving pandas overall!), but I seem to be having the same issue with slow Stylers as other users in this thread... Just checking if there are any new developments?

In my medium-small df (6K rows by 10 cols), it takes almost 4 minutes to process a function that assigns css font-size using df.style.applymap(my_css_function), while df.applymap(my_css_function) is practically instantaneous. I can process particular slices faster, of course, but it's awkward that I can't re-use the result - it has to get re-computed every time and prints the entire structure with no 'head' method, etc.

For my purposes, the style idea is very relevant, but I'd prefer to define rules to control how whatever I'm looking at in my jupyter notebooks are rendered all the time, as opposed to working most of the time with one structure with limited readability and creating a separate structure (Styler) every time I want to see it more clearly/richly. I don't know how reasonable of a request that is, but unless I'm missing something or it gets a lot faster, the usefulness will be fairly low.

Thanks for your time!

Edit: Just found the "Limitations" section near the bottom of the documentation, it says: "No large repr, and performance isn’t great; this is intended for summary DataFrames". So maybe it's working as intended, formatting control and basic visualizations as a bonus as opposed to core functionality. I can support that. But maybe this "Limitations" section could be moved close to the top of the documentation page so people don't get the wrong idea?

@TomAugspurger
Contributor

TomAugspurger commented Apr 20, 2019 via email

@jihwans
Contributor

jihwans commented Jun 18, 2020

Here's a temporary solution until it gets fixed:

from pandas.io.formats.style import Styler

def _update_ctx(self, attrs):
    rows = [(row_label, v) for row_label, v in attrs.iterrows()]
    row_idx = self.index.get_indexer([x[0] for x in rows])
    for ii, row in enumerate(rows):
        i = row_idx[ii]
        cols = [(col_label, col) for col_label, col in row[1].items() if col]
        col_idx = self.columns.get_indexer([x[0] for x in cols])
        for jj, itm in enumerate(cols):
            j = col_idx[jj]
            for pair in itm[1].rstrip(";").split(";"):
                self.ctx[(i, j)].append(pair)

Styler._update_ctx = _update_ctx

The original code in the current version looks like this:

def _update_ctx(self, attrs: DataFrame) -> None:
    """
    Update the state of the Styler.

    Collects a mapping of {index_label: ['<property>: <value>']}.

    Parameters
    ----------
    attrs : DataFrame
        should contain strings of '<property>: <value>;<prop2>: <val2>'
        Whitespace shouldn't matter and the final trailing ';' shouldn't
        matter.
    """
    for row_label, v in attrs.iterrows():
        for col_label, col in v.items():
            i = self.index.get_indexer([row_label])[0]
            j = self.columns.get_indexer([col_label])[0]
            for pair in col.rstrip(";").split(";"):
                self.ctx[(i, j)].append(pair)

There were two reasons that it was slow:

  • get_indexer happens to be quite an expensive function and it was called rows x cols x 2 times
  • when there's no item to add, it was still iterating through all the rows and cols
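The first point can be looked at in isolation with a rough timing sketch like the one below. The index contents and size are arbitrary, and the numbers will vary by machine and pandas version, so no particular speedup is claimed.

```python
import time

import pandas as pd

# Arbitrary index of 2000 string labels, purely for illustration.
index = pd.Index([f"row{i}" for i in range(2000)])
labels = list(index)

# One get_indexer call per label, as in the original loop.
start = time.perf_counter()
one_by_one = [index.get_indexer([label])[0] for label in labels]
t_single = time.perf_counter() - start

# A single get_indexer call for all labels at once.
start = time.perf_counter()
batched = index.get_indexer(labels)
t_batched = time.perf_counter() - start

# Both approaches return the same positions.
assert one_by_one == batched.tolist()
print(f"per-label calls: {t_single:.4f}s, one batched call: {t_batched:.4f}s")
```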

I could not figure out:

  • how to skip the whole row if it contains nothing to add

In my app, rendering went from 20 seconds down to less than 2 seconds.
I'm not satisfied but it's a bit better than before.
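One possible way to skip rows that have nothing to add is to build a boolean mask of non-empty cells first and loop only over those. The sketch below is a standalone, hypothetical helper (`collect_styles` is not part of pandas, and it is not the code that was eventually merged). It takes the Styler's index and columns as plain arguments and returns a fresh mapping instead of mutating `self.ctx`.

```python
from collections import defaultdict

import pandas as pd


def collect_styles(attrs, index, columns):
    """Map (row_pos, col_pos) to CSS pairs, visiting non-empty cells only."""
    ctx = defaultdict(list)
    # Two batched lookups instead of two per cell.
    row_pos = index.get_indexer(attrs.index)
    col_pos = columns.get_indexer(attrs.columns)
    values = attrs.to_numpy(dtype=object)
    nonempty = attrs.ne("").to_numpy()
    # Rows with no style strings at all are never entered.
    for r in nonempty.any(axis=1).nonzero()[0]:
        for c in nonempty[r].nonzero()[0]:
            for pair in values[r, c].rstrip(";").split(";"):
                ctx[(int(row_pos[r]), int(col_pos[c]))].append(pair)
    return ctx


# Tiny made-up example: only one cell carries a style string.
data = pd.DataFrame({"x": [1, 2, 3], "y": [4, 5, 6]}, index=["a", "b", "c"])
attrs = pd.DataFrame("", index=data.index, columns=data.columns)
attrs.loc["b", "y"] = "color: red; font-weight: bold;"

ctx = collect_styles(attrs, data.index, data.columns)
print(dict(ctx))
```

Whether the mask pays off depends on how sparse the styles are; when every cell is styled it only adds overhead.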

jihwans added a commit to jihwans/pandas that referenced this issue Jun 18, 2020
jihwans added a commit to jihwans/pandas that referenced this issue Jun 23, 2020
jihwans added a commit to jihwans/pandas that referenced this issue Jun 24, 2020
@jihwans
Contributor

jihwans commented Jun 24, 2020

I made a pull request addressing this issue -- see #34863

     <v1.0.4^0>       <jihwans-patch-1>
-            178M             157M     0.88  io.style.RenderApply.peakmem_render(36, 1200)
-        69.4±1ms       19.2±0.1ms     0.28  io.style.RenderApply.time_render(12, 12)
-      6.44±0.01s       1.53±0.02s     0.24  io.style.RenderApply.time_render(12, 1200)
-         133±1ms       30.9±0.8ms     0.23  io.style.RenderApply.time_render(24, 12)
-        679±20ms          154±3ms     0.23  io.style.RenderApply.time_render(12, 120)
-         197±3ms         40.8±2ms     0.21  io.style.RenderApply.time_render(36, 12)
-      1.23±0.06s          237±5ms     0.19  io.style.RenderApply.time_render(24, 120)
-      1.82±0.02s          325±9ms     0.18  io.style.RenderApply.time_render(36, 120)
-      18.4±0.06s        3.13±0.1s     0.17  io.style.RenderApply.time_render(36, 1200)

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE INCREASED.
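The ratio column above is the patched time divided by the v1.0.4 time. A short sketch that recomputes it from the timings in the table (error margins dropped, names shortened) and converts each ratio into a speedup factor:

```python
# (before, after) in seconds, copied from the benchmark table above.
timings = {
    "time_render(12, 12)": (69.4e-3, 19.2e-3),
    "time_render(12, 120)": (679e-3, 154e-3),
    "time_render(12, 1200)": (6.44, 1.53),
    "time_render(24, 12)": (133e-3, 30.9e-3),
    "time_render(24, 120)": (1.23, 237e-3),
    "time_render(36, 12)": (197e-3, 40.8e-3),
    "time_render(36, 120)": (1.82, 325e-3),
    "time_render(36, 1200)": (18.4, 3.13),
}

# ratio = after / before; values below 1 mean the patch is faster.
ratios = {name: after / before for name, (before, after) in timings.items()}

for name, ratio in ratios.items():
    print(f"{name}: ratio {ratio:.2f}, about {1 / ratio:.1f}x faster")
```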

jreback removed the Needs Info label Jun 24, 2020
jreback added the IO HTML, Performance, and Styler labels Jun 24, 2020
jreback added this to the 1.1 milestone Jun 24, 2020
jihwans added a commit to jihwans/pandas that referenced this issue Jun 25, 2020
jihwans added a commit to jihwans/pandas that referenced this issue Jun 27, 2020
- experimental, 10% further improvement by eliminating get_indexer call

see pandas-dev#19917
6 participants