[Bug]: Creating sub-plots is much slower than Plotly #26162

Open
Hvass-Labs opened this issue Jun 21, 2023 · 13 comments

Comments

@Hvass-Labs

Bug summary

Creating sub-plots in Matplotlib is typically 4-12x slower than Plotly. This is not a bug per se, but a serious performance issue for time-critical applications such as interactive web-apps. There are several closed GitHub issues about the slowness of creating sub-plots that go back 6-7 years, but it's still a problem.

Code for reproduction

%matplotlib inline
from matplotlib.figure import Figure
from plotly.subplots import make_subplots
import numpy as np
import pandas as pd
import timeit

rng = np.random.default_rng()

results = []

for i in range(30):
    # Random number of rows and columns.
    rows, cols = rng.integers(low=1, high=20, size=2)
    
    # Plotly won't accept Numpy ints.
    rows = int(rows)
    cols = int(cols)

    # Total number of sub-plots.
    total = rows*cols

    # Timer.
    t1 = timeit.default_timer()

    # Matplotlib.
    fig_matplotlib = Figure()
    axs_matplotlib = fig_matplotlib.subplots(nrows=rows, ncols=cols)

    # Timer.
    t2 = timeit.default_timer()

    # Plotly.
    fig_plotly = make_subplots(rows=rows, cols=cols)

    # Timer.
    t3 = timeit.default_timer()
    
    # Time-usage.
    t_matplotlib = t2 - t1
    t_plotly = t3 - t2
    
    # Relative time-usage.
    t_relative = t_matplotlib / t_plotly

    # Save results.
    results.append(dict(rows=rows, cols=cols, total=total,
                        t_matplotlib=t_matplotlib, t_plotly=t_plotly,
                        t_relative=t_relative))

    # Show status.    
    print(f'{rows}\t{cols}\t{total}\t{t_relative:.3f}')

# Convert results to Pandas DataFrame.
df_results = pd.DataFrame(results)

# Plot relative time-usage.
df_results.plot(kind='scatter', x='total', y='t_relative', grid=True);

# Plot individual time-usage.
df2 = df_results.set_index('total').sort_index()
df2.plot(y=['t_matplotlib', 't_plotly'], grid=True, ylabel='seconds');

Actual outcome

In my actual application with 3 columns and 10 rows, the time-usage for Matplotlib is consistently around 1.8 seconds, but for some reason it is only around 0.5 seconds in these tests.

This plot shows the individual time-usage for Matplotlib and Plotly, where the x-axis is the total number of sub-plots (cols * rows):

image

Note the jagged lines for the Matplotlib time-usage. We could average several runs to make the lines smoother, but the trend is clear. The jaggedness itself is quite strange: the time changes a lot from run to run.
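A minimal sketch of the averaging idea mentioned above, assuming `timeit.repeat` is an acceptable way to repeat the measurement. The helper name `time_subplots` and the choice of 5 repeats are illustrative and not part of the original test. Reporting the minimum and the median instead of a single run should reduce the run-to-run noise.

```python
import statistics
import timeit

from matplotlib.figure import Figure


def time_subplots(rows, cols, repeats=5):
    """Return (min, median) seconds for building a rows x cols grid of sub-plots."""
    def build():
        # Same construction as in the test above, without drawing anything.
        Figure().subplots(nrows=rows, ncols=cols)

    # One construction per measurement, repeated `repeats` times.
    times = timeit.repeat(build, number=1, repeat=repeats)
    return min(times), statistics.median(times)


best, median = time_subplots(rows=10, cols=3)
print(f"min={best:.3f}s  median={median:.3f}s")
```

The minimum is usually the most stable estimate of the cost of the code itself, while the median shows what a typical run looks like.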

This plot shows the relative time-usage (Matplotlib time / Plotly time):

image

Expected outcome

I would like it to run like this - minus the crashes, please.

Additional information

Thanks again for making Matplotlib! I don't want to sound ungrateful or too demanding, as this is my second GitHub issue in a few days relating to the performance of using many sub-plots in Matplotlib. But these issues are major bottle-necks in my application that take around 90% of the runtime. I also wonder if perhaps the issues are related. (See #26150)

Is there a technical reason that Plotly is so much faster than Matplotlib when it comes to having sub-plots?

I imagine that Matplotlib has been made by many different people over a long period of time, so perhaps it is getting hard to understand what the code is doing sometimes?

Plotly runs very fast and is easy to use, but I have already made everything in Matplotlib, and I'm not even sure Plotly has all the features I need to customize the plots. So I'm hoping it would be possible to improve the speed of Matplotlib when using sub-plots.

Thanks!

Operating system

Kubuntu 22

Matplotlib Version

3.7.1

Matplotlib Backend

module://matplotlib_inline.backend_inline

Python version

3.9.12

Jupyter version

6.4.12 (through VSCode)

Installation

pip

@jklymak
Member

jklymak commented Jun 21, 2023

Are you triggering draws for each of these figure calls? It's hard to compare if there is no draw being made. I'd also do all this outside of VSCode, etc.

@oscargus
Member

I was inspired to do a small profiling with a 10 x 10 subplot and on my computer creating the Axes objects took 99.9% of the time. Of which, Axes.clear took about 60% of the time. This is basically setting up all the different attributes of each Axes (and Axis and Ticks and labels and ...). The private helpers _internal_helper and _init_axis took about 15% each.

Interesting to note is that although only 2 x 100 Axis objects were created, Axis.clear was called 1200 times, so six times for each Axis object. Those calls were about 25% of the total time, so this seems like a pretty obvious thing to look at.
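For anyone who wants to repeat this kind of measurement, here is a minimal sketch using the standard-library `cProfile` and `pstats` modules. It is not necessarily how the numbers above were obtained, and the percentages will differ between machines and Matplotlib versions.

```python
import cProfile
import pstats

from matplotlib.figure import Figure

# Profile only the creation of a 10 x 10 grid of sub-plots.
profiler = cProfile.Profile()
profiler.enable()
fig = Figure()
axs = fig.subplots(nrows=10, ncols=10)
profiler.disable()

stats = pstats.Stats(profiler).sort_stats("cumulative")

# The 15 most expensive entries by cumulative time.
stats.print_stats(15)

# Only entries whose "file:line(function)" label matches the regex "clear",
# which includes Axes.clear and Axis.clear along with their call counts.
stats.print_stats("clear")
```

The `ncalls` column of the second table is where the number of clear calls per object can be read off.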

@oscargus
Member

An obvious improvement may be to pass an optional argument to Spines.register_axis that determines if Axis.clear should be called (default) or not. In _init_axis the same Axis is registered twice and probably doesn't have to be cleared twice(?). (If at all?)
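Before changing anything, the number of redundant calls can be checked directly. The sketch below temporarily wraps `Axis.clear` with a counter. It does not implement the proposed optional argument, and the count it prints depends on the Matplotlib version, so no expected value is given here.

```python
from matplotlib.axis import Axis
from matplotlib.figure import Figure

calls = {"n": 0}
original_clear = Axis.clear


def counting_clear(self, *args, **kwargs):
    """Count each call, then delegate to the real Axis.clear."""
    calls["n"] += 1
    return original_clear(self, *args, **kwargs)


# Temporarily replace the method, and always restore it afterwards.
Axis.clear = counting_clear
try:
    fig = Figure()
    axs = fig.subplots(nrows=10, ncols=10)
finally:
    Axis.clear = original_clear

n_axis = 2 * axs.size  # One x-axis and one y-axis per Axes.
print(f"{calls['n']} Axis.clear calls for {n_axis} Axis objects "
      f"({calls['n'] / n_axis:.1f} per Axis)")
```

Running this before and after a candidate change shows whether the number of calls per Axis actually went down.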

@jklymak
Member

jklymak commented Jun 21, 2023

I think when we looked into these sorts of things in the past, it boiled down to the transform stack. @anntzer has some ideas to improve, but I'm not sure how much work they are.

@Hvass-Labs
Author

Hvass-Labs commented Jun 21, 2023

@jklymak The code above does not draw anything. You are right that it could be that Matplotlib calculates a lot of stuff "up-front" while Plotly defers a lot of its computations until we actually draw something.

The following repeats the tests above, but this time with random line-drawings as well, and the option to save the image to an IO-stream.

Note that this requires pip install kaleido as well as plotly.

%matplotlib inline
from matplotlib.figure import Figure
from plotly.subplots import make_subplots
import plotly.graph_objects as go
import numpy as np
import pandas as pd
import timeit
from io import BytesIO

rng = np.random.default_rng()

# Generate random data.
def rand_data(row, col, size=100):
    x = rng.normal(loc=row+1, scale=col+1, size=size)
    y = rng.normal(loc=col+1, scale=row+1, size=size)
    x = np.sort(x)
    return x, y

# Generate random line-plots using MATPLOTLIB.
def generate_matplotlib(axs, rows, cols):
    for row in range(rows):
        for col in range(cols):
            ax = axs[row, col]
            x, y = rand_data(row=row, col=col)
            ax.plot(x, y);

# Generate random line-plots using PLOTLY.
def generate_plotly(fig, rows, cols):
    for row in range(rows):
        for col in range(cols):
            x, y = rand_data(row=row, col=col)
            # Plotly Scatter can draw both scatter-plots, and lines as here.
            fig.add_trace(go.Scatter(x=x, y=y, mode='lines'),
                          row=row+1, col=col+1)

# String with save-format e.g. 'svg' or 'png'. `None` means no saving.
SAVE_FORMAT = None  # 'svg'

# Results for all test-runs.
results = []

for i in range(30):
    # Random number of rows and columns.
    rows, cols = rng.integers(low=1, high=20, size=2)
    
    # Plotly won't accept Numpy ints.
    rows = int(rows)
    cols = int(cols)

    # Total number of sub-plots.
    total = rows*cols

    # Timer.
    t1 = timeit.default_timer()

    # MATPLOTLIB.
    fig_matplotlib = Figure()
    axs_matplotlib = \
        fig_matplotlib.subplots(nrows=rows, ncols=cols, squeeze=False)
    generate_matplotlib(axs=axs_matplotlib, rows=rows, cols=cols)
    if SAVE_FORMAT is not None:
        stream_matplotlib = BytesIO()
        fig_matplotlib.savefig(stream_matplotlib, format=SAVE_FORMAT)

    # Timer.
    t2 = timeit.default_timer()

    # PLOTLY.
    fig_plotly = make_subplots(rows=rows, cols=cols)
    generate_plotly(fig=fig_plotly, rows=rows, cols=cols)
    if SAVE_FORMAT is not None:
        stream_plotly = BytesIO()
        fig_plotly.write_image(file=stream_plotly, format=SAVE_FORMAT)

    # Timer.
    t3 = timeit.default_timer()
    
    # Time-usage.
    t_matplotlib = t2 - t1
    t_plotly = t3 - t2
    
    # Relative time-usage.
    t_relative = t_matplotlib / t_plotly

    # Save results.
    results.append(dict(rows=rows, cols=cols, total=total,
                        t_matplotlib=t_matplotlib, t_plotly=t_plotly,
                        t_relative=t_relative))

    # Show status.    
    print(f'{rows}\t{cols}\t{total}\t{t_relative:.3f}')

# Convert results to Pandas DataFrame.
df_results = pd.DataFrame(results)

# Plot relative time-usage.
df_results.plot(kind='scatter', x='total', y='t_relative', grid=True);

# Plot individual time-usage.
df2 = df_results.set_index('total').sort_index()
df2.plot(y=['t_matplotlib', 't_plotly'], grid=True, ylabel='seconds');

Test 1: No Saving

This looks similar to the original test, but it could be that Plotly still doesn't actually draw anything, while Matplotlib does.

Note that before each of these tests I have made a full reset of the Jupyter kernel, and only ran the code above.

image

image
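To see where the Matplotlib time actually goes, the three stages can be timed separately. This is a rough sketch, assuming the Agg canvas is a reasonable stand-in for a draw. The 10 x 3 grid and the stage names are illustrative choices, not part of the tests above.

```python
import timeit

import numpy as np
from matplotlib.backends.backend_agg import FigureCanvasAgg
from matplotlib.figure import Figure

rng = np.random.default_rng(0)
rows, cols = 10, 3

# Stage 1: create the figure and the grid of sub-plots.
t0 = timeit.default_timer()
fig = Figure()
axs = fig.subplots(nrows=rows, ncols=cols, squeeze=False)

# Stage 2: add one random line to each sub-plot.
t1 = timeit.default_timer()
for ax in axs.flat:
    ax.plot(np.sort(rng.normal(size=100)), rng.normal(size=100))

# Stage 3: force an actual render with the Agg canvas.
t2 = timeit.default_timer()
canvas = FigureCanvasAgg(fig)
canvas.draw()
t3 = timeit.default_timer()

timings = {"create": t1 - t0, "plot": t2 - t1, "draw": t3 - t2}
for stage, seconds in timings.items():
    print(f"{stage}\t{seconds:.3f} s")
```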

Test 2: Saving as SVG

For smaller numbers of sub-plots Plotly looks to be faster, while it looks like Matplotlib starts to be faster for larger numbers of sub-plots.

image

image

Test 3: Saving as PNG

Again Plotly is faster for smaller numbers of sub-plots, but now Matplotlib is even faster for larger numbers of sub-plots. Apparently Plotly is slower at saving PNG compared to SVG-files, while Matplotlib apparently takes roughly the same time for PNG and SVG-files.

image

image

Comments

When doing an "end-to-end" comparison of the whole plotting process, the differences between Matplotlib and Plotly are much smaller. Hopefully there are still some gains to be had from optimizing Matplotlib as @oscargus pointed out (thanks!) because 1.8 seconds to generate 3x10 sub-plots is quite a long time for real-time applications.

Also note that I only started using Plotly yesterday. And although I've been using Matplotlib for many years, I would still consider myself a "noob" there as well :-) So whether these are fair apples-to-apples comparisons, I don't know. Perhaps we need to set options for the file-savers to make it completely fair.

Please experiment with the code above, if you have an idea for making a more fair comparison.

Thanks!

EDIT x 3: I keep writing Pyplot instead of Plotly :-)

@jklymak
Member

jklymak commented Jun 21, 2023

Phew that is a relief that we are not ridiculously slower. Of course getting faster would be great, and indeed some of what we do seems lower level than plotly so we could conceivably be faster.

For real time applications you maybe don't need to clear the whole figure and start again. You should be able to add and remove artists relatively cheaply: https://matplotlib.org/devdocs/users/explain/artists/performance.html

The blitting tutorial may also give you some ideas: https://matplotlib.org/devdocs/users/explain/animations/blitting.html#sphx-glr-users-explain-animations-blitting-py
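One possible reading of this suggestion, as a sketch: build the grid once, keep the line artists, and only swap their data on later updates. The 10 x 3 layout and the `update` helper are illustrative assumptions. This only helps if the figure can be kept alive between updates.

```python
import numpy as np
from matplotlib.figure import Figure

ROWS, COLS = 10, 3
rng = np.random.default_rng(0)

# Pay the sub-plot creation cost once.
fig = Figure()
axs = fig.subplots(nrows=ROWS, ncols=COLS, squeeze=False)

# One empty line per sub-plot, kept for later updates.
lines = [[ax.plot([], [])[0] for ax in row] for row in axs]


def update(datasets):
    """Replace the data of every line and rescale its axes."""
    for row in range(ROWS):
        for col in range(COLS):
            x, y = datasets[row][col]
            lines[row][col].set_data(x, y)
            ax = axs[row, col]
            ax.relim()            # Recompute data limits from the artists.
            ax.autoscale_view()   # Apply the new limits to the view.


# Example update with random data.
datasets = [[(np.arange(100), rng.normal(size=100)) for _ in range(COLS)]
            for _ in range(ROWS)]
update(datasets)
```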

@Hvass-Labs
Author

Sorry for startling you with the unfair performance comparison! :-)

Thanks to everyone for jumping on this so quickly! I am very grateful for that! I also took a peek at your PR, and I didn't understand any of it, but it's an impressive amount of changes you have made in a very short amount of time! :-)

Test 4: Saving as PNG + sharex=True

This is an additional test that shares the x-axis within each column of sub-plots. Sharing was found to cause a slow-down in #26150, which is confirmed here, and @tacaswell may have some ideas for a solution.

Compared to Test 3 above, it seems that Plotly benefits and actually gets faster when sharing the x-axis, while Matplotlib suffers a slow-down.

Changes to the code above:

# Matplotlib.
axs_matplotlib = \
    fig_matplotlib.subplots(nrows=rows, ncols=cols, squeeze=False, sharex=True)

# Plotly.
fig_plotly = make_subplots(rows=rows, cols=cols, shared_xaxes=True)

image

image

@Hvass-Labs
Author

@jklymak Thanks for the suggestion regarding blitting. However, I don't actually redraw the figures over and over - at least not in a traditional sense. The web-app is still under development, otherwise I could have shown you. But briefly explained, the user inputs various data as you would on any web-site, then they click a "process" button so the data gets sent to the web-server, which is state-less so it only gets the data the user just input and whatever data it needs to read from a database. Then it generates various plots using Matplotlib, and shows the results as SVG in a web-page that is sent back to the user's web-browser.

So perhaps "low-latency" plotting is a better description for what I need than "real-time" plotting, which may imply repeated and fast updating of a figure.
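A rough sketch of the Matplotlib part of such a request, to make the use-case concrete. The function name `render_svg` and the shape of `datasets` are hypothetical and this is not the actual web-app code. It only shows that every request pays the full cost of creating the sub-plots and saving the SVG.

```python
from io import BytesIO

import numpy as np
from matplotlib.figure import Figure


def render_svg(datasets, rows, cols):
    """Build a fresh figure for one request and return it as an SVG string."""
    fig = Figure()
    axs = fig.subplots(nrows=rows, ncols=cols, squeeze=False)

    # One (x, y) pair per sub-plot, in row-major order.
    for ax, (x, y) in zip(axs.flat, datasets):
        ax.plot(x, y)

    # Save to an in-memory stream instead of a file.
    stream = BytesIO()
    fig.savefig(stream, format="svg")
    return stream.getvalue().decode("utf-8")


# Example request with random data for 10 rows and 3 columns.
rng = np.random.default_rng(0)
datasets = [(np.arange(50), rng.normal(size=50)) for _ in range(30)]
svg = render_svg(datasets, rows=10, cols=3)
print(f"{len(svg)} characters of SVG")
```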

@jklymak
Member

jklymak commented Jun 22, 2023

I'll agree that it would be nice if we were faster.

OTOH, if you have to wait 3s for 100 plots, I can't help but think that is a small fraction of the time you will need to look at them all and understand what they are telling you. If it were me, I'd put effort into data reduction techniques.

@Hvass-Labs
Author

It would be easier to understand my use-case if I could show you what I am doing with Matplotlib, but unfortunately I can't right now. A total of 3 seconds latency would not be a problem - but one of my "features" takes 10 seconds to run, from when the user clicks the button until they see the results, so the user-experience is quite sluggish, and it's 90% Matplotlib because of the issues we've been discussing in the past few days. Data-reduction is unfortunately not possible.

Please consider it this way: You have over 1 million installs of Matplotlib per day. That's a tremendous number of users! If you can save 50% of the runtime through code optimizations, such as the ones in this thread, then it's not only a massive amount of runtime that is saved for all users, which could make them more productive. It would also save electricity for server-farms using Matplotlib, and it would make Matplotlib usable for more time-critical applications, which might help invent and develop all new kinds of tools.

So making Matplotlib run faster would be far more impactful in the world than just my personal needs.

@timhoffm
Member

timhoffm commented Jun 23, 2023

Of course making Matplotlib run faster would be nice. However, this is a hard issue. The original architecture has not been designed with the idea of several tens of subplots in mind. Optimizing performance while keeping full backward-compatibility is very challenging. I estimate that it'd need some hundred hours of focussed time for a developer familiar with the matplotlib codebase. Unfortunately, the intersection of people who would be able to do this and who have the capacity is nearly zero.

If you are interested, I can link a couple of issues and PRs on that topic. It's not that we have not looked into this, but there is no low-hanging fruit left here.

@jklymak
Member

jklymak commented Jun 23, 2023

If a fraction of the people who used Matplotlib contributed $5 a year we could hire an RSE or two to work on problems like this.

@Hvass-Labs
Author

You guys are making a valiant effort on Matplotlib!

Regarding funding of open source in general, the real problem is all the corporations and universities who use the software but don't contribute either manpower or money to its development. I lack polite words to describe that, and I think an effort should be made to "guilt" them into helping.

I have been working well over a year on my current project. It's just me without any funding. It is quite possible that it will be a failure and my time has been completely wasted. But if it is even modestly successful, then at the very top of my wish-list is to donate funds to improve the performance of Matplotlib. But this could easily be a year into the future, if at all.

I have done open source R&D for ... I don't know ... 15 years maybe, without being paid for it, and at tremendous expense to myself. My "famous" TensorFlow tutorials probably took me nearly 10-12 months to make, because there was very little information available at the time, which is of course why I made them. Hundreds of thousands of people learned AI from those. I probably made $2500 in donations from that - with $1000 coming from a single wealthy person. The rest of my R&D I just did on my own, and shared the results with everyone.

So I will help you financially when I get the chance, but for now I can't. But I still hope that you'll find some of these code optimizations worth your time and effort, as they will probably benefit many people. It looks like you managed to improve this issue already, so I'm excited to see the result!

5 participants