Skip to content

PERF: ArrowExtensionArray.fillna when array does not contains any nulls #51635

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 11 commits into from
Mar 17, 2023

Conversation

lukemanley
Copy link
Member

@lukemanley lukemanley commented Feb 25, 2023

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(1_000_000, 10), dtype="float64[pyarrow]")

%timeit df.fillna(0.0)
25 ms ± 5.42 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)        -> main
76.4 µs ± 4.43 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)  -> PR

@lukemanley lukemanley added Performance Memory or execution speed performance Arrow pyarrow functionality labels Feb 25, 2023
@phofl
Copy link
Member

phofl commented Feb 25, 2023

Could you test this on #51411 as well? This should address the problem for CoW

@lukemanley
Copy link
Member Author

Could you test this on #51411 as well? This should address the problem for CoW

I tested locally on top of your branch in #51411 and it seems to work. What "problem" are you referring to?

@phofl
Copy link
Member

phofl commented Feb 25, 2023

Performance, we are avoiding the copy on the block level there

@phofl phofl added this to the 2.1 milestone Feb 27, 2023
@lukemanley
Copy link
Member Author

@phofl - hmmm... clearly this doesn't play well with #51411

Not sure what I did yesterday to convince myself it did... sorry

@lukemanley
Copy link
Member Author

actually, what is the expected behavior for COW wrt pyarrow? Since arrow arrays are immutable, ArrowExtensionArray.copy doesn't actually copy the underlying pyarrow array. I think tests like this (which are failing) are expected to fail:

import pandas as pd
import numpy as np

arr = pd.array([1, 2, 3], dtype="int64[pyarrow]")
arr2 = arr.copy()

assert not np.shares_memory(arr, arr2)  # -> raises

@phofl
Copy link
Member

phofl commented Feb 27, 2023

CoW handles the copies globally on the block level, hence we can remove the check above.

You can adjust the tests, eas made to many defensive copies without CoW, you got rid of one of them, which is a good thing

@lukemanley
Copy link
Member Author

Ok will do. Is the comment you added still relevant? (# TODO(CoW): Not necessary anymore when CoW is the default)

I would think this fastpath can still apply after CoW is the default (e.g. for whatever reason CoW is not being used)

@phofl
Copy link
Member

phofl commented Feb 27, 2023

We are not planning on having a non CoW mode when CoW is the default

@mroeschke
Copy link
Member

Merge when ready @phofl

@mroeschke
Copy link
Member

Could you rebase @lukemanley

@mroeschke mroeschke merged commit 04ec93e into pandas-dev:main Mar 17, 2023
@mroeschke
Copy link
Member

Thanks @lukemanley

@lukemanley lukemanley deleted the arrow-fillna-no-nulls branch March 18, 2023 11:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Arrow pyarrow functionality Performance Memory or execution speed performance
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants