Skip to content

ASV: more for str.cat #22652

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 5 commits into from
Sep 14, 2018
Merged

ASV: more for str.cat #22652

merged 5 commits into from
Sep 14, 2018

Conversation

h-vetinari
Copy link
Contributor

Planning to do a clean-up of the Series.str.cat internals and wanted to expand ASV-coverage before doing so.

@pep8speaks
Copy link

Hello @h-vetinari! Thanks for submitting the PR.

@h-vetinari h-vetinari force-pushed the asv_str_cat branch 2 times, most recently from 1c00ed1 to 497a336 Compare September 9, 2018 21:18
@codecov
Copy link

codecov bot commented Sep 9, 2018

Codecov Report

Merging #22652 into master will not change coverage.
The diff coverage is n/a.

Impacted file tree graph

@@           Coverage Diff           @@
##           master   #22652   +/-   ##
=======================================
  Coverage   92.17%   92.17%           
=======================================
  Files         169      169           
  Lines       50708    50708           
=======================================
  Hits        46740    46740           
  Misses       3968     3968
Flag Coverage Δ
#multiple 90.58% <ø> (ø) ⬆️
#single 42.35% <ø> (ø) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 0976e12...67503c6. Read the comment docs.

@h-vetinari
Copy link
Contributor Author

@TomAugspurger

ping because you managed the CirlceCI transition from 1.0 to 2.0. It seems that new setup does not respect [ci skip] - I view the CI as a limited resource (appveyor recently fluctuating up to something like 12h waiting time), so [ci skip] can be useful for avoiding a CI run where it's pointless (like this PR, which changes nothing in code/docs; and PEP8 runs anyway).

@h-vetinari h-vetinari force-pushed the asv_str_cat branch 3 times, most recently from d90e164 to 2bccc65 Compare September 10, 2018 07:24
@TomAugspurger
Copy link
Contributor

TomAugspurger commented Sep 10, 2018 via email

@h-vetinari
Copy link
Contributor Author

h-vetinari commented Sep 10, 2018

@TomAugspurger

not sure where you saw it, but https://discuss.circleci.com/search?q=ci%20skip is pretty messy with lots of duplicates (due to their auto-closing policy for issues after a while with no activity). One of the closer ones seems to be https://discuss.circleci.com/t/ci-skip-does-not-work-in-pull-requests/5308

The 2.0 docs (https://circleci.com/docs/2.0/skip-build/) mention that

By default, CircleCI automatically builds a project whenever you push changes to a version control system (VCS). You can override this behavior by adding a [ci skip] or [skip ci] tag anywhere in a commit’s title or description.
[...]
Note: This feature is not supported for fork PRs.

which is weird, because it did work for 1.0 in this repo (side note: the docs for that don't exist anymore: https://circleci.com/docs/1.0/skip-a-build/).

On an unrelated note: would you care to comment about the PR itself? :)

@TomAugspurger
Copy link
Contributor

https://discuss.circleci.com/t/ci-skip-does-not-work-in-pull-requests/5308/10 is the one I saw.

Agreed it's unfortunate. If a maintainer is around you can ask them to cancel it :/

No strong thoughts on how the PR, other than a consideration of runtime. It seems like these should run pretty quickly.

@h-vetinari
Copy link
Contributor Author

@TomAugspurger

Agreed it's unfortunate. If a maintainer is around you can ask them to cancel it :/

Maintainer time is an even more precious resource than CI time, so I think I won't bother you guys. ;-)

No strong thoughts on how the PR, other than a consideration of runtime. It seems like these should run pretty quickly.

Here's the output of an ASV run against master. Of course, N (and the number of columns for others) could be tuned as desired, but I don't think it's so bad (and due to decrease a fair amount - hopefully - by an upcoming PR).

       before           after         ratio
     [0976e126]       [2e70311e]
            594ms            609ms     1.03  strings.Cat.time_cat(5, ',', '-', 0.0)
            594ms            609ms     1.03  strings.Cat.time_cat(5, ',', '-', 0.0001)
            594ms            594ms     1.00  strings.Cat.time_cat(5, ',', '-', 0.1)
            375ms            375ms     1.00  strings.Cat.time_cat(5, ',', None, 0.0)
            359ms            344ms     0.96  strings.Cat.time_cat(5, ',', None, 0.0001)
            297ms            312ms     1.05  strings.Cat.time_cat(5, ',', None, 0.1)
           failed            453ms      n/a  strings.Cat.time_cat(5, None, '-', 0.0)
            438ms            438ms     1.00  strings.Cat.time_cat(5, None, '-', 0.0001)
            453ms            453ms     1.00  strings.Cat.time_cat(5, None, '-', 0.1)
            344ms            359ms     1.05  strings.Cat.time_cat(5, None, None, 0.0)
            344ms            344ms     1.00  strings.Cat.time_cat(5, None, None, 0.0001)
           failed            297ms      n/a  strings.Cat.time_cat(5, None, None, 0.1)
         46.9±0ms       46.9±0.7ms     1.00  strings.Cat.time_cat(None, ',', '-', 0.0)
         54.7±4ms       46.9±0.7ms    ~0.86  strings.Cat.time_cat(None, ',', '-', 0.0001)
         54.7±3ms         62.5±4ms    ~1.14  strings.Cat.time_cat(None, ',', '-', 0.1)
         44.3±3ms         44.3±3ms     1.00  strings.Cat.time_cat(None, ',', None, 0.0)
         62.5±4ms         62.5±4ms     1.00  strings.Cat.time_cat(None, ',', None, 0.0001)
         70.3±1ms         78.1±1ms    ~1.11  strings.Cat.time_cat(None, ',', None, 0.1)
           failed       41.7±0.7ms      n/a  strings.Cat.time_cat(None, None, '-', 0.0)
       41.7±0.7ms         46.9±4ms    ~1.12  strings.Cat.time_cat(None, None, '-', 0.0001)
         54.7±0ms         54.7±0ms     1.00  strings.Cat.time_cat(None, None, '-', 0.1)
           failed         62.5±1ms      n/a  strings.Cat.time_cat(None, None, None, 0.0)
           failed         58.6±5ms      n/a  strings.Cat.time_cat(None, None, None, 0.0001)
           failed         54.7±0ms      n/a  strings.Cat.time_cat(None, None, None, 0.1)

param_names = ['others', 'sep', 'na_rep', 'na_frac']

def setup(self, others, sep, na_rep, na_frac):
N = int(5e5)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you see how we do this in other benchmarks? I think we usually just pick 100_000 so for consistency might be nice to reflect that here

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No problem, was trying to get the results near goal_time, but maybe I'm misinterpreting what that is supposed to be


def setup(self, others, sep, na_rep, na_frac):
N = int(5e5)
mask_gen = lambda: np.random.choice([True, False], N,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need the lambda here? Figured we could just use the returned array in the subsequent call

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it's essential IMO that each column gets a different mask to compound the NaNs (for the na_rep=None case) and have a reasonable similarity to real-life data.

self.s = Series(tm.makeStringIndex(N)).where(mask_gen())
self.others = (DataFrame({i: tm.makeStringIndex(N).where(mask_gen())
for i in range(others)})
if others is not None else None)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any reason you don't just use 0 instead of None as the parametrized argument? Would get rid of the need for a condition here

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed to 0 for consistency/legibility, but still need the condition, as pd.DataFrame({'i' : ... for i in range(0)}) would yield a DataFrame (but str.cat needs others=None to self-concatenate).

@WillAyd WillAyd added the Benchmark Performance (ASV) benchmarks label Sep 11, 2018
Copy link
Contributor Author

@h-vetinari h-vetinari left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the review.

param_names = ['others', 'sep', 'na_rep', 'na_frac']

def setup(self, others, sep, na_rep, na_frac):
N = int(5e5)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No problem, was trying to get the results near goal_time, but maybe I'm misinterpreting what that is supposed to be


def setup(self, others, sep, na_rep, na_frac):
N = int(5e5)
mask_gen = lambda: np.random.choice([True, False], N,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it's essential IMO that each column gets a different mask to compound the NaNs (for the na_rep=None case) and have a reasonable similarity to real-life data.

self.s = Series(tm.makeStringIndex(N)).where(mask_gen())
self.others = (DataFrame({i: tm.makeStringIndex(N).where(mask_gen())
for i in range(others)})
if others is not None else None)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed to 0 for consistency/legibility, but still need the condition, as pd.DataFrame({'i' : ... for i in range(0)}) would yield a DataFrame (but str.cat needs others=None to self-concatenate).

@h-vetinari h-vetinari force-pushed the asv_str_cat branch 2 times, most recently from c91f374 to 67503c6 Compare September 11, 2018 05:56
Copy link
Member

@WillAyd WillAyd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Request on readability otherwise lgtm

self.s = Series(tm.makeStringIndex(N)).where(mask_gen())
self.others = (DataFrame({i: tm.makeStringIndex(N).where(mask_gen())
for i in range(other_cols)})
if other_cols > 0 else None)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Better but I think the condition inside the comprehension just makes this tougher to read than it needs to be. Would rather take the condition out of the comprehension and as a subsequent statement evaluate the truthiness of the expression, assigning None where False

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed - hope I understood your request correctly.

Copy link
Contributor Author

@h-vetinari h-vetinari left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thx, changed

self.s = Series(tm.makeStringIndex(N)).where(mask_gen())
self.others = (DataFrame({i: tm.makeStringIndex(N).where(mask_gen())
for i in range(other_cols)})
if other_cols > 0 else None)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed - hope I understood your request correctly.

@h-vetinari h-vetinari force-pushed the asv_str_cat branch 2 times, most recently from 4d84816 to 862abe2 Compare September 13, 2018 10:34
@jreback jreback added this to the 0.24.0 milestone Sep 13, 2018
Copy link
Member

@WillAyd WillAyd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@h-vetinari
Copy link
Contributor Author

@WillAyd
FWIW, @TomAugspurger said above:

No strong thoughts on how the PR, other than a consideration of runtime. It seems like these should run pretty quickly.

@TomAugspurger TomAugspurger merged commit 6da5a72 into pandas-dev:master Sep 14, 2018
@TomAugspurger
Copy link
Contributor

Thanks!

@h-vetinari h-vetinari deleted the asv_str_cat branch September 15, 2018 16:39
aeltanawy pushed a commit to aeltanawy/pandas that referenced this pull request Sep 20, 2018
Sup3rGeo pushed a commit to Sup3rGeo/pandas that referenced this pull request Oct 1, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Benchmark Performance (ASV) benchmarks
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants