ASV: more for str.cat #22652

h-vetinari · 2018-09-09T20:42:12Z

Planning to do a clean-up of the Series.str.cat internals and wanted to expand ASV-coverage before doing so.

pep8speaks · 2018-09-09T20:42:13Z

Hello @h-vetinari! Thanks for submitting the PR.

There are no PEP8 issues in the file asv_bench/benchmarks/strings.py !

codecov · 2018-09-09T21:18:36Z

Codecov Report

Merging #22652 into master will not change coverage.
The diff coverage is n/a.

@@           Coverage Diff           @@
##           master   #22652   +/-   ##
=======================================
  Coverage   92.17%   92.17%           
=======================================
  Files         169      169           
  Lines       50708    50708           
=======================================
  Hits        46740    46740           
  Misses       3968     3968

Flag	Coverage Δ
#multiple	`90.58% <ø> (ø)`	⬆️
#single	`42.35% <ø> (ø)`	⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 0976e12...67503c6. Read the comment docs.

h-vetinari · 2018-09-09T21:33:01Z

@TomAugspurger

ping because you managed the CirlceCI transition from 1.0 to 2.0. It seems that new setup does not respect [ci skip] - I view the CI as a limited resource (appveyor recently fluctuating up to something like 12h waiting time), so [ci skip] can be useful for avoiding a CI run where it's pointless (like this PR, which changes nothing in code/docs; and PEP8 runs anyway).

TomAugspurger · 2018-09-10T10:58:46Z

Last I saw CircleCI had an issue open for ci-skip. Not sure what the status is.

…

On Sun, Sep 9, 2018 at 4:33 PM h-vetinari ***@***.***> wrote: @TomAugspurger <https://github.com/TomAugspurger> ping because you managed the CirlceCI transition from 1.0 to 2.0. It seems that new setup does not respect [ci skip] - I view the CI as a limited resource (appveyor recently fluctuating up to something like 12h waiting time), so [ci skip] can be useful for avoiding a CI run where it's pointless (like this PR, which changes nothing in code/docs; and PEP8 runs anyway). — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#22652 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/ABQHImB2sLu_oynoGv8HnqLkxQhR88kEks5uZYkRgaJpZM4Wgcb6> .

h-vetinari · 2018-09-10T16:02:28Z

@TomAugspurger

not sure where you saw it, but https://discuss.circleci.com/search?q=ci%20skip is pretty messy with lots of duplicates (due to their auto-closing policy for issues after a while with no activity). One of the closer ones seems to be https://discuss.circleci.com/t/ci-skip-does-not-work-in-pull-requests/5308

The 2.0 docs (https://circleci.com/docs/2.0/skip-build/) mention that

By default, CircleCI automatically builds a project whenever you push changes to a version control system (VCS). You can override this behavior by adding a [ci skip] or [skip ci] tag anywhere in a commit’s title or description.
[...]
Note: This feature is not supported for fork PRs.

which is weird, because it did work for 1.0 in this repo (side note: the docs for that don't exist anymore: https://circleci.com/docs/1.0/skip-a-build/).

On an unrelated note: would you care to comment about the PR itself? :)

TomAugspurger · 2018-09-10T16:09:17Z

https://discuss.circleci.com/t/ci-skip-does-not-work-in-pull-requests/5308/10 is the one I saw.

Agreed it's unfortunate. If a maintainer is around you can ask them to cancel it :/

No strong thoughts on how the PR, other than a consideration of runtime. It seems like these should run pretty quickly.

h-vetinari · 2018-09-10T22:28:50Z

@TomAugspurger

Agreed it's unfortunate. If a maintainer is around you can ask them to cancel it :/

Maintainer time is an even more precious resource than CI time, so I think I won't bother you guys. ;-)

No strong thoughts on how the PR, other than a consideration of runtime. It seems like these should run pretty quickly.

Here's the output of an ASV run against master. Of course, N (and the number of columns for others) could be tuned as desired, but I don't think it's so bad (and due to decrease a fair amount - hopefully - by an upcoming PR).

       before           after         ratio
     [0976e126]       [2e70311e]
            594ms            609ms     1.03  strings.Cat.time_cat(5, ',', '-', 0.0)
            594ms            609ms     1.03  strings.Cat.time_cat(5, ',', '-', 0.0001)
            594ms            594ms     1.00  strings.Cat.time_cat(5, ',', '-', 0.1)
            375ms            375ms     1.00  strings.Cat.time_cat(5, ',', None, 0.0)
            359ms            344ms     0.96  strings.Cat.time_cat(5, ',', None, 0.0001)
            297ms            312ms     1.05  strings.Cat.time_cat(5, ',', None, 0.1)
           failed            453ms      n/a  strings.Cat.time_cat(5, None, '-', 0.0)
            438ms            438ms     1.00  strings.Cat.time_cat(5, None, '-', 0.0001)
            453ms            453ms     1.00  strings.Cat.time_cat(5, None, '-', 0.1)
            344ms            359ms     1.05  strings.Cat.time_cat(5, None, None, 0.0)
            344ms            344ms     1.00  strings.Cat.time_cat(5, None, None, 0.0001)
           failed            297ms      n/a  strings.Cat.time_cat(5, None, None, 0.1)
         46.9±0ms       46.9±0.7ms     1.00  strings.Cat.time_cat(None, ',', '-', 0.0)
         54.7±4ms       46.9±0.7ms    ~0.86  strings.Cat.time_cat(None, ',', '-', 0.0001)
         54.7±3ms         62.5±4ms    ~1.14  strings.Cat.time_cat(None, ',', '-', 0.1)
         44.3±3ms         44.3±3ms     1.00  strings.Cat.time_cat(None, ',', None, 0.0)
         62.5±4ms         62.5±4ms     1.00  strings.Cat.time_cat(None, ',', None, 0.0001)
         70.3±1ms         78.1±1ms    ~1.11  strings.Cat.time_cat(None, ',', None, 0.1)
           failed       41.7±0.7ms      n/a  strings.Cat.time_cat(None, None, '-', 0.0)
       41.7±0.7ms         46.9±4ms    ~1.12  strings.Cat.time_cat(None, None, '-', 0.0001)
         54.7±0ms         54.7±0ms     1.00  strings.Cat.time_cat(None, None, '-', 0.1)
           failed         62.5±1ms      n/a  strings.Cat.time_cat(None, None, None, 0.0)
           failed         58.6±5ms      n/a  strings.Cat.time_cat(None, None, None, 0.0001)
           failed         54.7±0ms      n/a  strings.Cat.time_cat(None, None, None, 0.1)

WillAyd · 2018-09-11T01:01:35Z

asv_bench/benchmarks/strings.py

+    param_names = ['others', 'sep', 'na_rep', 'na_frac']
+
+    def setup(self, others, sep, na_rep, na_frac):
+        N = int(5e5)


Can you see how we do this in other benchmarks? I think we usually just pick 100_000 so for consistency might be nice to reflect that here

No problem, was trying to get the results near goal_time, but maybe I'm misinterpreting what that is supposed to be

WillAyd · 2018-09-11T01:03:35Z

asv_bench/benchmarks/strings.py

+
+    def setup(self, others, sep, na_rep, na_frac):
+        N = int(5e5)
+        mask_gen = lambda: np.random.choice([True, False], N,


Do we need the lambda here? Figured we could just use the returned array in the subsequent call

Yes, it's essential IMO that each column gets a different mask to compound the NaNs (for the na_rep=None case) and have a reasonable similarity to real-life data.

WillAyd · 2018-09-11T01:04:30Z

asv_bench/benchmarks/strings.py

+        self.s = Series(tm.makeStringIndex(N)).where(mask_gen())
+        self.others = (DataFrame({i: tm.makeStringIndex(N).where(mask_gen())
+                                  for i in range(others)})
+                       if others is not None else None)


Any reason you don't just use 0 instead of None as the parametrized argument? Would get rid of the need for a condition here

Changed to 0 for consistency/legibility, but still need the condition, as pd.DataFrame({'i' : ... for i in range(0)}) would yield a DataFrame (but str.cat needs others=None to self-concatenate).

h-vetinari

Thanks for the review.

h-vetinari · 2018-09-11T05:44:36Z

asv_bench/benchmarks/strings.py

+    param_names = ['others', 'sep', 'na_rep', 'na_frac']
+
+    def setup(self, others, sep, na_rep, na_frac):
+        N = int(5e5)


No problem, was trying to get the results near goal_time, but maybe I'm misinterpreting what that is supposed to be

h-vetinari · 2018-09-11T05:45:59Z

asv_bench/benchmarks/strings.py

+
+    def setup(self, others, sep, na_rep, na_frac):
+        N = int(5e5)
+        mask_gen = lambda: np.random.choice([True, False], N,


Yes, it's essential IMO that each column gets a different mask to compound the NaNs (for the na_rep=None case) and have a reasonable similarity to real-life data.

h-vetinari · 2018-09-11T05:47:33Z

asv_bench/benchmarks/strings.py

+        self.s = Series(tm.makeStringIndex(N)).where(mask_gen())
+        self.others = (DataFrame({i: tm.makeStringIndex(N).where(mask_gen())
+                                  for i in range(others)})
+                       if others is not None else None)


Changed to 0 for consistency/legibility, but still need the condition, as pd.DataFrame({'i' : ... for i in range(0)}) would yield a DataFrame (but str.cat needs others=None to self-concatenate).

WillAyd

Request on readability otherwise lgtm

WillAyd · 2018-09-13T09:53:00Z

asv_bench/benchmarks/strings.py

+        self.s = Series(tm.makeStringIndex(N)).where(mask_gen())
+        self.others = (DataFrame({i: tm.makeStringIndex(N).where(mask_gen())
+                                  for i in range(other_cols)})
+                       if other_cols > 0 else None)


Better but I think the condition inside the comprehension just makes this tougher to read than it needs to be. Would rather take the condition out of the comprehension and as a subsequent statement evaluate the truthiness of the expression, assigning None where False

Changed - hope I understood your request correctly.

h-vetinari

Thx, changed

h-vetinari · 2018-09-13T10:25:25Z

asv_bench/benchmarks/strings.py

+        self.s = Series(tm.makeStringIndex(N)).where(mask_gen())
+        self.others = (DataFrame({i: tm.makeStringIndex(N).where(mask_gen())
+                                  for i in range(other_cols)})
+                       if other_cols > 0 else None)


Changed - hope I understood your request correctly.

WillAyd

lgtm @TomAugspurger

h-vetinari · 2018-09-14T11:41:29Z

@WillAyd
FWIW, @TomAugspurger said above:

No strong thoughts on how the PR, other than a consideration of runtime. It seems like these should run pretty quickly.

TomAugspurger · 2018-09-14T12:44:02Z

Thanks!

h-vetinari force-pushed the asv_str_cat branch 2 times, most recently from 1c00ed1 to 497a336 Compare September 9, 2018 21:18

h-vetinari force-pushed the asv_str_cat branch 3 times, most recently from d90e164 to 2bccc65 Compare September 10, 2018 07:24

WillAyd requested changes Sep 11, 2018

View reviewed changes

WillAyd added the Benchmark Performance (ASV) benchmarks label Sep 11, 2018

h-vetinari commented Sep 11, 2018

View reviewed changes

h-vetinari force-pushed the asv_str_cat branch 2 times, most recently from c91f374 to 67503c6 Compare September 11, 2018 05:56

WillAyd requested changes Sep 13, 2018

View reviewed changes

h-vetinari commented Sep 13, 2018

View reviewed changes

h-vetinari force-pushed the asv_str_cat branch 2 times, most recently from 4d84816 to 862abe2 Compare September 13, 2018 10:34

h-vetinari added 5 commits September 13, 2018 15:38

ASV: more for str.cat [ci skip]

0c264e3

Parametrize na_fraction [ci skip]

dadeeda

Add comment to explain choice of parameters [ci skip]

4d14cfa

Review (WillAyd) [ci skip]

d92f5f1

Review (WillAyd) [ci skip]

ddc130e

h-vetinari force-pushed the asv_str_cat branch from 862abe2 to ddc130e Compare September 13, 2018 13:38

jreback added this to the 0.24.0 milestone Sep 13, 2018

WillAyd approved these changes Sep 13, 2018

View reviewed changes

TomAugspurger merged commit 6da5a72 into pandas-dev:master Sep 14, 2018

h-vetinari deleted the asv_str_cat branch September 15, 2018 16:39

aeltanawy pushed a commit to aeltanawy/pandas that referenced this pull request Sep 20, 2018

ASV: more for str.cat (pandas-dev#22652)

fab723c

Sup3rGeo pushed a commit to Sup3rGeo/pandas that referenced this pull request Oct 1, 2018

ASV: more for str.cat (pandas-dev#22652)

6d43dfc

h-vetinari mentioned this pull request Jul 26, 2019

DEPR: execute deprecations for str.cat in v1.0 #27611

Merged

3 tasks

Uh oh!

ASV: more for str.cat #22652

ASV: more for str.cat #22652

Uh oh!

Conversation

h-vetinari commented Sep 9, 2018

Uh oh!

pep8speaks commented Sep 9, 2018

Uh oh!

codecov bot commented Sep 9, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

h-vetinari commented Sep 9, 2018

Uh oh!

TomAugspurger commented Sep 10, 2018 via email

Uh oh!

h-vetinari commented Sep 10, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

TomAugspurger commented Sep 10, 2018

Uh oh!

h-vetinari commented Sep 10, 2018

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

h-vetinari left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

WillAyd left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

h-vetinari left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

WillAyd left a comment

Choose a reason for hiding this comment

Uh oh!

h-vetinari commented Sep 14, 2018

Uh oh!

TomAugspurger commented Sep 14, 2018

Uh oh!

Uh oh!

codecov bot commented Sep 9, 2018 •

edited

Loading

h-vetinari commented Sep 10, 2018 •

edited

Loading