CI Move CUDA CI to `pull_request` trigger #29376

betatim · 2024-07-01T09:35:36Z

This modifies the CUDA CI workflow so that it runs with a pull_request trigger, if the PR has the CUDA CI label. When the workflow runs, it removes the label. Each time the label is added the workflow runs.

This improves the security of our workflows, for example by having a cache that is isolated from other branches. The current manual workflow trigger would execute the workflow in the context of the main branch instead.

There is currently no check that no new commit was pushed between the moment the label was applied and the checkout happens.

It is not clear to me if the workflow will start to run (-> cost us CI minutes) if a label is applied to a PR but the label does not match "CUDA CI". It looks like the CI run will be skipped immediately: betatim/scikit-learn-ci#6

todo:

decide label name, currently "CUDA CI".
add a timeout for the job https://docs.github.com/en/actions/using-workflows/workflow-syntax-for-github-actions#jobsjob_idtimeout-minutes
test whether or not the workflow triggers when labelling with "Foo Bar". Use https://github.com/betatim/scikit-learn-ci/ for test
discuss workflow permissions, is this the minimal set?
do we want to rely on an external workflow to remove the label or make a short python script that calls the REST API? (can't use gh as that isn't installed in the CUDA VM image :()

This follows on from #29245

github-actions · 2024-07-01T09:36:53Z

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

_{Generated for commit: 6a6c376. Link to the linter CI: here}

This lets the workflow run in the context of the PR's branch instead of using `main`. This improves the security of other workflows, for example because caches will be properly separated.

betatim · 2024-07-02T07:46:45Z

Maybe using a label in the if of a job does use CI seconds. Looking at https://github.com/scikit-learn/scikit-learn/actions/runs/9743674100 it says 2s for duration :-/

lesteve · 2024-07-05T07:33:06Z

Maybe using a label in the if of a job does use CI seconds. Looking at scikit-learn/scikit-learn/actions/runs/9743674100 it says 2s for duration :-/

I think it is 2s but not on the CUDA runner though, so I think that's fine, right?

betatim · 2024-07-05T08:35:16Z

I don't know how to tell if the two seconds count as time on the GPU runner or not. I could be convinced that it is not on the GPU runner because different jobs can run on different runners, so the job selection must be "outside" of that runner. I could also be convinced that the job selection happens "inside" each jobs runner. So yeah :-/

betatim · 2024-07-05T09:19:02Z

One topic Loic and I discussed was people adding "evil" commits after you trigger the workflow.

I experimented in a separate repo to see what happens. I re-used a PR betatim/scikit-learn-ci#6 (comment) which already had commits in it, labelled it, waited for the workflow to start, and then added a commit that adds evil.txt. As far as I can tell from looking at https://github.com/betatim/scikit-learn-ci/actions/runs/9805593662/job/27075579866?pr=6 the "evil commit" doesn't get used. It checks out the last commit from the moment that the label was added, merges it into the latest commit on main and then runs it.

So I think this means we don't have to worry about this particular attack vector.

ogrisel

LGTM. Both the permission settings and the label name seem appropriate.

I am fine on relying on an external action to remove the label since this not worklow is no longer sensitive as it does not run in the main branch env but instead of the isolated pull_request env.

.github/workflows/cuda-gpu-ci.yml

ogrisel · 2024-07-09T12:03:48Z

BTW @betatim why do you think we need a timeout for this workflow?

EDIT: to avoid wasting money in case a test stalls.

I agree this is useful but can be done in a follow-up PR if needed.

Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>

lesteve · 2024-07-09T14:37:14Z

In it goes, thanks a lot!

ogrisel · 2024-07-09T15:15:53Z

@betatim the workflow was triggered when the label was added, the commit CI status report includes the CUDA CI status info and the test successfully ran but the label was not removed...

https://github.com/scikit-learn/scikit-learn/actions/runs/9859209806/job/27222383706#step:2:8

Run actions-ecosystem/action-remove-labels@v1
  with:
    labels: CUDA CI
    github_token: ***
    repo: scikit-learn/scikit-learn
    fail_on_error: false
Warning: failed to remove label: CUDA CI: HttpError: Resource not accessible by integration
Error: Error: failed to remove labels: CUDA CI

betatim · 2024-07-09T15:30:14Z

I'll investigate it. Two things to fix/decide:

should the workflow run on a PR that is already merged? Maybe we don't need that
figure out why the label couldn't be removed.

Looking at the "GitHub Token permissions" in https://github.com/scikit-learn/scikit-learn/actions/runs/9859209806/job/27222383706#step:1:13 (unfold to see them) the PR permissions are read. So that explains why it failed. The mystery is why the permissions are like that. Could be because this is a PR from a fork? Or because the new permission only kicks in now that the workflow file has been merged?

lesteve · 2024-07-09T15:42:18Z

Oh well CI, always full of surprises 😉

ogrisel · 2024-07-10T07:48:01Z

should the workflow run on a PR that is already merged? Maybe we don't need that

It's not a requirement, but I would not prevent it explicitly if it works the way it is. I think it could be useful from times to times when investigating the cause of a CUDA specific regression observed on main.

ogrisel · 2024-07-10T07:49:17Z

@betatim maybe you can configure job level permissions instead of workflow level permissions?

betatim · 2024-07-11T14:28:35Z

The permissions are set from outside because the job is running on a PR from a fork

ogrisel · 2024-07-11T15:31:06Z

Given that we had a successful run on this PR in the past I tried to remove the label and re-add it to retrigger. The resulting workflow is pending:

Requested runner group: cuda-gpu-runner-group
Job defined at: scikit-learn/scikit-learn/.github/workflows/cuda-gpu-ci.yml@refs/heads/main
Waiting for a runner to pick up this job...
Job is waiting for a runner from 'cuda-gpu-runner' to come online.

https://github.com/scikit-learn/scikit-learn/actions/runs/9894233015/job/27331249054

while the runner is marked as ready in:

https://github.com/organizations/scikit-learn/settings/actions/runners

lesteve · 2024-07-11T17:42:36Z

Clicking around I ended up here and I see this:

So I guess this seems to say runs-on: cuda-gpu-runner shouldn't it (I think this was already suggested somewhere else, maybe worth a try)?

scikit-learn/.github/workflows/cuda-gpu-ci.yml

Lines 17 to 18 in e7af195

    
           runs-on: 
        
             group: cuda-gpu-runner-group

Although I guess you can also use groups apparently so I don't know ... https://docs.github.com/en/actions/using-jobs/choosing-the-runner-for-a-job#example-using-groups-to-control-where-jobs-are-run

lesteve · 2024-07-11T17:48:54Z

Of course that doesn't explain why it worked once and never again but oh well 🤷

lesteve · 2024-07-12T04:26:44Z

Success maybe 🎉 🤔
https://github.com/scikit-learn/scikit-learn/actions/runs/9902345566/job/27356109951

I changed the restricted workflow looks like the syntax was wrong?

https://github.com/organizations/scikit-learn/settings/actions/runner-groups/4

scikit-learn/scikit-learn/.github/workflows/cuda-gpu-ci.yml@refs/heads/main # before
scikit-learn/scikit-learn/.github/workflows/cuda-gpu-ci.yml@main # after

I found the syntax in a blog post and not in the doc:
https://github.blog/changelog/2022-03-21-github-actions-restrict-self-hosted-runner-groups-to-specific-workflows/

lesteve · 2024-07-12T05:00:44Z

Previous message was probably a red herring, when you use @main it is replaced automatically by github by @refs/head/main.

I tried in #29466
The first one got stuck initially (but later was picked up after I did the modification below I think 🤔):
https://github.com/scikit-learn/scikit-learn/actions/runs/9902543680/job/27356609512?pr=29466
In particular I noticed this:

Job defined at: scikit-learn/scikit-learn/.github/workflows/cuda-gpu-ci.yml@refs/pull/29466/merge

Note that this is @refs/pull/29466/merge and not @refs/head/main which maybe could be the reason the workflow does not start.

I don't think there is a way to use glob in the restricted workflows ... so I allowed all workflows for now in the settings (not sure if there is some issue with this?):

Unlabeled #29466 relabeled and now it works 🎉 🤔?
https://github.com/scikit-learn/scikit-learn/actions/runs/9902543680/job/27356609512?pr=29466

lesteve · 2024-07-12T05:14:14Z

Another success setting the "CUDA CI" label on #29373 so it does seem like this was the workflow access restriction preventing the job to find a runner.

I guess we need to think a bit more about "is it OK to allow all workflows?". I guess it's fine but I am not 100% sure ...

betatim · 2024-07-12T07:27:11Z

Removing this restriction is the suggestion I got via the support ticket I filed. Good that this solves it! Now we need to think about how we can introduce restrictions/if we need them.

Does this mean you now know as much about workflows as the official support @lesteve? Is that a good thing or worrying :D

betatim · 2024-07-12T07:29:31Z

Where did you see the

Job defined at: scikit-learn/scikit-learn/.github/workflows/cuda-gpu-ci.yml@refs/pull/29466/merge

I looked at some job logs and I can't find it :(

lesteve · 2024-07-12T07:42:25Z

Where did you see the

Job defined at: scikit-learn/scikit-learn/.github/workflows/cuda-gpu-ci.yml@refs/pull/29466/merge

I looked at some job logs and I can't find it :(

It happens before the job start when it is waiting for a runner to be available. Once the job starts this is not visible anymore I think (or at least I did not find it) ...

lesteve · 2024-07-12T08:17:10Z

Now we need to think about how we can introduce restrictions/if we need them.

The workflow restriction syntax is not documented (or I did not find it) but maybe you can ask on the support ticket if there is a way to allow workflow from any PRs, so something like this:

scikit-learn/scikit-learn/.github/workflows/cuda-gpu-ci.yml@refs/pull/*/merge

If not, I guess there is no way to use workflow restriction for our use case anyway ...

By the way, I did not even know that it was possible to open a Github support ticket. I guess this is because we are paying for the GPU runner? Probably this is the relevant doc.

betatim · 2024-07-12T08:33:02Z

I've asked for advice regarding restricting the workflows that can run.

https://support.github.com/ is where I submitted the ticket. Google found it for me

github-actions bot added the Build / CI label Jul 1, 2024

Switch to using a pull_request trigger

de0d770

This lets the workflow run in the context of the PR's branch instead of using `main`. This improves the security of other workflows, for example because caches will be properly separated.

betatim force-pushed the cuda-ci-comment branch from 57ac826 to de0d770 Compare July 1, 2024 09:46

betatim marked this pull request as ready for review July 1, 2024 11:51

betatim added the Array API label Jul 1, 2024

ogrisel approved these changes Jul 9, 2024

View reviewed changes

.github/workflows/cuda-gpu-ci.yml Outdated Show resolved Hide resolved

.github/workflows/cuda-gpu-ci.yml Outdated Show resolved Hide resolved

betatim and others added 3 commits July 9, 2024 14:36

Specify major version only

270401b

Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>

Remove unneeded id

65f6001

Add a timeout to the CUDA CI job

6a6c376

lesteve merged commit a6a5397 into scikit-learn:main Jul 9, 2024
30 checks passed

betatim deleted the cuda-ci-comment branch July 9, 2024 14:43

betatim added the CUDA CI label Jul 9, 2024

This was referenced Jul 10, 2024

CI Don't run CUDA workflow on closed/merged PRs #29448

Closed

CI Move label removal to a separate workflow #29456

Merged

ogrisel added CUDA CI and removed CUDA CI labels Jul 11, 2024

lesteve added CUDA CI and removed CUDA CI labels Jul 12, 2024

lesteve removed the CUDA CI label Jul 12, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

CI Move CUDA CI to `pull_request` trigger #29376

CI Move CUDA CI to `pull_request` trigger #29376

betatim commented Jul 1, 2024 •

edited by lesteve

Loading

github-actions bot commented Jul 1, 2024 •

edited

Loading

betatim commented Jul 2, 2024

lesteve commented Jul 5, 2024

betatim commented Jul 5, 2024

betatim commented Jul 5, 2024

ogrisel left a comment

ogrisel commented Jul 9, 2024 •

edited

Loading

lesteve commented Jul 9, 2024

ogrisel commented Jul 9, 2024 •

edited

Loading

betatim commented Jul 9, 2024

lesteve commented Jul 9, 2024

ogrisel commented Jul 10, 2024

ogrisel commented Jul 10, 2024

betatim commented Jul 11, 2024

ogrisel commented Jul 11, 2024

lesteve commented Jul 11, 2024 •

edited

Loading

lesteve commented Jul 11, 2024

lesteve commented Jul 12, 2024 •

edited

Loading

lesteve commented Jul 12, 2024 •

edited

Loading

lesteve commented Jul 12, 2024 •

edited

Loading

betatim commented Jul 12, 2024

betatim commented Jul 12, 2024

lesteve commented Jul 12, 2024

lesteve commented Jul 12, 2024 •

edited

Loading

betatim commented Jul 12, 2024

CI Move CUDA CI to pull_request trigger #29376

CI Move CUDA CI to pull_request trigger #29376

Conversation

betatim commented Jul 1, 2024 • edited by lesteve Loading

github-actions bot commented Jul 1, 2024 • edited Loading

✔️ Linting Passed

betatim commented Jul 2, 2024

lesteve commented Jul 5, 2024

betatim commented Jul 5, 2024

betatim commented Jul 5, 2024

ogrisel left a comment

Choose a reason for hiding this comment

ogrisel commented Jul 9, 2024 • edited Loading

lesteve commented Jul 9, 2024

ogrisel commented Jul 9, 2024 • edited Loading

betatim commented Jul 9, 2024

lesteve commented Jul 9, 2024

ogrisel commented Jul 10, 2024

ogrisel commented Jul 10, 2024

betatim commented Jul 11, 2024

ogrisel commented Jul 11, 2024

lesteve commented Jul 11, 2024 • edited Loading

lesteve commented Jul 11, 2024

lesteve commented Jul 12, 2024 • edited Loading

lesteve commented Jul 12, 2024 • edited Loading

lesteve commented Jul 12, 2024 • edited Loading

betatim commented Jul 12, 2024

betatim commented Jul 12, 2024

lesteve commented Jul 12, 2024

lesteve commented Jul 12, 2024 • edited Loading

betatim commented Jul 12, 2024

CI Move CUDA CI to `pull_request` trigger #29376

CI Move CUDA CI to `pull_request` trigger #29376

betatim commented Jul 1, 2024 •

edited by lesteve

Loading

github-actions bot commented Jul 1, 2024 •

edited

Loading

ogrisel commented Jul 9, 2024 •

edited

Loading

ogrisel commented Jul 9, 2024 •

edited

Loading

lesteve commented Jul 11, 2024 •

edited

Loading

lesteve commented Jul 12, 2024 •

edited

Loading

lesteve commented Jul 12, 2024 •

edited

Loading

lesteve commented Jul 12, 2024 •

edited

Loading

lesteve commented Jul 12, 2024 •

edited

Loading