
CI Move CUDA CI to pull_request trigger #29376


Merged
merged 4 commits into scikit-learn:main from cuda-ci-comment on Jul 9, 2024

Conversation

betatim
Member

@betatim betatim commented Jul 1, 2024

This modifies the CUDA CI workflow so that it runs with a pull_request trigger, if the PR has the CUDA CI label. When the workflow runs, it removes the label. Each time the label is added the workflow runs.

This improves the security of our workflows, for example by having a cache that is isolated from other branches. The current manual workflow trigger would execute the workflow in the context of the main branch instead.
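
For illustration, a minimal sketch of such a label-gated workflow could look like this (the label name, runner group and remove-label action appear elsewhere in this PR; the exact `if:` expression and step layout are assumptions, not necessarily what the merged workflow contains):

    on:
      pull_request:
        types: [labeled]

    jobs:
      tests:
        if: contains(github.event.pull_request.labels.*.name, 'CUDA CI')
        runs-on:
          group: cuda-gpu-runner-group
        steps:
          # Remove the label so the next run has to be requested explicitly
          - uses: actions-ecosystem/action-remove-labels@v1
            with:
              labels: CUDA CI
          - uses: actions/checkout@v4
          # ... build scikit-learn and run the GPU tests ...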

There is currently no check that no new commit was pushed between the moment the label was applied and the moment the checkout happens.

It is not clear to me whether the workflow will start to run (and thus cost us CI minutes) if a label is applied to a PR but the label does not match "CUDA CI". It looks like the run is skipped immediately: betatim/scikit-learn-ci#6

todo:

This follows on from #29245


github-actions bot commented Jul 1, 2024

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: 6a6c376. Link to the linter CI: here

This lets the workflow run in the context of the PR's branch instead of
using `main`. This improves the security of other workflows, for example
because caches will be properly separated.
@betatim betatim force-pushed the cuda-ci-comment branch from 57ac826 to de0d770 Compare July 1, 2024 09:46
@betatim betatim marked this pull request as ready for review July 1, 2024 11:51
@betatim
Member Author

betatim commented Jul 2, 2024

Maybe checking the label in the `if` of a job does use CI seconds after all. Looking at https://github.com/scikit-learn/scikit-learn/actions/runs/9743674100 it shows a duration of 2s :-/

@lesteve
Member

lesteve commented Jul 5, 2024

Maybe checking the label in the `if` of a job does use CI seconds after all. Looking at scikit-learn/scikit-learn/actions/runs/9743674100 it shows a duration of 2s :-/

I think it is 2s, but not on the CUDA runner, so that should be fine, right?

@betatim
Member Author

betatim commented Jul 5, 2024

I don't know how to tell whether the two seconds count as time on the GPU runner or not. I could be convinced that they do not, because different jobs can run on different runners, so the job selection must happen "outside" of any particular runner. But I could also be convinced that the job selection happens "inside" each job's runner. So yeah :-/

@betatim
Member Author

betatim commented Jul 5, 2024

One topic Loic and I discussed was people adding "evil" commits after you trigger the workflow.

I experimented in a separate repo to see what happens. I re-used a PR betatim/scikit-learn-ci#6 (comment) which already had commits in it, labelled it, waited for the workflow to start, and then added a commit that adds evil.txt. As far as I can tell from looking at https://github.com/betatim/scikit-learn-ci/actions/runs/9805593662/job/27075579866?pr=6 the "evil commit" doesn't get used. The workflow checks out the PR head as it was at the moment the label was added, merges it into the latest commit on main and then runs that.

So I think this means we don't have to worry about this particular attack vector.
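
For completeness, one common extra hardening pattern (an assumption here, not something this PR necessarily adds) is to pin the checkout explicitly to the head SHA recorded in the event payload, so that commits pushed after the label was added can never be picked up:

    - uses: actions/checkout@v4
      with:
        # PR head as recorded when the "labeled" event fired;
        # commits pushed afterwards are ignored.
        ref: ${{ github.event.pull_request.head.sha }}

Note that this checks out the PR head itself rather than the merge commit with main that the default checkout uses.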

Member

@ogrisel ogrisel left a comment


LGTM. Both the permission settings and the label name seem appropriate.

I am fine with relying on an external action to remove the label, since this workflow is no longer sensitive: it does not run in the main branch env but in the isolated pull_request env.

@ogrisel
Member

ogrisel commented Jul 9, 2024

BTW @betatim why do you think we need a timeout for this workflow?

EDIT: to avoid wasting money in case a test stalls.

I agree this is useful but can be done in a follow-up PR if needed.
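
If we do add one, a job-level timeout is a one-line change; a sketch, with an arbitrary placeholder value:

    jobs:
      tests:
        runs-on:
          group: cuda-gpu-runner-group
        # Abort the job if it stalls so the paid GPU runner is freed
        timeout-minutes: 120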

betatim and others added 3 commits July 9, 2024 14:36
@lesteve
Member

lesteve commented Jul 9, 2024

In it goes, thanks a lot!

@lesteve lesteve merged commit a6a5397 into scikit-learn:main Jul 9, 2024
30 checks passed
@betatim betatim deleted the cuda-ci-comment branch July 9, 2024 14:43
@betatim betatim added the CUDA CI label Jul 9, 2024
@ogrisel
Member

ogrisel commented Jul 9, 2024

@betatim the workflow was triggered when the label was added, the commit's CI status report includes the CUDA CI status info, and the tests ran successfully, but the label was not removed...

https://github.com/scikit-learn/scikit-learn/actions/runs/9859209806/job/27222383706#step:2:8

Run actions-ecosystem/action-remove-labels@v1
  with:
    labels: CUDA CI
    github_token: ***
    repo: scikit-learn/scikit-learn
    fail_on_error: false
Warning: failed to remove label: CUDA CI: HttpError: Resource not accessible by integration
Error: Error: failed to remove labels: CUDA CI

@betatim
Member Author

betatim commented Jul 9, 2024

I'll investigate it. Two things to fix/decide:

  1. Should the workflow run on a PR that is already merged? Maybe we don't need that.
  2. Figure out why the label couldn't be removed.

Looking at the "GitHub Token permissions" in https://github.com/scikit-learn/scikit-learn/actions/runs/9859209806/job/27222383706#step:1:13 (unfold to see them), the PR permissions are read-only. So that explains why it failed. The mystery is why the permissions are like that. Could it be because this is a PR from a fork? Or because the new permissions only kick in now that the workflow file has been merged?

@lesteve
Member

lesteve commented Jul 9, 2024

Oh well CI, always full of surprises 😉

@ogrisel
Member

ogrisel commented Jul 10, 2024

should the workflow run on a PR that is already merged? Maybe we don't need that

It's not a requirement, but I would not prevent it explicitly if it works the way it is. I think it could be useful from time to time when investigating the cause of a CUDA-specific regression observed on main.

@ogrisel
Member

ogrisel commented Jul 10, 2024

@betatim maybe you can configure job-level permissions instead of workflow-level permissions?
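
For reference, job-level permissions would look roughly like this (a sketch only; for a pull_request run coming from a fork the token stays read-only regardless of what is requested here, which matches the reply below):

    jobs:
      tests:
        permissions:
          contents: read
          # needed by the remove-label step
          pull-requests: write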

@betatim
Member Author

betatim commented Jul 11, 2024

The permissions are set from outside because the job is running on a PR from a fork.

@ogrisel ogrisel added CUDA CI and removed CUDA CI labels Jul 11, 2024
@ogrisel
Member

ogrisel commented Jul 11, 2024

Given that we had a successful run on this PR in the past, I tried to remove the label and re-add it to retrigger the workflow. The resulting workflow is pending:

Requested runner group: cuda-gpu-runner-group
Job defined at: scikit-learn/scikit-learn/.github/workflows/cuda-gpu-ci.yml@refs/heads/main
Waiting for a runner to pick up this job...
Job is waiting for a runner from 'cuda-gpu-runner' to come online.

while the runner is marked as ready in:

[screenshot: the runner marked as ready]

@lesteve
Member

lesteve commented Jul 11, 2024

Clicking around I ended up here and I see this:
[screenshot]

So I guess this seems to say it should be `runs-on: cuda-gpu-runner`, shouldn't it (I think this was already suggested somewhere else, maybe worth a try)? Rather than:

runs-on:
  group: cuda-gpu-runner-group

Although apparently you can also use groups, so I don't know ... https://docs.github.com/en/actions/using-jobs/choosing-the-runner-for-a-job#example-using-groups-to-control-where-jobs-are-run
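
For reference, the two targeting forms being weighed here would look roughly like this (whether `cuda-gpu-runner` is actually exposed as a runner label is part of the uncertainty):

    # Option 1: target the runner directly by a label
    #   runs-on: cuda-gpu-runner
    # Option 2: target the runner group it belongs to
    runs-on:
      group: cuda-gpu-runner-group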

@lesteve
Member

lesteve commented Jul 11, 2024

Of course that doesn't explain why it worked once and never again but oh well 🤷

@lesteve lesteve added CUDA CI and removed CUDA CI labels Jul 12, 2024
@lesteve
Member

lesteve commented Jul 12, 2024

Success maybe 🎉 🤔
https://github.com/scikit-learn/scikit-learn/actions/runs/9902345566/job/27356109951

I changed the restricted workflow entry; it looks like the syntax was wrong?

https://github.com/organizations/scikit-learn/settings/actions/runner-groups/4

scikit-learn/scikit-learn/.github/workflows/cuda-gpu-ci.yml@refs/heads/main # before
scikit-learn/scikit-learn/.github/workflows/cuda-gpu-ci.yml@main # after

I found the syntax in a blog post and not in the doc:
https://github.blog/changelog/2022-03-21-github-actions-restrict-self-hosted-runner-groups-to-specific-workflows/

@lesteve lesteve removed the CUDA CI label Jul 12, 2024
@lesteve
Member

lesteve commented Jul 12, 2024

The previous message was probably a red herring: when you use @main it is automatically replaced by GitHub with @refs/heads/main.

I tried in #29466
The first one got stuck initially (but later was picked up after I did the modification below I think 🤔):
https://github.com/scikit-learn/scikit-learn/actions/runs/9902543680/job/27356609512?pr=29466
In particular I noticed this:

Job defined at: scikit-learn/scikit-learn/.github/workflows/cuda-gpu-ci.yml@refs/pull/29466/merge

Note that this is @refs/pull/29466/merge and not @refs/heads/main, which could maybe be the reason the workflow does not start.

I don't think there is a way to use a glob in the restricted workflows ... so I allowed all workflows for now in the settings (not sure whether there is some issue with this?):
[screenshot: runner group settings allowing all workflows]

I unlabeled #29466, relabeled it, and now it works 🎉 🤔?
https://github.com/scikit-learn/scikit-learn/actions/runs/9902543680/job/27356609512?pr=29466

@lesteve
Member

lesteve commented Jul 12, 2024

Another success after setting the "CUDA CI" label on #29373, so it does seem like the workflow access restriction was what prevented the job from finding a runner.

I guess we need to think a bit more about "is it OK to allow all workflows?". I guess it's fine but I am not 100% sure ...

@betatim
Member Author

betatim commented Jul 12, 2024

Removing this restriction is the suggestion I got via the support ticket I filed. Good that this solves it! Now we need to think about how we can introduce restrictions/if we need them.

Does this mean you now know as much about workflows as the official support @lesteve? Is that a good thing or worrying :D

@betatim
Member Author

betatim commented Jul 12, 2024

Where did you see the

Job defined at: scikit-learn/scikit-learn/.github/workflows/cuda-gpu-ci.yml@refs/pull/29466/merge

I looked at some job logs and I can't find it :(

@lesteve
Member

lesteve commented Jul 12, 2024

Where did you see the

Job defined at: scikit-learn/scikit-learn/.github/workflows/cuda-gpu-ci.yml@refs/pull/29466/merge

I looked at some job logs and I can't find it :(

It appears before the job starts, while it is waiting for a runner to become available. Once the job starts it is not visible anymore, I think (or at least I did not find it) ...

@lesteve
Member

lesteve commented Jul 12, 2024

Now we need to think about how we can introduce restrictions/if we need them.

The workflow restriction syntax is not documented (or I did not find it), but maybe you can ask on the support ticket whether there is a way to allow the workflow from any PR, i.e. something like this:

scikit-learn/scikit-learn/.github/workflows/cuda-gpu-ci.yml@refs/pull/*/merge

If not, I guess there is no way to use workflow restriction for our use case anyway ...

By the way, I did not even know that it was possible to open a GitHub support ticket. I guess this is because we are paying for the GPU runner? Probably this is the relevant doc.

@betatim
Member Author

betatim commented Jul 12, 2024

I've asked for advice regarding restricting the workflows that can run.

https://support.github.com/ is where I submitted the ticket. Google found it for me.
