CI Move CUDA CI to `pull_request` trigger #29376
Conversation
This lets the workflow run in the context of the PR's branch instead of using `main`. This improves the security of other workflows, for example because caches will be properly separated.
Maybe using a label in the ...
I think it is 2s but not on the CUDA runner though, so I think that's fine, right?
I don't know how to tell if the two seconds count as time on the GPU runner or not. I could be convinced that it is not on the GPU runner because different jobs can run on different runners, so the job selection must happen "outside" of that runner. I could also be convinced that the job selection happens "inside" each job's runner. So yeah :-/
One topic Loic and I discussed was people adding "evil" commits after you trigger the workflow. I experimented in a separate repo to see what happens. I re-used a PR betatim/scikit-learn-ci#6 (comment) which already had commits in it, labelled it, waited for the workflow to start, and then pushed an additional commit. The already running workflow was not affected by the new commit, so I think this means we don't have to worry about this particular attack vector.
LGTM. Both the permission settings and the label name seem appropriate.
I am fine with relying on an external action to remove the label since this workflow is no longer sensitive: it does not run in the `main` branch env but instead in the isolated `pull_request` env.

BTW @betatim why do you think we need a timeout for this workflow? EDIT: to avoid wasting money in case a test stalls. I agree this is useful but it can be done in a follow-up PR if needed.
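For reference, such a timeout is just GitHub Actions' job-level `timeout-minutes` setting; a minimal sketch (the job name, runner label and the 20-minute value are placeholders, not what this PR uses):

```yaml
jobs:
  tests:
    runs-on: self-hosted   # placeholder for the GPU runner
    # cap the job duration so a stalled test cannot keep the paid GPU runner busy
    timeout-minutes: 20
```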
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
In it goes, thanks a lot!
@betatim the workflow was triggered when the label was added, the commit CI status report includes the CUDA CI status info, and the test successfully ran, but the label was not removed... https://github.com/scikit-learn/scikit-learn/actions/runs/9859209806/job/27222383706#step:2:8
I'll investigate it. Two things to fix/decide:
Looking at the "GitHub Token permissions" in https://github.com/scikit-learn/scikit-learn/actions/runs/9859209806/job/27222383706#step:1:13 (unfold to see them) the PR permissions are |
Oh well CI, always full of surprises 😉
It's not a requirement, but I would not prevent it explicitly if it works the way it is. I think it could be useful from time to time when investigating the cause of a CUDA-specific regression observed on ...
@betatim maybe you can configure job-level permissions instead of workflow-level permissions?
The permissions are set from outside because the job is running on a PR from a fork.
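For context, job-level permissions would look roughly like this (a sketch, not this PR's actual configuration); note that for `pull_request` runs coming from a fork the `GITHUB_TOKEN` stays read-only regardless of what is declared here:

```yaml
jobs:
  tests:
    # job-level permissions override the workflow-level defaults for this job only
    permissions:
      contents: read
      pull-requests: write   # would be needed to remove the "CUDA CI" label
```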
Given that we had a successful run on this PR in the past, I tried to remove the label and re-add it to retrigger. The resulting workflow is pending, while the runner is marked as ready.
Clicking around I ended up in the runner group settings and I see something that seems to say which workflows can access the runner. So I guess this relates to scikit-learn/.github/workflows/cuda-gpu-ci.yml, lines 17 to 18 in e7af195.
Although I guess you can also use groups apparently, so I don't know ... https://docs.github.com/en/actions/using-jobs/choosing-the-runner-for-a-job#example-using-groups-to-control-where-jobs-are-run
Of course that doesn't explain why it worked once and never again but oh well 🤷
Success maybe 🎉 🤔 I changed the restricted workflow setting, looks like the syntax was wrong? https://github.com/organizations/scikit-learn/settings/actions/runner-groups/4

```
scikit-learn/scikit-learn/.github/workflows/cuda-gpu-ci.yml@refs/heads/main  # before
scikit-learn/scikit-learn/.github/workflows/cuda-gpu-ci.yml@main             # after
```

I found the syntax in a blog post and not in the doc.
Previous message was probably a red herring when you use ...; I tried in #29466.
Note that this is ... I don't think there is a way to use a glob in the restricted workflows, so I allowed all workflows for now in the settings (not sure if there is some issue with this?). Unlabeled #29466, relabeled it, and now it works 🎉 🤔?
Another success setting the "CUDA CI" label on #29373, so it does seem like this was the workflow access restriction preventing the job from finding a runner. I guess we need to think a bit more about "is it OK to allow all workflows?". I guess it's fine but I am not 100% sure ...
Removing this restriction is the suggestion I got via the support ticket I filed. Good that this solves it! Now we need to think about how we can introduce restrictions/if we need them. Does this mean you now know as much about workflows as the official support @lesteve? Is that a good thing or worrying :D
Where did you see that? I looked at some job logs and I can't find it :(
It happens before the job starts, when it is waiting for a runner to be available. Once the job starts this is not visible anymore I think (or at least I did not find it) ...
The workflow restriction syntax is not documented (or I did not find it), but maybe you can ask on the support ticket if there is a way to allow the workflow from any PR, so something like this (see the illustrative sketch below):

If not, I guess there is no way to use the workflow restriction for our use case anyway ... By the way, I did not even know that it was possible to open a GitHub support ticket. I guess this is because we are paying for the GPU runner? Probably this is the relevant doc.
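Purely as an illustration of the kind of pattern being asked about (hypothetical; it is not known whether the restriction field accepts wildcards at all):

```
# hypothetical wildcard form, matching the workflow on any ref
scikit-learn/scikit-learn/.github/workflows/cuda-gpu-ci.yml@*
```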
I've asked for advice regarding restricting the workflows that can run. https://support.github.com/ is where I submitted the ticket. Google found it for me.
This modifies the CUDA CI workflow so that it runs with a `pull_request` trigger, if the PR has the `CUDA CI` label. When the workflow runs, it removes the label. Each time the label is added the workflow runs.

This improves the security of our workflows, for example by having a cache that is isolated from other branches. The current manual workflow trigger would execute the workflow in the context of the `main` branch instead.

There is currently no check that no new commit was pushed between the moment the label was applied and the moment the checkout happens.
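For illustration, a label-gated workflow of this kind would have roughly the following shape; everything below (workflow name, job name, runner label, timeout, SHA pinning and test command) is an assumption made for the sketch, not the actual contents of cuda-gpu-ci.yml:

```yaml
name: CUDA GPU
on:
  pull_request:
    types: [labeled]   # fires each time a label is added to the PR

jobs:
  tests:
    # only run when the "CUDA CI" label is present on the PR
    if: contains(github.event.pull_request.labels.*.name, 'CUDA CI')
    runs-on: self-hosted        # placeholder; the real job targets the GPU runner
    timeout-minutes: 20         # illustrative value
    steps:
      - uses: actions/checkout@v4
        with:
          # pin to the head SHA recorded in the labeling event, so commits pushed
          # afterwards are not silently picked up by this run
          ref: ${{ github.event.pull_request.head.sha }}
      # removing the label needs a token with pull-requests: write, which the
      # default token of a pull_request run from a fork does not have
      - name: Run the GPU tests
        run: python -m pytest --pyargs sklearn   # placeholder command
```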
It is not clear to me if the workflow will start to run (-> cost us CI minutes) if a label is applied to a PR but the label does not match "CUDA CI". It looks like the CI run will be skipped immediately: betatim/scikit-learn-ci#6
todo:
- find a way to remove the label without using `gh` (as that isn't installed in the CUDA VM image :()

This follows on from #29245