Initial GPU CI setup #29130
Conversation
@ogrisel if you have time it would be good to hear what you think. There is probably lots more to do, but I think it would be good to get your feedback first and then work on things (there is an endless amount of stuff that could be done, but it is unclear what should be done first).
From a quick look it looks good; just curious, have you tested it in your fork?
Just to make sure I understand the way it is supposed to work:
- you can trigger the CI using workflow_dispatch and typing in the PR number (see the sketch below). I guess you don't want to run on a commit message (like Pyodide or nogil) to prevent anyone from overusing the GPU workers, which makes sense.
Other than this, I left a few minor comments.
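For reference, a minimal sketch of what such a manually dispatched trigger could look like; the workflow name and the `pr_id` input are illustrative assumptions, not necessarily what this PR uses:

```yaml
# Hypothetical trigger sketch: the input name `pr_id` is an assumption.
name: CUDA GPU CI
on:
  workflow_dispatch:
    inputs:
      pr_id:
        description: "Pull request number to check out and test"
        required: true
        type: string
```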
.github/workflows/gpu-ci.yml
Outdated
- name: Install miniforge
  if: ${{ steps.cache-conda.outputs.cache-hit != 'true' }}
  run: |
    curl -L -O "https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-$(uname)-$(uname -m).sh"
Is there a good reason not to use the setup-miniconda action: https://github.com/conda-incubator/setup-miniconda?
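For context, a basic invocation of that action might look roughly like the following; this is a sketch only, and the version pin and inputs are assumptions that have not been tested on the GPU runner:

```yaml
# Hypothetical sketch of using the conda-incubator/setup-miniconda action.
- name: Setup conda
  uses: conda-incubator/setup-miniconda@v3
  with:
    miniforge-version: latest          # install Miniforge instead of Miniconda
    activate-environment: sklearn-gpu  # assumed environment name
    auto-update-conda: false
```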
I'll investigate it. I didn't use it because I didn't know it existed!
One potential complication is that conda is already set up in the VM image. However, it is owned by root and we run as UID 1001, and sudo chown 1001 -R .. cannot change the ownership of the existing conda install. I am not sure I understand how that is possible, but yeah.
How important is using the action to you? I looked at it a bit and it doesn't really save us that much typing/config, but I find it harder to understand what it really does (several typescript files to follow compared to a few lines of shell) :-/
I am with you on finding shell easier to follow than multiple TypeScript files, but I would say that these kinds of commonly used actions just work out of the box, and you very rarely have to read the TypeScript files to understand what they do.
For example, it is very likely that using the action would have saved you the hassle of figuring out your chown issue.
I am leaning towards not using the action. It seems that by the time we also configure caching, the action is as complex or more complex to use. I think this is because we already have a lock file, so really we want to save time downloading things, and there is nothing to be gained regarding solving. With our "home made" steps we can skip downloading the conda installer if we have a cache hit; with the action we still have to download the installer.
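A hedged sketch of the "home made" approach described here, with the installer download skipped on a cache hit; the cache path, key, and lock file name are illustrative and may differ from the actual workflow:

```yaml
# Sketch only: cache the conda environment keyed on the lock file, and only
# download/install miniforge when there is no cache hit.
- name: Cache conda environment
  id: cache-conda
  uses: actions/cache@v4
  with:
    path: ~/conda
    key: conda-${{ hashFiles('build_tools/github/pylatest_conda_forge_mkl_array-api_linux-64_conda.lock') }}

- name: Install miniforge
  if: ${{ steps.cache-conda.outputs.cache-hit != 'true' }}
  run: |
    curl -L -O "https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-$(uname)-$(uname -m).sh"
    bash "Miniforge3-$(uname)-$(uname -m).sh" -b -p "$HOME/conda"
```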
Ah right, I did not understand that there was an issue with the GPU GitHub runner image, which apparently is "Ubuntu NVIDIA GPU-Optimized Image for AI and HPC" looking at https://github.com/organizations/scikit-learn/settings/actions/github-hosted-runners/4?viewing_from_runner_group=false. The GitHub doc links to the Azure Marketplace doc, which points to the NVIDIA doc.
Fine for doing the manual conda install and revisiting later.
Looking at the build log, it looks like the error is that you cannot install into the base environment (and we do this to install conda-lock):
EnvironmentNotWritableError: The current user does not have write permissions to the target environment.
  environment location: /opt/miniconda
  uid: 1001
  gid: 100
It looks like in Azure we have a "Take ownership of conda installation" step; maybe we should do something similar here (the CONDA env var may not exist in the GitHub image and may need to be computed by other means)?
sudo chown -R $USER $CONDA
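Such a step might look like the following, with a fallback to the path from the error message above; this is an untested assumption about the GPU image, and as noted further down, chown apparently fails on that image anyway:

```yaml
# Sketch of a "take ownership" step; /opt/miniconda is taken from the error
# message above, and the fallback behaviour is an assumption.
- name: Take ownership of conda installation
  run: sudo chown -R "$USER" "${CONDA:-/opt/miniconda}"
```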
Thinking about it, I still kind of think the setup-miniconda action may just work because it will create an environment first, and the user has the necessary permissions? Needs to be double-checked of course ...
My understanding is that the action uses the available conda on a GitHub worker, so there is no need to download anything and set up a cache (...)
But would it be possible to cache the full conda-lock resolution with setup-conda?
It might still be worth a try to compare the setup time and config verbosity of each method. Since actions/setup-conda is officially maintained by GitHub developers, I have the feeling that it might help us run things more smoothly/efficiently in the long term, but if it makes caching the conda-lock resolution more complex, maybe it's not worth the change.
But would it be possible to cache the full conda-lock resolution with setup-conda?

I don't think so; I missed the fact that this PR was caching the full conda environment.
What I discovered though is that because we are using explicit lock files, conda can install them directly with something like (i.e. no need to install conda-lock):
conda create --name test-env --file build_tools/github/pylatest_conda_forge_mkl_array-api_linux-64_conda.lock
so in principle you could use them directly in the setup-miniconda action, maybe? Since conda-lock downloads all packages anyway, I am not sure there would be much of a difference compared to downloading from the cache, but this needs to be double-checked.
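If the action does accept an explicit lock file as its environment file (its documentation suggests this is supported for explicit specification files, but it is untested here), that could look roughly like:

```yaml
# Sketch only: feed the explicit lock file straight to setup-miniconda.
- uses: conda-incubator/setup-miniconda@v3
  with:
    miniforge-version: latest
    activate-environment: sklearn-gpu  # assumed environment name
    environment-file: build_tools/github/pylatest_conda_forge_mkl_array-api_linux-64_conda.lock
```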
sudo chown -R $USER $CONDA

I tried this but it results in an error, which I think means the creators of the GPU VM image did something special, maybe with additional Linux ACLs.
I don't think I can test it in my fork because the GPU runner is only available in the scikit-learn organisation. I could test that the "trigger run manually and check out a particular PR" part works in a test repo in my account, as it doesn't really need the GPU. I'll do that.
In general the current plan is that a trusted human (someone with merge permissions) manually triggers this workflow. Hopefully in the future we can improve on the manual trigger part. Not quite sure how to do that (yet).
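For illustration, a maintainer could also dispatch the workflow from the command line with the gh CLI instead of the Actions UI; the workflow file name and input name here are assumptions based on the discussion above:

```sh
# Hypothetical example; cuda-gpu-ci.yml and pr_id are assumed names.
gh workflow run cuda-gpu-ci.yml --repo scikit-learn/scikit-learn -f pr_id=29130
```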
To see this workflow kinda in action have a look at the Actions tab of https://github.com/betatim/scikit-learn-ci and the two PRs. For example https://github.com/betatim/scikit-learn-ci/actions/runs/9315866377/job/25642911299 was triggered by me manually starting the workflow and giving the PR ID of Sebastian's PR.
I am averse to generating noisy notifications as well, but I would think it is OKish to trigger CI by comments. Also, you would probably have examples to draw inspiration from, like the conda-forge "please rerender" bot. Probably not the first thing to look at, though.
An alternative option would be to trigger by setting a label. IIRC Triagers (not necessarily Core team members) can do that, but I am not sure if we still have Triagers or not (maybe not?). I guess the label would need to be reset once the workflow is triggered. This will amount to visual noise in the PR (label set/unset as many times as the workflow was triggered) but maybe that is preferable.
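A rough sketch of what that label-based alternative could look like, with the label removed again once the job starts; the label name and the runner selection here are assumptions:

```yaml
# Hypothetical label-triggered workflow; "CUDA CI" is an assumed label name.
on:
  pull_request:
    types: [labeled]

jobs:
  tests:
    if: github.event.label.name == 'CUDA CI'
    runs-on: GPU
    steps:
      - name: Remove the trigger label again
        # Note: the default token may need extra permissions to edit labels on PRs from forks.
        run: gh pr edit ${{ github.event.pull_request.number }} --repo ${{ github.repository }} --remove-label "CUDA CI"
        env:
          GH_TOKEN: ${{ github.token }}
      # ... GPU test steps would follow here
```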
Thanks for the PR. This is very promising. Here is a pass of feedback.
One thing we learnt from investigating is that if a random account creates a PR (#29172), then they can select the GPU runner.
I am experimenting to see what happens when one adds
It's a bit annoying that the UI will have the GPU jobs pending forever: https://github.com/scikit-learn/scikit-learn/actions/workflows/cuda-gpu-ci.yml
@thomasjpfan @betatim maybe this could explain why the workflow is stuck pending for a runner when dispatched by an authorized user:
.github/workflows/cuda-gpu-ci.yml
Outdated
jobs:
  tests:
    runs-on: GPU
Based on the doc, it seems we should be explicit about how we select the runner, e.g. either by label, by group name or both:
Suggested change:
- runs-on: GPU
+ runs-on:
+   group: GPU
+   labels: GPU
Note that both our runner label and our runner group are named "GPU". Maybe the group could be renamed to cuda-gpu-runner-group to be more explicit (and the label "cuda-gpu" ;)
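With that renaming applied, the job selection could then read something like this (names per the suggestion above; the final names may differ):

```yaml
runs-on:
  group: cuda-gpu-runner-group
  labels: cuda-gpu
```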
I renamed the group and the runner.
What made you think that this explains why things were hanging for Thomas?
Looking at it again, I think the hanging issue will still remain, but I think it's mostly a non-issue. I do not think many people will open pull requests and try to invoke the GPU runner.
The CUDA CI run needs to be explicitly triggered as it uses extra resources.
@@ -56,6 +62,14 @@ jobs:
### Note
If the CI tasks fail, create a new branch based on this PR and add the required fixes to that branch.

# The CUDA workflow needs to be triggered explicitly as it uses an expensive runner
- name: Trigger additional tests
I think the only way to test that this is correct is to merge this PR :-/
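For illustration, the "Trigger additional tests" step might take a shape like the following; the workflow file name, the input name, and the $PR_NUMBER variable are assumptions about how the lock-file update job is wired up, not the actual implementation:

```yaml
# The CUDA workflow needs to be triggered explicitly as it uses an expensive runner
- name: Trigger additional tests
  run: |
    # $PR_NUMBER is assumed to be made available by an earlier step
    gh workflow run cuda-gpu-ci.yml --repo "$GITHUB_REPOSITORY" -f pr_id="$PR_NUMBER"
  env:
    GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
```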
Reference Issues/PRs
closes #24491
What does this implement/fix? Explain your changes.
This sets up a GitHub workflow that can be triggered by an admin (workflow_dispatch) and uses a runner that has a GPU. We will use this to test Array API related changes. Until now we had to run these tests locally by hand; as a result, sometimes things slip through the cracks. It will be nice to have a standard CI that runs all the tests in a standardised way.

Any other comments?
For now this is a draft PR as I am exploring what is what in the image on the VM and how best to set up the tools and dependencies we need.
Todo:
- pytest command, e.g. copying what the Azure tests do
- gh pr checkout PR_ID to get the code (see the sketch below)
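As a sketch of the second todo item, the workflow could check out the repository and then switch to the PR branch with the gh CLI; the `pr_id` input name matches the earlier trigger sketch and is an assumption:

```yaml
# Hypothetical checkout steps for the manually provided PR number.
- uses: actions/checkout@v4
- name: Check out the pull request branch
  run: gh pr checkout "${{ inputs.pr_id }}"
  env:
    GH_TOKEN: ${{ github.token }}
```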