
Initial GPU CI setup #29130


Merged
merged 37 commits into main from betatim/gpu-ci on Jun 4, 2024

Conversation

betatim
Member

@betatim betatim commented May 29, 2024

Reference Issues/PRs

closes #24491

What does this implement/fix? Explain your changes.

This sets up a GitHub workflow that can be triggered by an admin (workflow_dispatch) and uses a runner that has a GPU. We will use this to test Array API related changes. Until now we had to run these tests locally by hand; as a result, things sometimes slip through the cracks. It will be nice to have a standard CI that runs all the tests in a standardised way.
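For illustration, a minimal sketch of the triggering mechanism described above; the input name pr_id is hypothetical and the actual workflow may differ:

on:
  workflow_dispatch:
    inputs:
      pr_id:
        description: "Pull request number to check out and test on the GPU runner"
        required: true

jobs:
  tests:
    runs-on: GPU  # label of the GPU-equipped runner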

Any other comments?

For now this is a draft PR, as I am exploring what is what in the image on the VM and how best to set up the tools and dependencies we need.

Todo:

  • decide on creating a new lock file or not
    • new lock file made
  • use a good pytest command, e.g. copying what the Azure tests do
  • upload coverage
    • no coverage reports for now. I think it would make for weird results when comparing PRs where this workflow ran with those where it didn't get executed (i.e. PRs that are not Array API related)
  • set up caching of miniforge
  • which version of pytorch and cupy should we install? ~latest of each? pick a minimum supported version and test against that?
  • make sure the workflow isn't executed on a GPU machine just to evaluate conditions on the workflow or its steps
    • currently this is true. Not sure what we can do to prevent this also in the future?
  • test how we can select the pull request ref/branch when selecting where the workflow should run
    • the workflow takes the PR ID as input and then runs gh pr checkout PR_ID to get the code (see the sketch after this list)
    • tricky to test until we merge it though
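A rough sketch of the checkout step mentioned in the list above, assuming the hypothetical pr_id input from the earlier sketch and that gh is available and authenticated on the runner:

- name: Checkout the pull request branch
  run: gh pr checkout "${{ inputs.pr_id }}"
  env:
    GH_TOKEN: ${{ github.token }}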


github-actions bot commented May 29, 2024

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: 39f9407. Link to the linter CI: here

@betatim betatim marked this pull request as ready for review May 30, 2024 12:58
@betatim
Member Author

betatim commented May 30, 2024

@ogrisel if you have time it would be good to hear what you think. There is probably lots more to do, but hearing your thoughts first would help decide what to work on next (there is an endless amount of stuff that could be done, but it is unclear what should be done first).

Member

@lesteve lesteve left a comment


From a quick look it looks good; just curious, have you tested it in your fork?

Just to make sure I understand the way it is supposed to work:

  • you can trigger the CI using workflow_dispatch and typing the PR number. I guess you don't want to run on a commit message (like Pyodide or nogil) to prevent anyone overusing the GPU workers, which makes sense

Other than this, I left a few minor comments.

- name: Install miniforge
  if: ${{ steps.cache-conda.outputs.cache-hit != 'true' }}
  run: |
    curl -L -O "https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-$(uname)-$(uname -m).sh"
Member

Is there a good reason not to use the setup-miniconda action: https://github.com/conda-incubator/setup-miniconda?

Member Author

I'll investigate it. I didn't use it because I didn't know it existed!

One potential complication is that conda is already set up in the VM image. However, it is owned by root and we run as UID 1001, and sudo chown 1001 -R .. cannot change the ownership of the existing conda install. I am not sure I understand how that is possible, but yeah.

Member Author

How important is using the action to you? I looked at it a bit and it doesn't really save us that much typing/config, but I find it harder to understand what it really does (several typescript files to follow compared to a few lines of shell) :-/

Member

I am with you on finding shell easier to follow than multiple TypeScript files, but I would say that this kind of commonly used action just works out of the box, and you very rarely have to read the TypeScript files to understand what it does.

For example, it is very likely that using the action would have saved you the hassle of figuring out your chown issue.

Member Author

@betatim betatim May 31, 2024

I am leaning towards not using the action. It seems that by the time we also configure caching, using the action is as complex as, or more complex than, our hand-rolled steps. I think this is because we already have a lock file, so really we want to save time on downloads and there is nothing to be gained on the solving side. With our "home made" steps we can skip downloading the conda installer if we have a cache hit; with the action we still have to download the installer.
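For context, a sketch of the kind of caching being described here, keyed on the lock file so the cached environment is only rebuilt when the lock file changes (the cache path is an assumption about where miniforge gets installed):

- name: Cache conda installation
  id: cache-conda
  uses: actions/cache@v4
  with:
    path: ~/conda  # assumed install prefix
    key: conda-${{ runner.os }}-${{ hashFiles('build_tools/github/pylatest_conda_forge_mkl_array-api_linux-64_conda.lock') }}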

Member

@lesteve lesteve Jun 1, 2024

Ah right, I did not understand that there was an issue with the GPU GitHub runner image, which apparently is "Ubuntu NVIDIA GPU-Optimized Image for AI and HPC" (see https://github.com/organizations/scikit-learn/settings/actions/github-hosted-runners/4?viewing_from_runner_group=false). The GitHub doc links to the Azure Marketplace doc, which points to the NVIDIA doc.

Fine with doing the manual conda install and revisiting later.

Looking at the build log, it looks like the error is that you cannot install into the base environment (and we do this to install conda-lock):

EnvironmentNotWritableError: The current user does not have write permissions to the target environment.
  environment location: /opt/miniconda
  uid: 1001
  gid: 100

It looks like in Azure we have a "Take ownership of conda installation" step; maybe we should do something similar here (the CONDA env var may not exist in the GitHub image and may need to be computed by other means)?

sudo chown -R $USER $CONDA
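A rough shell sketch of that idea, assuming conda is on the PATH when $CONDA is not set (untested against the GPU image):

# fall back to the prefix derived from the conda binary when $CONDA is unset
CONDA_DIR="${CONDA:-$(dirname "$(dirname "$(command -v conda)")")}"
sudo chown -R "$USER" "$CONDA_DIR"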

Member

@lesteve lesteve Jun 1, 2024

Thinking about it, I still kind of think setup-miniconda may just work, because it will create an environment first and the user has the necessary permissions? Needs to be double-checked of course ...
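If we wanted to try it, a minimal sketch of using the action could look roughly like this (input names and values should be double-checked against the action's documentation; this is not what the PR currently does, and the environment name is made up):

- uses: conda-incubator/setup-miniconda@v3
  with:
    miniforge-version: latest
    activate-environment: sklearn-gpu-test  # hypothetical environment name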

Member

My understanding is that the action uses the conda already available on a GitHub worker, so there is no need to download anything or set up a cache (...)

But would it be possible to cache the full conda-lock resolution with setup-conda?


It might still be worth a try to compare setup time and config verbosity of each method. Since actions/setup-conda is officially maintained by GitHub developers, I have the feeling that it might help us run things more smoothly/efficiently in the long term, but if it makes caching the conda-lock resolution more complex, maybe it's not worth the change.

Member

But would it be possible to cache the full conda-lock resolution with setup-conda?

I don't think so, I missed the fact that this PR was caching the full conda environment.

What I discovered though is that because we are using explicit lock-files, conda can install them directly with something like (i.e. no need to install conda-lock):

conda create --name test-env --file build_tools/github/pylatest_conda_forge_mkl_array-api_linux-64_conda.lock

so in principle you could use them directly in the setup-miniconda action, maybe? Since conda-lock downloads all packages, I am not sure there would be much of a difference compared to downloading from the cache, but this needs to be double-checked.
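For reference, wiring the command above into a plain workflow step could look roughly like this (the step and environment names are illustrative):

- name: Create environment from the explicit lock file
  run: conda create --yes --name test-env --file build_tools/github/pylatest_conda_forge_mkl_array-api_linux-64_conda.lock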

Member Author

sudo chown -R $USER $CONDA

I tried this but it results in an error, which I think means the creators of the GPU VM image did something special, maybe using additional Linux ACLs.

@betatim
Member Author

betatim commented May 31, 2024

From a quick look it looks good, just curious have you tested it in your fork?

I don't think I can test it in my fork because the GPU runner is only available in the scikit-learn organisation. I could test that the "trigger run manually and check out a particular PR" part works in a test repo in my account, as it doesn't really need the GPU. I'll do that.

Just to make sure I understand the way it is supposed to work:

* you can trigger the CI using `workflow_dispatch` and typing the PR number. I guess you don't want to run on a commit message (like Pyodide or nogil) to prevent anyone overusing the GPU workers, which makes sense

In general, the current plan is that a trusted human (someone with merge permissions) manually triggers this workflow. Using workflow_dispatch, combined with having to enter the PR ID, seemed like the best way to do that. I thought about using a "post a message on the PR" approach but decided against it, because I've never built a workflow that reacts to a PR message and because each PR message generates notifications for everyone who watches the PR (and I don't love getting notifications just because someone is re-re-re-re-triggering a workflow :D).

Hopefully in the future we can improve on the manual trigger part; not quite sure yet how to do that.

@betatim
Member Author

betatim commented May 31, 2024

To see this workflow kinda in action, have a look at the Actions tab of https://github.com/betatim/scikit-learn-ci and the two PRs. For example, https://github.com/betatim/scikit-learn-ci/actions/runs/9315866377/job/25642911299 was triggered by me manually starting the workflow and giving the PR ID of Sebastian's PR.

@lesteve
Member

lesteve commented May 31, 2024

I thought about using a "Post a message on the PR" approach but decided against it because I've never built a workflow that reacts to a PR message and because each PR message generates notifications to everyone who watches the PR (and I don't love it when I get notifications just because someone is re-re-re-re-triggering a workflow :D).

I am averse to generating noisy notifications as well, but I would think it is OKish to trigger CI by comments. Also, you would probably have examples to draw inspiration from, like the conda-forge "please rerender" bot. Probably not the first thing to look at though, as workflow_dispatch is fine for now.

An alternative option would be to trigger by setting a label. IIRC Triagers (not necessarily core team members) can do that, but I am not sure if we still have Triagers (maybe not?). I guess the label would need to be reset once the workflow is triggered. This amounts to visual noise in the PR (the label was set/unset as many times as the workflow was triggered) but maybe it is preferable.
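A rough sketch of the label-based idea, purely for illustration (the label name cuda-ci is made up, and the fork/permission implications would need checking):

on:
  pull_request:
    types: [labeled]

jobs:
  tests:
    if: github.event.label.name == 'cuda-ci'
    runs-on: GPU
    steps:
      # ... run the GPU tests ...
      # then remove the label so it can be re-applied to re-trigger
      - run: gh pr edit ${{ github.event.pull_request.number }} --remove-label 'cuda-ci'
        env:
          GH_TOKEN: ${{ github.token }}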

Member

@ogrisel ogrisel left a comment

Thanks for the PR. This is very promising. Here is a pass of feedback.


@thomasjpfan thomasjpfan mentioned this pull request Jun 3, 2024
@scikit-learn-bot scikit-learn-bot mentioned this pull request Jun 3, 2024
@betatim
Member Author

betatim commented Jun 3, 2024

One thing we learnt from investigating is that if a random account creates a PR (#29172) then they can select the GPU runner.

@thomasjpfan
Member

I am experimenting to see what happens when one adds pull_request: into a workflow. On https://github.com/organizations/scikit-learn/settings/actions/runner-groups/4, I added this branch's workflow:

[screenshot from June 3, 2024 showing the runner group settings with this branch's workflow added]

It's a bit annoying that the UI will have the GPU jobs pending forever: https://github.com/scikit-learn/scikit-learn/actions/workflows/cuda-gpu-ci.yml

Member

@ogrisel ogrisel left a comment

@thomasjpfan @betatim maybe this could explain why the workflow is stuck pending for a runner when dispatched by an authorized user:


jobs:
  tests:
    runs-on: GPU
Member

Based on the doc, it seems we should be explicit about how we select the runner, e.g. either by label, by group name or both:

Suggested change:

-runs-on: GPU
+runs-on:
+  group: GPU
+  labels: GPU

Note that both our runner label and our runner group are named "GPU". Maybe the group could be renamed to cuda-gpu-runner-group to be more explicit (and the label "cuda-gpu" ;)
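With the renaming suggested above, the runner selection would presumably end up reading something like:

runs-on:
  group: cuda-gpu-runner-group
  labels: cuda-gpu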

Member Author

I renamed the group and the runner.

What made you think that this explains why things were hanging for Thomas?

Member

Looking at it again, I think the hanging issue will still remain, but I think it's mostly a non-issue. I do not think many people will open pull requests and try to invoke the GPU runner.

@@ -56,6 +62,14 @@ jobs:
### Note
If the CI tasks fail, create a new branch based on this PR and add the required fixes to that branch.

# The CUDA workflow needs to be triggered explicitly as it uses an expensive runner
- name: Trigger additional tests
Member Author

I think the only way to test that this is correct is to merge this PR :-/
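For reference, a sketch of what such a dispatch step could look like; gh workflow run is one way to fire a workflow_dispatch event, and this is not necessarily the exact command the PR adds ($BRANCH_NAME is a placeholder for the branch created by the lock-file update):

- name: Trigger additional tests
  run: gh workflow run cuda-gpu-ci.yml --ref "$BRANCH_NAME"  # plus any required -f inputs
  env:
    GH_TOKEN: ${{ github.token }}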

@thomasjpfan thomasjpfan enabled auto-merge (squash) June 4, 2024 23:15
@thomasjpfan thomasjpfan merged commit 62e6fcf into main Jun 4, 2024
30 checks passed
@ogrisel ogrisel deleted the betatim/gpu-ci branch June 5, 2024 10:15
@ogrisel ogrisel mentioned this pull request Jun 5, 2024
@jeremiedbb jeremiedbb mentioned this pull request Jul 2, 2024
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

Weekly CI run with NVidia GPU hardware
4 participants