Initial GPU CI setup #29130
Conversation
@ogrisel if you have time it would be good to hear what you think. There is probably lots more to do, but I think it would be good to get your feedback first and then work on things (there is an endless amount of stuff that could be done, but it is unclear what should be done first).
From a quick look it looks good; just curious, have you tested it in your fork?
Just to make sure I understand the way it is supposed to work:
- you can trigger the CI using workflow_dispatch and typing in the PR number (see the sketch below). I guess you don't want to run on a commit message (like Pyodide or nogil) to prevent anyone from overusing the GPU workers, which makes sense.
Other than this, I left a few minor comments.
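For reference, a minimal sketch of what such a manually dispatched trigger could look like; the workflow name and the `pr_id` input are illustrative assumptions, not necessarily what this PR uses:

```yaml
# Hypothetical trigger sketch: the input name `pr_id` is an assumption.
name: CUDA GPU CI
on:
  workflow_dispatch:
    inputs:
      pr_id:
        description: "Pull request number to check out and test"
        required: true
        type: string
```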
.github/workflows/gpu-ci.yml
Outdated
- name: Install miniforge
  if: ${{ steps.cache-conda.outputs.cache-hit != 'true' }}
  run: |
    curl -L -O "https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-$(uname)-$(uname -m).sh"
Is there a good reason not to use the setup-miniconda action: https://github.com/conda-incubator/setup-miniconda?
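For context, a basic invocation of that action might look roughly like the following; this is a sketch only, and the version pin and inputs are assumptions that have not been tested on the GPU runner:

```yaml
# Hypothetical sketch of using the conda-incubator/setup-miniconda action.
- name: Setup conda
  uses: conda-incubator/setup-miniconda@v3
  with:
    miniforge-version: latest          # install Miniforge instead of Miniconda
    activate-environment: sklearn-gpu  # assumed environment name
    auto-update-conda: false
```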
I'll investigate it. I didn't use it because I didn't know it existed!
One potential complication is that conda is already set up in the VM image. However, it is owned by root and we run as UID 1001, and sudo chown 1001 -R .. cannot change the ownership of the existing conda install. I am not sure I understand how that is possible, but yeah.
How important is using the action to you? I looked at it a bit and it doesn't really save us that much typing/config, but I find it harder to understand what it really does (several typescript files to follow compared to a few lines of shell) :-/
I am with you on finding shell easier to follow than multiple TypeScript files, but I would say that these kinds of commonly used actions just work out of the box, and you very rarely have to read the TypeScript files to understand what they do.
For example, it is very likely that using the action would have saved you the hassle of figuring out your chown issue.
I am leaning towards not using the action. It seems that by the time we also configure caching, the action is as complex or more complex to use. I think this is because we already have a lock file, so really we want to save time downloading things, and there is nothing to be gained regarding solving. With our "home made" steps we can skip downloading the conda installer if we have a cache hit; with the action we still have to download the installer.
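A hedged sketch of the "home made" approach described here, with the installer download skipped on a cache hit; the cache path, key, and lock file name are illustrative and may differ from the actual workflow:

```yaml
# Sketch only: cache the conda environment keyed on the lock file, and only
# download/install miniforge when there is no cache hit.
- name: Cache conda environment
  id: cache-conda
  uses: actions/cache@v4
  with:
    path: ~/conda
    key: conda-${{ hashFiles('build_tools/github/pylatest_conda_forge_mkl_array-api_linux-64_conda.lock') }}

- name: Install miniforge
  if: ${{ steps.cache-conda.outputs.cache-hit != 'true' }}
  run: |
    curl -L -O "https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-$(uname)-$(uname -m).sh"
    bash "Miniforge3-$(uname)-$(uname -m).sh" -b -p "$HOME/conda"
```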
Ah right, I did not understand that there was an issue with the GPU GitHub runner image, which apparently is "Ubuntu NVIDIA GPU-Optimized Image for AI and HPC" looking at https://github.com/organizations/scikit-learn/settings/actions/github-hosted-runners/4?viewing_from_runner_group=false. The GitHub doc links to the Azure Marketplace doc, which points to the NVIDIA doc.
Fine for doing the manual conda install and revisiting later.
Looking at the build log, it looks like the error is that you cannot install into the base environment (and we do this to install conda-lock):
EnvironmentNotWritableError: The current user does not have write permissions to the target environment.
  environment location: /opt/miniconda
  uid: 1001
  gid: 100
It looks like in Azure we have a "Take ownership of conda installation" step; maybe we should do something similar here (the CONDA env var may not exist in the GitHub image and may need to be computed by other means)?
sudo chown -R $USER $CONDA
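Such a step might look like the following, with a fallback to the path from the error message above; this is an untested assumption about the GPU image, and as noted further down, chown apparently fails on that image anyway:

```yaml
# Sketch of a "take ownership" step; /opt/miniconda is taken from the error
# message above, and the fallback behaviour is an assumption.
- name: Take ownership of conda installation
  run: sudo chown -R "$USER" "${CONDA:-/opt/miniconda}"
```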
Thinking about it, I still kind of think the setup-miniconda action may just work because it will create an environment first, and the user has the necessary permissions? Needs to be double-checked of course ...
My understanding is that the action uses the available conda on a GitHub worker, so there is no need to download anything and set up a cache (...)
But would it be possible to cache the full conda-lock resolution with setup-conda?
It might still be worth a try to compare the setup time and config verbosity of each method. Since actions/setup-conda is officially maintained by GitHub developers, I have the feeling that it might help us run things more smoothly/efficiently in the long term, but if it makes caching the conda-lock resolution more complex, maybe it's not worth the change.
But would it be possible to cache the full conda-lock resolution with setup-conda?

I don't think so; I missed the fact that this PR was caching the full conda environment.
What I discovered though is that because we are using explicit lock files, conda can install them directly with something like (i.e. no need to install conda-lock):
conda create --name test-env --file build_tools/github/pylatest_conda_forge_mkl_array-api_linux-64_conda.lock
so in principle you could use them directly in the setup-miniconda action, maybe? Since conda-lock downloads all packages anyway, I am not sure there would be much of a difference compared to downloading from the cache, but this needs to be double-checked.
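If the action does accept an explicit lock file as its environment file (its documentation suggests this is supported for explicit specification files, but it is untested here), that could look roughly like:

```yaml
# Sketch only: feed the explicit lock file straight to setup-miniconda.
- uses: conda-incubator/setup-miniconda@v3
  with:
    miniforge-version: latest
    activate-environment: sklearn-gpu  # assumed environment name
    environment-file: build_tools/github/pylatest_conda_forge_mkl_array-api_linux-64_conda.lock
```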
sudo chown -R $USER $CONDA

I tried this but it results in an error, which I think means the creators of the GPU VM image did something special, maybe with additional Linux ACLs.
I don't think I can test it in my fork because the GPU runner is only available in the scikit-learn organisation. I could test that the "trigger run manually and check out a particular PR" part works in a test repo in my account, as it doesn't really need the GPU. I'll do that.
In general the current plan is that a trusted human (someone with merge permissions) manually triggers this workflow. Hopefully in the future we can improve on the manual trigger part. Not quite sure how to do that (yet).
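For illustration, a maintainer could also dispatch the workflow from the command line with the gh CLI instead of the Actions UI; the workflow file name and input name here are assumptions based on the discussion above:

```sh
# Hypothetical example; cuda-gpu-ci.yml and pr_id are assumed names.
gh workflow run cuda-gpu-ci.yml --repo scikit-learn/scikit-learn -f pr_id=29130
```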
To see this workflow kinda in action have a look at the Actions tab of https://github.com/betatim/scikit-learn-ci and the two PRs. For example https://github.com/betatim/scikit-learn-ci/actions/runs/9315866377/job/25642911299 was triggered by me manually starting the workflow and giving the PR ID of Sebastian's PR.
I am averse to generating noisy notifications as well, but I would think it is OKish to trigger CI by comments. Also, you would probably have examples to draw inspiration from, like the conda-forge "please rerender" bot. Probably not the first thing to look at, though.
An alternative option would be to trigger by setting a label. IIRC Triagers (not necessarily Core team members) can do that, but I am not sure if we still have Triagers or not (maybe not?). I guess the label would need to be reset once the workflow is triggered. This will amount to visual noise in the PR (label set/unset as many times as the workflow was triggered) but maybe that is preferable.
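A rough sketch of what that label-based alternative could look like, with the label removed again once the job starts; the label name and the runner selection here are assumptions:

```yaml
# Hypothetical label-triggered workflow; "CUDA CI" is an assumed label name.
on:
  pull_request:
    types: [labeled]

jobs:
  tests:
    if: github.event.label.name == 'CUDA CI'
    runs-on: GPU
    steps:
      - name: Remove the trigger label again
        # Note: the default token may need extra permissions to edit labels on PRs from forks.
        run: gh pr edit ${{ github.event.pull_request.number }} --repo ${{ github.repository }} --remove-label "CUDA CI"
        env:
          GH_TOKEN: ${{ github.token }}
      # ... GPU test steps would follow here
```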
Thanks for the PR. This is very promising. Here is a pass of feedback.
One thing we learnt from investigating is that if a random account creates a PR (#29172), then they can select the GPU runner.
I am experimenting to see what happens when one adds
It's a bit annoying that the UI will have the GPU jobs pending forever: https://github.com/scikit-learn/scikit-learn/actions/workflows/cuda-gpu-ci.yml
@thomasjpfan @betatim maybe this could explain why the workflow is stuck pending for a runner when dispatched by an authorized user:
.github/workflows/cuda-gpu-ci.yml
Outdated
jobs:
  tests:
    runs-on: GPU
Based on the doc, it seems we should be explicit about how we select the runner, e.g. either by label, by group name or both:
Suggested change:
- runs-on: GPU
+ runs-on:
+   group: GPU
+   labels: GPU
Note that both our runner label and our runner group are named "GPU". Maybe the group could be renamed to cuda-gpu-runner-group to be more explicit (and the label "cuda-gpu" ;)
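With that renaming applied, the job selection could then read something like this (names per the suggestion above; the final names may differ):

```yaml
runs-on:
  group: cuda-gpu-runner-group
  labels: cuda-gpu
```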
I renamed the group and the runner.
What made you think that this explains why things were hanging for Thomas?
Looking at it again, I think the hanging issue will still remain, but I think it's mostly a non-issue. I do not think many people will open pull requests and try to invoke the GPU runner.
The CUDA CI run needs to be explicitly triggered as it uses extra resources.
@@ -56,6 +62,14 @@ jobs:
### Note
If the CI tasks fail, create a new branch based on this PR and add the required fixes to that branch.

# The CUDA workflow needs to be triggered explicitly as it uses an expensive runner
- name: Trigger additional tests
I think the only way to test that this is correct is to merge this PR :-/
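For illustration, the "Trigger additional tests" step might take a shape like the following; the workflow file name, the input name, and the $PR_NUMBER variable are assumptions about how the lock-file update job is wired up, not the actual implementation:

```yaml
# The CUDA workflow needs to be triggered explicitly as it uses an expensive runner
- name: Trigger additional tests
  run: |
    # $PR_NUMBER is assumed to be made available by an earlier step
    gh workflow run cuda-gpu-ci.yml --repo "$GITHUB_REPOSITORY" -f pr_id="$PR_NUMBER"
  env:
    GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
```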
Reference Issues/PRs
closes #24491
What does this implement/fix? Explain your changes.
This sets up a GitHub workflow that can be triggered by an admin (workflow_dispatch) and uses a runner that has a GPU. We will use this to test Array API related changes. Until now we had to run these tests locally by hand; as a result, sometimes things slip through the cracks. It will be nice to have a standard CI that runs all the tests in a standardised way.

Any other comments?
For now this is a draft PR as I am exploring what is what in the image on the VM and how best to set up the tools and dependencies we need.
Todo:
- pytest command, e.g. copying what the Azure tests do
- gh pr checkout PR_ID to get the code (see the sketch below)
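As a sketch of the second todo item, the workflow could check out the repository and then switch to the PR branch with the gh CLI; the `pr_id` input name matches the earlier trigger sketch and is an assumption:

```yaml
# Hypothetical checkout steps for the manually provided PR number.
- uses: actions/checkout@v4
- name: Check out the pull request branch
  run: gh pr checkout "${{ inputs.pr_id }}"
  env:
    GH_TOKEN: ${{ github.token }}
```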