Skip to content

[CI] Add circleci build and test for arm64. #20460

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 13 commits into from
Aug 4, 2021

Conversation

cmarmo
Copy link
Contributor

@cmarmo cmarmo commented Jul 3, 2021

What does this implement/fix? Explain your changes.

This pull request adds continuous integration on circleCI for arm64 architectures.
Thanks for considering it.

Any other comments?

We spoke about that during a Consortium meeting but at that time the service was opt-in and needed the organization to subscribe.
I have checked the workflow on my github account, so I guess it is now available to everyone.

Please note that a number of tests are failing.
Also, the list of arm related issues is accessible here.

@cmarmo cmarmo closed this Jul 4, 2021
@cmarmo cmarmo reopened this Jul 4, 2021
Copy link
Member

@ogrisel ogrisel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice!

It means that we can probably remove the following entry in the travis CI config and update the documentation to remove the [arm64] manual flag in the scikit-learn CI documentation.

https://github.com/scikit-learn/scikit-learn/blob/main/.travis.yml#L33-L40

In a follow-up PRs we might also migrate the full travis config for [cd build] to ARM-based Circle CI but this is significantly more work because as it would also involve configuring secret tokens. But since we did not run out of credits for travis in a while, maybe that's not needed.

@ogrisel
Copy link
Member

ogrisel commented Jul 5, 2021

I also wonder if the test failures with numpy.linalg.LinAlgError: Eigenvalues did not converge are related to atlas vs openblas.

@ogrisel
Copy link
Member

ogrisel commented Jul 5, 2021

Note that numpy has similar ARM64 failures (numpy.linalg.LinAlgError: Eigenvalues did not converge) on their dev build on travis:

https://travis-ci.com/github/numpy/numpy/jobs/521809468

and I also observed similar failures today on our travis arm64 build when working with [cd build] commit flag on an unrelated change in this PR: #20468 (comment).

@ogrisel
Copy link
Member

ogrisel commented Jul 5, 2021

We did get the failure on our weekly arm64 build on travis.

https://travis-ci.com/github/scikit-learn/scikit-learn/jobs/521594473

but not 8 days ago:

https://travis-ci.com/github/scikit-learn/scikit-learn/builds/230982232

@ogrisel
Copy link
Member

ogrisel commented Jul 5, 2021

The last successful arm64 nightly build for numpy was 4 days ago: https://travis-ci.com/github/numpy/numpy/builds/231467572

@ogrisel
Copy link
Member

ogrisel commented Jul 5, 2021

I opened numpy/numpy#19411 to track the issue upstream in numpy but this is probably something that changed in the runtime environment other than numpy / scipy and openblas since those jobs use the same released wheels for those libs...

@cmarmo
Copy link
Contributor Author

cmarmo commented Jul 5, 2021

It means that we can probably remove the following entry in the travis CI config and update the documentation to remove the [arm64] manual flag in the scikit-learn CI documentation.

https://github.com/scikit-learn/scikit-learn/blob/main/.travis.yml#L33-L40

I have removed the trigger and the related documentation.
I agree that travis builds should be considered in another pull request.

Copy link
Member

@glemaitre glemaitre left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. It is quite nice that the CI takes only 25 minutes to build and tests. This is quite impressive indeed. I was recalling that it was almost an hour with the first ARM64 platform.

@cmarmo
Copy link
Contributor Author

cmarmo commented Jul 21, 2021

@ogrisel the numpy.linalg.LinAlgError: Eigenvalues did not converge has disappeared... I don't know why and when...

@ogrisel
Copy link
Member

ogrisel commented Jul 30, 2021

@ogrisel the numpy.linalg.LinAlgError: Eigenvalues did not converge has disappeared... I don't know why and when...

After investigation, it seems it was fixed in OpenBLAS 0.3.16:

  • fixed missing restore of a register in the recently rewritten DNRM2 kernel
    for ThunderX2 and Neoverse N1 that could cause spurious failures in e.g.
    DGEEV

Copy link
Member

@ogrisel ogrisel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not 100% sure why this is working again because we are still using scipy 1.7.0 which had the problem previously... but as long at it works now.

Anyways here are a few suggestion for further improvements:

sudo apt-get install python3-scipy python3-matplotlib libopenblas-base \
python3-virtualenv ccache
python3 -m virtualenv --system-site-packages --python=python3 testenv
source testenv/bin/activate
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just in case pip fix bugs in the future, let's use the latest version in the venv:

Suggested change
source testenv/bin/activate
source testenv/bin/activate
pip install --upgrade pip

Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
@cmarmo
Copy link
Contributor Author

cmarmo commented Aug 1, 2021

Something has changed in openmp management? Two azure builds are failing with

=================================== FAILURES ===================================
_______________________ test_openmp_parallelism_enabled ________________________
[gw0] linux -- Python 3.9.6 /usr/share/miniconda/envs/testvenv/bin/python

    def test_openmp_parallelism_enabled():
        # Check that sklearn is built with OpenMP-based parallelism enabled.
        # This test can be skipped by setting the environment variable
        # ``SKLEARN_SKIP_OPENMP_TEST``.
        if os.getenv("SKLEARN_SKIP_OPENMP_TEST"):
            pytest.skip("test explicitly skipped (SKLEARN_SKIP_OPENMP_TEST)")
    
        base_url = "dev" if __version__.endswith(".dev0") else "stable"
        err_msg = textwrap.dedent(
            """
            This test fails because scikit-learn has been built without OpenMP.
            This is not recommended since some estimators will run in sequential
            mode instead of leveraging thread-based parallelism.
    
            You can find instructions to build scikit-learn with OpenMP at this
            address:
    
                https://scikit-learn.org/{}/developers/advanced_installation.html
    
            You can skip this test by setting the environment variable
            SKLEARN_SKIP_OPENMP_TEST to any value.
            """
        ).format(base_url)
    
>       assert _openmp_parallelism_enabled(), err_msg
E       AssertionError: 
E         This test fails because scikit-learn has been built without OpenMP.
E         This is not recommended since some estimators will run in sequential
E         mode instead of leveraging thread-based parallelism.
E         
E         You can find instructions to build scikit-learn with OpenMP at this
E         address:
E         
E             https://scikit-learn.org/dev/developers/advanced_installation.html
E         
E         You can skip this test by setting the environment variable
E         SKLEARN_SKIP_OPENMP_TEST to any value.
E         
E       assert False
E        +  where False = _openmp_parallelism_enabled()

base_url   = 'dev'
err_msg    = '\nThis test fails because scikit-learn has been built without OpenMP.\nThis is not recommended since some estimators ...tallation.html\n\nYou can skip this test by setting the environment variable\nSKLEARN_SKIP_OPENMP_TEST to any value.\n'

../1/s/sklearn/tests/test_build.py:33: AssertionError

@ogrisel
Copy link
Member

ogrisel commented Aug 4, 2021

Probably the issue reported as #20640? Hopefully merging main with #20654 will help.

@ogrisel
Copy link
Member

ogrisel commented Aug 4, 2021

I made a mistake because the github UI was not up to date and applied some of the review suggestion twice ... I will try to fix. Sorry for the mess.

@ogrisel
Copy link
Member

ogrisel commented Aug 4, 2021

I will merge as soon as the last CI run passes.

@ogrisel
Copy link
Member

ogrisel commented Aug 4, 2021

It's green! Merging. Thank you very much @cmarmo for the improved CI setup!

@ogrisel ogrisel merged commit 7366abd into scikit-learn:main Aug 4, 2021
@ogrisel
Copy link
Member

ogrisel commented Aug 4, 2021

@cmarmo are you interested in a follow-up PR to also migrate the [cd build] part of the ARM-based travis config? Because of weird travis errors, the weekly ARM64 wheel build is broken at the moment and it would be great to get it back up and working.

@cmarmo
Copy link
Contributor Author

cmarmo commented Aug 4, 2021

@cmarmo are you interested in a follow-up PR to also migrate the [cd build] part of the ARM-based travis config?

Yes, I'm interested. I will take care of it ASAP. Thanks @ogrisel for the suggestion.

@glemaitre
Copy link
Member

Nice Thanks @cmarmo

@cmarmo cmarmo deleted the circle-ci-arm branch August 4, 2021 18:55
@thomasjpfan
Copy link
Member

This is great work! Thank you @cmarmo :)

samronsin pushed a commit to samronsin/scikit-learn that referenced this pull request Nov 30, 2021
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants