Arm64 CI setup with TravisCI #17996

rth · 2020-07-26T09:21:58Z

Experiment with Arm64 CI setup using TravisCI, since we already have it activated. It's a similar setup to that used by numpy.

Closes #13073

ogrisel · 2020-07-26T13:37:43Z

Maybe the test failure is caused by a corruption of the /home/travis/scikit_learn_data/20news_home/20news-bydate.tar.gz file caused by concurrent download in tests run in // because of the use of pytest-xdist?

It would be worth making the dataset concurrent-safe by downloading the files to temporary filenames and then renaming at the end to ensure atomic operations.

ogrisel · 2020-07-26T14:16:53Z

I will work on making the download safe in another PR.

ogrisel · 2020-07-26T14:49:21Z

Apparently there is no easy way to make the downloaders truly concurrent safe. So maybe prefetching the datasets sequentially to a cache folder is better.

rth · 2020-07-26T19:31:39Z

Overall this looks promising, the ARM CI here runs in 28min including 10min for the tests suite on 32 (I imagine shared) CPUs. The build time should decrease once ccache starts to work.

We could probably run it for each commit on master or in a cron job, not sure.

Now there are two categories of failing tests,

those that make fetch_* calls. That's actually a non issue. By default the test suite should not make network calls, and they were happening because SKLEARN_SKIP_NETWORK_TESTS was set to 0 in the cron jobs. We don't need to download these datasets in arm64, same as we don't do it in default azure jobs. Now we still get some failure because some of the recently modified tests are not correctly handling SKLEARN_SKIP_NETWORK_TESTS as far as I can tell. Possibly somehow related to MNT refactor tests for datasets #17599 (cc @glemaitre ) So that would need to be investigated a bit.
GradientBoostingClassifier doctest ([0.23.1] doctest GradientBoostingClassifier failes on arm(rhel) processors #17797) should now be skipped in conftest.py but if I copy it to the test directory to account for it with pytest --pyargs sklearn I get import errors for this conftest.py. Not sure what's happening there.

I won't be able to look into this in the next few days, so if anyone wants to investigate don't hesitate to push to this PR.

ogrisel · 2020-07-27T09:23:47Z

I would go for cronjob + a flag in the commit message to make it possible to trigger the tests manually in PRs prior to merging ARM related fixes.

ogrisel · 2020-07-30T15:09:49Z

The ICC build failure is unrelated and probably transient:

[...]
Reading package lists... Done
E: Failed to fetch https://apt.repos.intel.com/oneapi/dists/all/main/binary-i386/Packages.gz  Hash Sum mismatch
E: Failed to fetch https://apt.repos.intel.com/oneapi/dists/all/main/binary-all/Packages.gz  
E: Some index files failed to download. They have been ignored, or old ones used instead.

ogrisel · 2020-07-30T15:11:42Z

@jeremiedbb any idea why we fetch binary-i386/Packages.gz on a x66_64 CI machine for the icc-build?

ogrisel

LGTM to me on my end. Ready for final review.

ogrisel · 2020-07-30T15:19:07Z

FYI @rth, I believe the scikit-learn/scikit-learn folder is automatically added to the PYTHONPATH by travis hence the fact that the toplevel conftest.py is already picked up by pytest without the need to copy it to the test folder.

rth · 2020-07-30T15:19:19Z

Thanks for your perseverance in finding this issue @ogrisel ! I also have no idea why travis_terminate is needed either.

I think there are failing tests in other TravisCI builds, that's expected since the scipy-dev build fails on master. We can probably address that in a separate PR.

rth

Thanks @ogrisel! LGTM.

As mentioned above, I would say the other CI failure in Travis CI are unrelated to this PR?

rth · 2020-07-30T15:23:21Z

conftest.py

+            item.add_marker(marker)
+        # Known failure on with GradientBoostingClassifier on ARM64
+        elif (item.name.endswith('GradientBoostingClassifier')
+                and platform.machine() == 'aarch64'):


I'm still confused on the "aarch64" vs "ARM64" since the latter can also happen apparently https://github.com/python/cpython/blob/0c4f0f3b29d84063700217dcf90ad6860ed71c70/Lib/test/test_regrtest.py#L662

Anyway the the issue where this doctest failure was originally reported #17797 was also aarch64 so it's probably fine to merge as is

jeremiedbb · 2020-07-30T15:26:22Z

any idea why we fetch binary-i386/Packages.gz on a x66_64 CI machine for the icc-build?

No but it's just packages available in the apt repository, not the package we're installing, right ?
we add intel apt repo with sudo add-apt-repository "deb https://apt.repos.intel.com/oneapi all main". I guess the 'all main' is a bit overkill but should not be a problem. tbh I just copy/pasted the instructions from intel (https://software.intel.com/content/www/us/en/develop/articles/oneapi-repo-instructions.html#aptpkg)

jeremiedbb · 2020-07-30T15:28:26Z

/opt/intel/inteloneapi/setvars.sh: No such file or directory
hum...

ogrisel · 2020-07-30T15:54:00Z

Indeed I restarted the icc-build to check if the i386 hash mismatch could be reproduced and now it fails with:

[...]
Setting up intel-oneapi-tbb-devel-2021.1-beta08 (2021.1-560.beta08) ...
Setting up intel-oneapi-tbb-devel (2021.1-560.beta08) ...
Setting up intel-oneapi-icc-2021.1-beta08 (2021.1-863.beta08) ...
Setting up intel-oneapi-icc (2021.1-863.beta08) ...
build_tools/travis/install.sh: line 92: /opt/intel/inteloneapi/setvars.sh: No such file or directory

They are probably in the process of updating the apt repo.

ogrisel · 2020-07-30T16:16:43Z

The last arm64 has a failed doctest:

______________________________ [doctest] sgd.rst _______________________________
[gw4] linux -- Python 3.7.8 /home/travis/miniconda/envs/testenv/bin/python3.7
124 parameters if an example violates the margin constraint, which makes
125 training very efficient and may result in sparser models (i.e. with more zero
126 coefficents), even when L2 penalty is used.
127 
128 Using ``loss="log"`` or ``loss="modified_huber"`` enables the
129 ``predict_proba`` method, which gives a vector of probability estimates
130 :math:`P(y|x)` per sample :math:`x`::
131 
132     >>> clf = SGDClassifier(loss="log", max_iter=5).fit(X, y)
133     >>> clf.predict_proba([[1., 1.]])
Expected:
    array([[0.00..., 0.99...]])
Got:
    array([[4.97248476e-07, 9.99999503e-01]])

There is not random_state. This failure is probably random but very rare and not related to arm64. Let me try to restart this build to confirm its randomness.

jeremiedbb · 2020-07-30T16:26:34Z

But array([[4.97248476e-07, 9.99999503e-01]]) is close to array([[0.00..., 0.99...]]). It seems to fail just because it's written in scientific notation.

ogrisel · 2020-07-30T17:37:53Z

But array([[4.97248476e-07, 9.99999503e-01]]) is close to array([[0.00..., 0.99...]]). It seems to fail just because it's written in scientific notation.

But most of the time it passes. It's probably random variations at some lower decimals that trigger the switch between the 2 notations.

glemaitre · 2020-07-31T07:29:12Z

LGTM. I would just add the new tag there: https://scikit-learn.org/stable/developers/contributing.html#continuous-integration-ci
such that it is mentioned in the documentation.

glemaitre · 2020-07-31T07:31:38Z

I am even thinking that in the near future we will probably have to run this CI all the time if manufacturer like Apple is selling personal computers with these processors

ogrisel · 2020-07-31T08:11:44Z

Merged. Thanks all!

Arch64 CI setup with TravisCI

11193af

rth changed the title ~~WIP Arm64 CI setup with TravisCI~~ Arm64 CI setup with TravisCI Jul 26, 2020

Fix some tests

6a9a875

github-actions bot added module:cluster module:neural_network labels Jul 26, 2020

rth added 8 commits July 26, 2020 12:36

Debug CPU arch

83d1d15

Try a different aarch key

1c5068a

Iter

85648f9

Iter

a319f78

Install cython

962e7df

Determine the CPU count

ede0d17

Use pytest-xdist

862a94b

Improve travis bash scripts

9100c73

Pre-fetch dataset and skip failing GradientBoostingClassifier docstring

35feb00

Install jq

fd9a2ae

thomasjpfan mentioned this pull request Jul 26, 2020

ENH Makes fetch_* more threadsafe #18000

Closed

Fix typo

0d1c629

rth added 4 commits July 26, 2020 19:01

More fixes for CI

048d4ae

Better workaround for cache corruption issues

323c28a

Fix tab/spaces

956824e

Fix naming of platform.machine() on arm64

6d5b31f

rth mentioned this pull request Jul 26, 2020

TravisCI aarch64 job is testing x86_64 not arm64 numpy/numpy#16947

Closed

Don't copy conftest

702ce16

rth mentioned this pull request Jul 26, 2020

[0.23.1] doctest GradientBoostingClassifier failes on arm(rhel) processors #17797

Closed

thomasjpfan mentioned this pull request Jul 26, 2020

TST Default skip network tests in datatests' common tests #18002

Merged

ogrisel added 2 commits July 30, 2020 16:42

Fix circle ci

918db5c

Trigger [arm64]

cbee178

ogrisel mentioned this pull request Jul 30, 2020

DOC add section on debugging ARM64 problems #18033

Merged

Test doc in parallel on travis [scipy-dev] [arm64] [icc-build]

3df4c4a

ogrisel marked this pull request as ready for review July 30, 2020 15:16

ogrisel approved these changes Jul 30, 2020

View reviewed changes

ogrisel added arch:arm ARM related issues Waiting for Reviewer labels Jul 30, 2020

rth commented Jul 30, 2020

View reviewed changes

Trigger [arm64] CI

846fa3a

Update CI commit message markers

242527f

ogrisel merged commit 90dc61f into scikit-learn:master Jul 31, 2020

rth deleted the aarch-ci branch July 31, 2020 09:44

thomasjpfan mentioned this pull request Aug 3, 2020

remove Travis config #15189

Closed

cmarmo mentioned this pull request Aug 3, 2020

MNT Add more architectur in CRON job in Travis #15650

Closed

janaknat mentioned this pull request Oct 21, 2020

Add support for aarch64/ARM64 wheels MacPython/scikit-learn-wheels#66

Closed

jayzed82 pushed a commit to jayzed82/scikit-learn that referenced this pull request Oct 22, 2020

Arm64 CI setup with TravisCI (scikit-learn#17996)

51108d5

Uh oh!

Arm64 CI setup with TravisCI #17996

Arm64 CI setup with TravisCI #17996

Uh oh!

Conversation

rth commented Jul 26, 2020

Uh oh!

ogrisel commented Jul 26, 2020

Uh oh!

ogrisel commented Jul 26, 2020

Uh oh!

ogrisel commented Jul 26, 2020

Uh oh!

rth commented Jul 26, 2020 • edited by ogrisel Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

ogrisel commented Jul 27, 2020

Uh oh!

ogrisel commented Jul 30, 2020

Uh oh!

ogrisel commented Jul 30, 2020

Uh oh!

ogrisel left a comment

Choose a reason for hiding this comment

Uh oh!

ogrisel commented Jul 30, 2020

Uh oh!

rth commented Jul 30, 2020

Uh oh!

rth left a comment

Choose a reason for hiding this comment

Uh oh!

rth Jul 30, 2020

Choose a reason for hiding this comment

Uh oh!

jeremiedbb commented Jul 30, 2020

Uh oh!

jeremiedbb commented Jul 30, 2020

Uh oh!

ogrisel commented Jul 30, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

ogrisel commented Jul 30, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

jeremiedbb commented Jul 30, 2020

Uh oh!

ogrisel commented Jul 30, 2020

Uh oh!

glemaitre commented Jul 31, 2020

Uh oh!

glemaitre commented Jul 31, 2020

Uh oh!

ogrisel commented Jul 31, 2020

Uh oh!

Uh oh!

rth commented Jul 26, 2020 •

edited by ogrisel

Loading

ogrisel commented Jul 30, 2020 •

edited

Loading

ogrisel commented Jul 30, 2020 •

edited

Loading