Skip to content

Fix the stalled linux/arm64 [cd build] jobs on travis #20958

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 17 commits into from
Sep 7, 2021

Conversation

ogrisel
Copy link
Member

@ogrisel ogrisel commented Sep 6, 2021

As an alternative to #20711 that might be too slow because we only have one concurrent executor on the free circle ci account.

@ogrisel
Copy link
Member Author

ogrisel commented Sep 6, 2021

The root cause of the problem seems to be described here: https://travis-ci.community/t/builds-hang-with-output-truncated-mid-line/7611/9 (currently reading the discussion).

@ogrisel
Copy link
Member Author

ogrisel commented Sep 6, 2021

Ok apparently my workaround works, for some reason I do not really understand...

To workaround travis arm64 buffering issues causing the job
to be wrongly detected as stalled after 10min by travis.
@ogrisel
Copy link
Member Author

ogrisel commented Sep 6, 2021

Testing wheels used to take 961.99s before tuning parallelism using joblib.cpu_count(). Let's see if the last commit can improve upon this.

@alfaro96
Copy link
Member

alfaro96 commented Sep 7, 2021

Having a look to the logs, the issue is apparently a stalled test, is it not?

@ogrisel
Copy link
Member Author

ogrisel commented Sep 7, 2021

Having a look to the logs, the issue is apparently a stalled test, is it not?

It could be but we cannot be sure because the std output buffer is truncated. And it often happens that the stalling happens when cibuildwheel is running basic pip install commands prior to running the tests themselves.

Others suspect in https://travis-ci.community/t/builds-hang-with-output-truncated-mid-line/7611/9 that the buffering of std output on travis linux/arm64 workers is actually the cause of the problem: because travis does not see output, it detects the build as stalled even though it is not.

@ogrisel
Copy link
Member Author

ogrisel commented Sep 7, 2021

I think parallelism is well tuned now. ~7 min for each build and they mostly happen in parallel. This is much faster than the circleci sequential build.

Let me re-enable the other CIs.

@ogrisel ogrisel changed the title DEBUG Try to debug the stalled ARM travis wheel build jobs Fix the stalled ARM travis wheel build jobs Sep 7, 2021
@ogrisel
Copy link
Member Author

ogrisel commented Sep 7, 2021

I am not sure which change helps the most to fix the original problem. I think the first time I got it to work prior to rebasing was after moving the pip install inside the test script. I am not sure if disabling Python buffering and removing the set -e mode in bash scripts is actually needed or not.

But anyways it seems to work and more importantly it is fast!

@ogrisel ogrisel requested a review from alfaro96 September 7, 2021 10:08
@ogrisel ogrisel changed the title Fix the stalled ARM travis wheel build jobs Fix the stalled linux/arm64 [cd build] jobs on travis Sep 7, 2021
@ogrisel
Copy link
Member Author

ogrisel commented Sep 7, 2021

When squash-merging this PR, it would be nice to put [cd build] in the commit message so has to trigger a fake weekly build for linux/arm64 on the main branch and get the README.md badge green again.

@ogrisel ogrisel added this to the 1.0 milestone Sep 7, 2021
@ogrisel
Copy link
Member Author

ogrisel commented Sep 7, 2021

The failure is unrelated and being addressed in #20963.

@ogrisel ogrisel mentioned this pull request Sep 7, 2021
Copy link
Member

@alfaro96 alfaro96 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

Thank you @ogrisel for working on this.

Just a question, have you tried the new CPU_COUNT without the || travis_terminate sentences?

@ogrisel
Copy link
Member Author

ogrisel commented Sep 7, 2021

Just a question, have you tried the new CPU_COUNT without the || travis_terminate sentences?

I think I had one that worked with a lower value of CPU_COUNT. Unfortunately I rebased and it's no longer in the history of this PR.

@ogrisel ogrisel merged commit 89d66b3 into scikit-learn:main Sep 7, 2021
@ogrisel ogrisel deleted the travis-wheel-debug branch September 7, 2021 17:24
@ogrisel
Copy link
Member Author

ogrisel commented Sep 7, 2021

Merged to make it easier to move forward with the release process.

@ogrisel
Copy link
Member Author

ogrisel commented Sep 7, 2021

@adrinjalali you might want to backport this to the 1.0.X branch to get linux/arm64 wheels generated in fast and robust way (hopefully).

@ogrisel ogrisel added the To backport PR merged in master that need a backport to a release branch defined based on the milestone. label Sep 7, 2021
thomasjpfan pushed a commit to thomasjpfan/scikit-learn that referenced this pull request Sep 7, 2021
@adrinjalali
Copy link
Member

tagging #20965

adrinjalali pushed a commit to adrinjalali/scikit-learn that referenced this pull request Sep 10, 2021
samronsin pushed a commit to samronsin/scikit-learn that referenced this pull request Nov 30, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Blocker Bug Build / CI To backport PR merged in master that need a backport to a release branch defined based on the milestone.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants