Skip to content

FIX update debian 32 bit CI config to avoid a SIMD related bug in numpy 1.24.2 on some GitHub Actions CI workers #29771

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 34 commits into from
Sep 4, 2024

Conversation

ogrisel
Copy link
Member

@ogrisel ogrisel commented Sep 2, 2024

This draft PR aims at collecting debug information about the recently frequently happening failures discussed at:

FAILED cluster/tests/test_spectral.py::test_spectral_clustering_with_arpack_amg_solvers
FAILED feature_extraction/tests/test_image.py::test_img_to_graph - IndexError...
FAILED impute/tests/test_impute.py::test_imputation_mean_median[csc_matrix]
FAILED impute/tests/test_impute.py::test_imputation_mean_median[csc_array] - ...
FAILED preprocessing/tests/test_data.py::test_robust_scale_axis1 - IndexError...

All of these errors are IndexError that might be caused by some alternative integer arithmetic implemented by some CI workers but that cannot be reproduced locally in a docker container.

@ogrisel ogrisel changed the title DEBUG: print verbose info on integer ops in test_img_to_graph DEBUG: print verbose info on integer ops in test_img_to_graph Sep 2, 2024
Copy link

github-actions bot commented Sep 2, 2024

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: 5c7751d. Link to the linter CI: here

@ogrisel
Copy link
Member Author

ogrisel commented Sep 2, 2024

On the last run I got:

edges[0] = array([ 0,  1,  2,  4,  5,  6,  8,  9, 10, 12, 13, 14,  0,  1,  2,  3,  4,
        5,  6,  7,  8,  9, 10, 11])
n_y = 4, n_z = 1
n_y * n_z = 4
edges[0] // (n_y * n_z) = array([ 0,  1,  2,  4,  5,  6,  8,  9, 10, 12, 13, 14,  0,  1,  2,  3,  1,
        1,  1,  1,  2,  2,  2,  2], dtype=int32)

So the Euclidean division computed by NumPy is inconsistent: some results are correct (at the end of the array) but most are invalid!

Also, I don't understand why the dtype is not the same for edges[0] and edges[0] // (n_y * n_z).

@ogrisel
Copy link
Member Author

ogrisel commented Sep 3, 2024

It seems that the problem is random on other PRs but since I added the following reproducer:

    a = np.asarray(
        [0, 1, 2, 4, 5, 6, 8, 9, 10, 12, 13, 14, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
    )
    b = a // 4
    assert all(b <= 3)

to this PR, I have the impression that this problem is always triggered. The problem is that numpy Euclidean distance yields invalid results on a part of the array b. It could be related to a bug in a SIMD operation but numpy has not changed in a while on this build. Maybe the hardware on github actions has changed?

@ogrisel
Copy link
Member Author

ogrisel commented Sep 3, 2024

The last run did not fail. Here is the diff between a CPU that does not trigger the error (E5-2673 v4) and once that does (Platinum 8370C):

< Model name:                           Intel(R) Xeon(R) CPU E5-2673 v4 @ 2.30GHz
---
> Model name:                           Intel(R) Xeon(R) Platinum 8370C CPU @ 2.80GHz
10c10
< Model:                                79
---
> Model:                                106
14,16c14,16
< Stepping:                             1
< BogoMIPS:                             4589.37
< Flags:                                fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology cpuid pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single pti fsgsbase bmi1 hle avx2 smep bmi2 erms invpcid rtm rdseed adx smap xsaveopt md_clear
---
> Stepping:                             6
> BogoMIPS:                             5586.87
> Flags:                                fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology cpuid pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single pti fsgsbase bmi1 hle avx2 smep bmi2 erms invpcid rtm avx512f avx512dq rdseed adx smap clflushopt avx512cd avx512bw avx512vl xsaveopt xsavec xsaves md_clear
19c19
< L1d cache:                            64 KiB (2 instances)
---
> L1d cache:                            96 KiB (2 instances)
21,22c21,34
< L2 cache:                             512 KiB (2 instances)
< L3 cache:                             50 MiB (1 instance)
---
> L2 cache:                             2.5 MiB (2 instances)
> L3 cache:                             48 MiB (1 instance)
> NUMA node(s):                         1
> NUMA node0 CPU(s):                    0,1
> Vulnerability Gather data sampling:   Unknown: Dependent on hypervisor status
> Vulnerability Itlb multihit:          KVM: Mitigation: VMX unsupported
> Vulnerability L1tf:                   Mitigation; PTE Inversion
> Vulnerability Mds:                    Mitigation; Clear CPU buffers; SMT Host state unknown
> Vulnerability Meltdown:               Mitigation; PTI
> Vulnerability Mmio stale data:        Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state unknown
> Vulnerability Reg file data sampling: Not affected
> Vulnerability Retbleed:               Not affected
> Vulnerability Spec rstack overflow:   Not affected
> Vulnerability Spec store bypass:      Vulnerabl

For reference here are the full outputs of lscpu on each run:

Incorrect integer division (test failure):

  • Platinum 8370C (has AVX 512)
Architecture:                         x86_64
CPU op-mode(s):                       32-bit, 64-bit
Address sizes:                        46 bits physical, 48 bits virtual
Byte Order:                           Little Endian
CPU(s):                               2
On-line CPU(s) list:                  0,1
Vendor ID:                            GenuineIntel
Model name:                           Intel(R) Xeon(R) Platinum 8370C CPU @ 2.80GHz
CPU family:                           6
Model:                                106
Thread(s) per core:                   1
Core(s) per socket:                   2
Socket(s):                            1
Stepping:                             6
BogoMIPS:                             5586.87
Flags:                                fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology cpuid pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single pti fsgsbase bmi1 hle avx2 smep bmi2 erms invpcid rtm avx512f avx512dq rdseed adx smap clflushopt avx512cd avx512bw avx512vl xsaveopt xsavec xsaves md_clear
Hypervisor vendor:                    Microsoft
Virtualization type:                  full
L1d cache:                            96 KiB (2 instances)
L1i cache:                            64 KiB (2 instances)
L2 cache:                             2.5 MiB (2 instances)
L3 cache:                             48 MiB (1 instance)
NUMA node(s):                         1
NUMA node0 CPU(s):                    0,1
Vulnerability Gather data sampling:   Unknown: Dependent on hypervisor status
Vulnerability Itlb multihit:          KVM: Mitigation: VMX unsupported
Vulnerability L1tf:                   Mitigation; PTE Inversion
Vulnerability Mds:                    Mitigation; Clear CPU buffers; SMT Host state unknown
Vulnerability Meltdown:               Mitigation; PTI
Vulnerability Mmio stale data:        Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state unknown
Vulnerability Reg file data sampling: Not affected
Vulnerability Retbleed:               Not affected
Vulnerability Spec rstack overflow:   Not affected
Vulnerability Spec store bypass:      Vulnerable
Vulnerability Spectre v1:             Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:             Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Retpoline
Vulnerability Srbds:                  Not affected
Vulnerability Tsx async abort:        Mitigation; Clear CPU buffers; SMT Host state unknown

EDIT: also fails on:

  • Platinum 8272CL (with AVX 512) (4 runs)
Architecture:                         x86_64
CPU op-mode(s):                       32-bit, 64-bit
Address sizes:                        46 bits physical, 48 bits virtual
Byte Order:                           Little Endian
CPU(s):                               2
On-line CPU(s) list:                  0,1
Vendor ID:                            GenuineIntel
Model name:                           Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz
CPU family:                           6
Model:                                85
Thread(s) per core:                   1
Core(s) per socket:                   2
Socket(s):                            1
Stepping:                             7
BogoMIPS:                             5187.81
Flags:                                fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology cpuid pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single pti fsgsbase bmi1 hle avx2 smep bmi2 erms invpcid rtm avx512f avx512dq rdseed adx smap clflushopt avx512cd avx512bw avx512vl xsaveopt xsavec xsaves md_clear
Hypervisor vendor:                    Microsoft
Virtualization type:                  full
L1d cache:                            64 KiB (2 instances)
L1i cache:                            64 KiB (2 instances)
L2 cache:                             2 MiB (2 instances)
L3 cache:                             35.8 MiB (1 instance)
NUMA node(s):                         1
NUMA node0 CPU(s):                    0,1
Vulnerability Gather data sampling:   Unknown: Dependent on hypervisor status
Vulnerability Itlb multihit:          KVM: Mitigation: VMX unsupported
Vulnerability L1tf:                   Mitigation; PTE Inversion
Vulnerability Mds:                    Mitigation; Clear CPU buffers; SMT Host state unknown
Vulnerability Meltdown:               Mitigation; PTI
Vulnerability Mmio stale data:        Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state unknown
Vulnerability Reg file data sampling: Not affected
Vulnerability Retbleed:               Vulnerable
Vulnerability Spectre v1:             Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:             Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Retpoline
Vulnerability Srbds:                  Not affected
Vulnerability Tsx async abort:        Mitigation; Clear CPU buffers; SMT Host state unknown

valid integer division (no test failure):

  • CPU E5-2673 (no AVX512) (2 runs)
Architecture:                         x86_64
CPU op-mode(s):                       32-bit, 64-bit
Address sizes:                        46 bits physical, 48 bits virtual
Byte Order:                           Little Endian
CPU(s):                               2
On-line CPU(s) list:                  0,1
Vendor ID:                            GenuineIntel
Model name:                           Intel(R) Xeon(R) CPU E5-2673 v4 @ 2.30GHz
CPU family:                           6
Model:                                79
Thread(s) per core:                   1
Core(s) per socket:                   2
Socket(s):                            1
Stepping:                             1
BogoMIPS:                             4589.37
Flags:                                fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology cpuid pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single pti fsgsbase bmi1 hle avx2 smep bmi2 erms invpcid rtm rdseed adx smap xsaveopt md_clear
Hypervisor vendor:                    Microsoft
Virtualization type:                  full
L1d cache:                            64 KiB (2 instances)
L1i cache:                            64 KiB (2 instances)
L2 cache:                             512 KiB (2 instances)
L3 cache:                             50 MiB (1 instance)
Vulnerability Spectre v1:             Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:             Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Retpoline
Vulnerability Srbds:                  Not affected
Vulnerability Tsx async abort:        Mitigation; Clear CPU buffers; SMT Host state unknown

@ogrisel
Copy link
Member Author

ogrisel commented Sep 3, 2024

One noticeable difference are the AVX 512 flags on the architecture that fails. That might be a hint that the problem might be related to SIMD operations in numpy. However numpy 1.24.2 is very old and predates the highway-based optimizations I believe.

EDIT: 1.24.2 was released prior to the use of highway but there were already SIMD optimizations:

Let me trigger a few more CI runs to check if this is reproducible.

@lesteve
Copy link
Member

lesteve commented Sep 3, 2024

Inside Azure, I have seen cases where the architecture reported by OpenBLAS is SkyLakeX or Haswell. We don't have OpenBLAS in this build but I think this is somewhat similar to show the output of the following command?

head /sys/devices/cpu*/caps/pmu_name

@ogrisel
Copy link
Member Author

ogrisel commented Sep 3, 2024

@lesteve I added this command to the latest commit. However, I went trough most of the past runs in this PR (once I had enabled lscpu) and I am consolidating the number of runs for each (test success or failure) in #29771 (comment).

So far, all failures (more than 5 so far) happened with Platinum 8370C or Platinum 8272CL CPUs while the CPU E5-2673 CPU was fine twice.

So this seems to confirm that SIMD instructions are the culprit.

We would need to build a more recent version of numpy (e.g. 2.1) from source to check if this problem has been fixed with the switch to highway in numpy 2 but this seems tedious.

Alternatively we might want try to hunt for another docker image with a more recent version of python3-numpy for the 32 bit i686 platform.

@lesteve
Copy link
Member

lesteve commented Sep 3, 2024

However the build time seem slower (8 min vs 3 min previously) than they used to be and I don't know why since the cache statistic look fine. However the total time is largely dominated by the test time (more than 40 min) and that does not seem to have changed.

I think we can live with this. I also thought originally that it may have been a ccache thing but like you I checked the ccache statistics which are very similar between fast and slow build so apparently not ...

@ogrisel
Copy link
Member Author

ogrisel commented Sep 3, 2024

I checked locally and trixie also ships numpy 1.26.4 so we can reasonably hope that it will fix the problem while being more stable than running apt update / install in a sid image.

@ogrisel
Copy link
Member Author

ogrisel commented Sep 3, 2024

Package libatlas3-base is not available, but is referred to by another package.
This may mean that the package is missing, has been obsoleted, or
is only available from another source

E: Package 'libatlas3-base' has no installation candidate
E: Package 'libatlas-base-dev' has no installation candidate

f*ck. I will debug this tomorrow...

@glemaitre glemaitre self-requested a review September 3, 2024 18:08
@glemaitre
Copy link
Member

glemaitre commented Sep 3, 2024

Indeed libatlas is not available in trixie but it is back in sid: https://packages.debian.org/sid/libatlas3-base

Edit:

Apparently it was announce that it will be removed: https://lists.debian.org/debian-science/2023/07/msg00010.html
There is additional information here regarding the status: https://tracker.debian.org/pkg/atlas

So I'm wondering if we should not stop testing against ATLAS because there is not update since 2018. Subsequently, we could modify the documentation and avoid mentioning ATLAS.

@glemaitre glemaitre removed their request for review September 3, 2024 18:55
@lesteve
Copy link
Member

lesteve commented Sep 4, 2024

For some reason, svm/tests/test_svm.py::test_svc_ovr_tie_breaking[NuSVC] is unstable on 32bit. I modified the test using global_random_seed, and it fails for 37 of the seeds on a Debian 32bit image. I skipped this test for now.

Full disclosure, on my machine inside the Debian 32bit image I get a segmentation fault with OpenBLAS when running pytest sklearn/metrics/tests but apparently not the CI ...

@ogrisel
Copy link
Member Author

ogrisel commented Sep 4, 2024

The test failure in test_svc_ovr_tie_breaking[NuSVC] was already reported as: #29633

@ogrisel
Copy link
Member Author

ogrisel commented Sep 4, 2024

I tried to use global_random_seed in test_svc_ovr_tie_breaking on a 64 bit runtime and that test always passes. So it's probably not just a problem of bad test data design unfortunately.

@lesteve
Copy link
Member

lesteve commented Sep 4, 2024

Another mystery partially solved: I added ccache statistics zeroing and now this shows that there is no ccache hit, which is probably the reason why the "install" step takes longer. I don't remember for sure, but I am guessing that the cache may comes from the main branch and once this PR is merged cache reuse may be better 🤞

Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
Copy link
Member Author

@ogrisel ogrisel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. However we need to rename the lock file and the CI config to remove "atlas" from it and avoid adding confusion about what is tested in this build. Do you want to do it @lesteve? Otherwise I can do it.

Trying to fix #29633 is low priority for me unless it affects pyodide which does not seem to be the case so far.

@lesteve
Copy link
Member

lesteve commented Sep 4, 2024

However we need to rename the lock file and the CI config to remove "atlas" from it and avoid adding confusion about what is tested in this build

Good point (I thought about it at one point and forgot), done.

@lesteve
Copy link
Member

lesteve commented Sep 4, 2024

OK last build is green and uses a CPU with AVX512 so let's merge this one. Thanks for debugging this one @ogrisel!

@lesteve lesteve merged commit d0a5fdf into scikit-learn:main Sep 4, 2024
29 checks passed
@lesteve
Copy link
Member

lesteve commented Sep 4, 2024

I don't remember for sure, but I am guessing that the cache may comes from the main branch and once this PR is merged cache reuse may be better 🤞

The hypothesis seems somewhat valid as the build time is back to 2-3 minutes with a ccache hit of 100% see this PR build log.

@ogrisel ogrisel deleted the debug-debian-32-indexerror branch September 4, 2024 12:15
glemaitre pushed a commit to glemaitre/scikit-learn that referenced this pull request Sep 9, 2024
…py 1.24.2 on some GitHub Actions CI workers (scikit-learn#29771)

Co-authored-by: Loïc Estève <loic.esteve@ymail.com>
glemaitre pushed a commit that referenced this pull request Sep 11, 2024
…py 1.24.2 on some GitHub Actions CI workers (#29771)

Co-authored-by: Loïc Estève <loic.esteve@ymail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants