FIX update debian 32 bit CI config to avoid a SIMD related bug in numpy 1.24.2 on some GitHub Actions CI workers #29771
Conversation
`test_img_to_graph`

On the last run I got:

So the Euclidean division computed by NumPy is inconsistent: some results are correct (at the end of the array) but most are invalid! Also, I don't understand why the dtype is not the same for
It seems that the problem is random on other PRs, but since I added the following reproducer to this PR, I have the impression that the problem is always triggered:

```python
a = np.asarray(
    [0, 1, 2, 4, 5, 6, 8, 9, 10, 12, 13, 14, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
)
b = a // 4
assert all(b <= 3)
```

The problem is that NumPy's Euclidean division yields invalid results on a part of the array.
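For reference, here is a minimal sketch (not from the PR) of the same division done element-wise in pure Python, showing what the array `b` should contain on any CPU:

```python
# Sketch: element-wise floor division in pure Python for comparison with
# the NumPy reproducer above; every quotient of a value <= 14 by 4 is <= 3.
values = [0, 1, 2, 4, 5, 6, 8, 9, 10, 12, 13, 14,
          0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
expected = [v // 4 for v in values]
print(expected)
# [0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3, 0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2]
assert all(q <= 3 for q in expected)
```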
The last run did not fail. Here is the diff between a CPU that does not trigger the error (E5-2673 v4) and one that does (Platinum 8370C):

```diff
< Model name: Intel(R) Xeon(R) CPU E5-2673 v4 @ 2.30GHz
---
> Model name: Intel(R) Xeon(R) Platinum 8370C CPU @ 2.80GHz
10c10
< Model: 79
---
> Model: 106
14,16c14,16
< Stepping: 1
< BogoMIPS: 4589.37
< Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology cpuid pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single pti fsgsbase bmi1 hle avx2 smep bmi2 erms invpcid rtm rdseed adx smap xsaveopt md_clear
---
> Stepping: 6
> BogoMIPS: 5586.87
> Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology cpuid pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single pti fsgsbase bmi1 hle avx2 smep bmi2 erms invpcid rtm avx512f avx512dq rdseed adx smap clflushopt avx512cd avx512bw avx512vl xsaveopt xsavec xsaves md_clear
19c19
< L1d cache: 64 KiB (2 instances)
---
> L1d cache: 96 KiB (2 instances)
21,22c21,34
< L2 cache: 512 KiB (2 instances)
< L3 cache: 50 MiB (1 instance)
---
> L2 cache: 2.5 MiB (2 instances)
> L3 cache: 48 MiB (1 instance)
> NUMA node(s): 1
> NUMA node0 CPU(s): 0,1
> Vulnerability Gather data sampling: Unknown: Dependent on hypervisor status
> Vulnerability Itlb multihit: KVM: Mitigation: VMX unsupported
> Vulnerability L1tf: Mitigation; PTE Inversion
> Vulnerability Mds: Mitigation; Clear CPU buffers; SMT Host state unknown
> Vulnerability Meltdown: Mitigation; PTI
> Vulnerability Mmio stale data: Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state unknown
> Vulnerability Reg file data sampling: Not affected
> Vulnerability Retbleed: Not affected
> Vulnerability Spec rstack overflow: Not affected
> Vulnerability Spec store bypass: Vulnerabl
```

For reference here are the full outputs of:

Incorrect integer division (test failure):

EDIT: also fails on:

Valid integer division (no test failure):
One noticeable difference is the AVX-512 flags on the architecture that fails. That might be a hint that the problem is related to SIMD operations in NumPy. However, NumPy 1.24.2 is quite old and predates the highway-based optimizations, I believe.

EDIT: 1.24.2 was released prior to the use of highway, but there were already SIMD optimizations:

Let me trigger a few more CI runs to check if this is reproducible.
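To check which SIMD extensions NumPy actually enabled on a given worker, something along these lines could be run (a hedged sketch: `np.show_runtime()` exists in NumPy >= 1.24, and `__cpu_features__` is an internal attribute that may change across versions):

```python
# Sketch: inspect the CPU features NumPy detected at runtime, to correlate
# failing workers with AVX-512 support. __cpu_features__ is internal and
# undocumented, so treat this as a debugging aid, not a stable API.
import numpy as np

np.show_runtime()  # prints detected SIMD extensions and threading/BLAS info

from numpy.core._multiarray_umath import __cpu_features__
print({name: found for name, found in __cpu_features__.items()
       if name.startswith("AVX512")})
```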
Inside Azure, I have seen cases where the architecture reported by OpenBLAS is SkyLakeX or Haswell. We don't have OpenBLAS in this build, but I think this is somewhat similar. Could you show the output of the following command?
@lesteve I added this command to the latest commit. However, I went through most of the past runs in this PR (once I had enabled

So far, all failures (more than 5 so far) happened with

So this seems to confirm that SIMD instructions are the culprit. We would need to build a more recent version of numpy (e.g. 2.1) from source to check whether this problem has been fixed with the switch to highway in numpy 2, but this seems tedious. Alternatively, we might want to try to hunt for another docker image with a more recent version of
I think we can live with this. I also thought originally that it might have been a ccache thing, but like you I checked the ccache statistics, which are very similar between the fast and slow builds, so apparently not...
I checked locally and trixie also ships numpy 1.26.4, so we can reasonably hope that it will fix the problem while being more stable than running apt update / install in a sid image.
f*ck. I will debug this tomorrow...
Indeed.

Edit: Apparently it was announced that it will be removed: https://lists.debian.org/debian-science/2023/07/msg00010.html

So I'm wondering whether we should stop testing against ATLAS, since there has been no update since 2018. Subsequently, we could modify the documentation and avoid mentioning ATLAS.
For some reason,

Full disclosure: on my machine inside the Debian 32-bit image I get a segmentation fault with OpenBLAS when running
The test failure in
I tried to use
Another mystery partially solved: I added ccache statistics zeroing and now this shows that there is no ccache hit, which is probably the reason why the "install" step takes longer. I don't remember for sure, but I am guessing that the cache may come from the
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
LGTM. However, we need to rename the lock file and the CI config to remove "atlas" from them and avoid adding confusion about what is tested in this build. Do you want to do it @lesteve? Otherwise I can do it.
Trying to fix #29633 is low priority for me unless it affects pyodide, which does not seem to be the case so far.
Good point (I thought about it at one point and forgot), done.
OK, the last build is green and uses a CPU with AVX-512, so let's merge this one. Thanks for debugging this one, @ogrisel!
The hypothesis seems somewhat valid, as the build time is back to 2-3 minutes with a ccache hit rate of 100%; see this PR build log.
FIX update debian 32 bit CI config to avoid a SIMD related bug in numpy 1.24.2 on some GitHub Actions CI workers (#29771)
Co-authored-by: Loïc Estève <loic.esteve@ymail.com>
This draft PR aims at collecting debug information about the failures that have recently been happening frequently, discussed at:

All of these errors are `IndexError` exceptions that might be caused by some alternative integer arithmetic implemented on some CI workers, but they cannot be reproduced locally in a docker container.
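As a hedged illustration (not the actual scikit-learn code), here is how out-of-range floor-division results would surface as an `IndexError`: the array shapes and variable names below are made up.

```python
# Illustration only: wrong results from // turn into IndexError at the
# indexing step, matching the kind of CI tracebacks described above.
import numpy as np

edges = np.arange(24)
rows = edges // 4           # should stay within [0, 5] for a 6 x 4 grid
cols = edges % 4
grid = np.zeros((6, 4))
grid[rows, cols] = 1        # raises IndexError if rows holds values >= 6
```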