Native code order of magnitude slower than translated code on Apple M1 #17989

neurolabusc · 2020-12-12T13:14:37Z

I realize numpy is using experimental compilers for native builds on the M1, and still has some bugs, so it might be premature to discuss optimizations. Perhaps this is a feature request and not a bug. However, one would expect that native ARM code would typically be at least as fast as translated x86-64 code. I noticed that the nibabel bench_finite_range.py test is much slower for the native code than translated code. I found translated code (Python 3.8.3, NumPy version 1.19.4) is x10 faster than native code (Python 3.9.1rc1, NumPy version 1.19.4)

Reproducing code example:

# -*- coding: utf-8 -*-

import numpy as np
from numpy.testing import measure
#example where translated code (Python 3.8.3, NumPy version 1.19.4) is x10 faster than native code (Python 3.9.1rc1, NumPy version 1.19.4)
rng = np.random.RandomState(20111001)
img_shape = (128, 128, 64, 10)
repeat = 100
arr = rng.normal(size=img_shape)
mtime = measure('np.max(arr)', repeat)
print('%30s %6.2f' % ('max all finite', mtime))
mtime = measure('np.min(arr)', repeat)
print('%30s %6.2f' % ('min all finite', mtime))
arr[:, :, :, 1] = np.nan
mtime = measure('np.max(arr)', repeat)
print('%30s %6.2f' % ('max all nan', mtime))
mtime = measure('np.min(arr)', repeat)
print('%30s %6.2f' % ('min all nan', mtime))

Performance:

Translated:

$ time ./numpy_native_slower_than_translated.py
                max all finite   0.18
                min all finite   0.18
                   max all nan   0.18
                   min all nan   0.19
./numpy_native_slower_than_translated.py  1.32s user 1.28s system 214% cpu 1.213 total

Native:

$ time ./numpy_native_slower_than_translated.py
                max all finite   1.98
                min all finite   1.99
                   max all nan   1.99
                   min all nan   1.98
./numpy_native_slower_than_translated.py  8.49s user 0.14s system 104% cpu 8.237 total

NumPy/Python version information:

Translated:

1.19.4 3.8.3 (default, May 19 2020, 13:54:14)
[Clang 10.0.0 ]

Native:

1.19.4 3.9.1rc1 | packaged by conda-forge | (default, Nov 28 2020, 22:21:58)
[Clang 11.0.0 ]

The text was updated successfully, but these errors were encountered:

mattip · 2020-12-12T17:44:52Z

Is there a way to test the HEAD or the numpy 1.20rc1 with native? We have started to use SIMD intrinsics in a way that might make this faster.

neurolabusc · 2020-12-13T12:49:55Z

Happy to help, are there instructions for building on this platform? I have Clang 12.0.0 and gFortran 11.0.0 20201114 (experimental) but not gcc proper. It looks like there are dependencies for bras - do I use brew to install from source?

$pip install numpy==1.20rc1 
Collecting numpy==1.20rc1
  Using cached numpy-1.20.0rc1.zip (7.7 MB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
    Preparing wheel metadata ... done
Building wheels for collected packages: numpy
  Building wheel for numpy (PEP 517) ... error
  ERROR: Command errored out with exit status 1:
   command: /Users/chris/miniforge3/bin/python3.9 /Users/chris/miniforge3/lib/python3.9/site-packages/pip/_vendor/pep517/_in_process.py build_wheel /var/folders/4p/cxb1ys0s4wz6dh3rtj22pq8w0000gn/T/tmpsavxx_s9
       cwd: /private/var/folders/4p/cxb1ys0s4wz6dh3rtj22pq8w0000gn/T/pip-install-t6cbgk1m/numpy_96013a2b4f96408886c142437efacd24
  Complete output (708 lines):
  Running from numpy source directory.
  numpy/random/_bounded_integers.pxd.in has not changed
  numpy/random/_philox.pyx has not changed
  numpy/random/_bounded_integers.pyx.in has not changed
  numpy/random/_sfc64.pyx has not changed
  numpy/random/_mt19937.pyx has not changed
  numpy/random/bit_generator.pyx has not changed
  Processing numpy/random/_bounded_integers.pyx
  numpy/random/mtrand.pyx has not changed
  numpy/random/_generator.pyx has not changed
  numpy/random/_pcg64.pyx has not changed
  numpy/random/_common.pyx has not changed
  Cythonizing sources
  blas_opt_info:
  blas_mkl_info:
  customize UnixCCompiler
    libraries mkl_rt not found in ['/Users/chris/miniforge3/lib', '/usr/local/lib', '/usr/lib', '/opt/local/lib']
    NOT AVAILABLE
  
  blis_info:
    libraries blis not found in ['/Users/chris/miniforge3/lib', '/usr/local/lib', '/usr/lib', '/opt/local/lib']
    NOT AVAILABLE
  
  openblas_info:
    libraries openblas not found in ['/Users/chris/miniforge3/lib', '/usr/local/lib', '/usr/lib', '/opt/local/lib']
    NOT AVAILABLE
  
  atlas_3_10_blas_threads_info:
  Setting PTATLAS=ATLAS
    libraries tatlas not found in ['/Users/chris/miniforge3/lib', '/usr/local/lib', '/usr/lib', '/opt/local/lib']
    NOT AVAILABLE
  
  atlas_3_10_blas_info:
    libraries satlas not found in ['/Users/chris/miniforge3/lib', '/usr/local/lib', '/usr/lib', '/opt/local/lib']
    NOT AVAILABLE
  
  atlas_blas_threads_info:
  Setting PTATLAS=ATLAS
    libraries ptf77blas,ptcblas,atlas not found in ['/Users/chris/miniforge3/lib', '/usr/local/lib', '/usr/lib', '/opt/local/lib']
    NOT AVAILABLE
  
  atlas_blas_info:
    libraries f77blas,cblas,atlas not found in ['/Users/chris/miniforge3/lib', '/usr/local/lib', '/usr/lib', '/opt/local/lib']
    NOT AVAILABLE
  
  /private/var/folders/4p/cxb1ys0s4wz6dh3rtj22pq8w0000gn/T/pip-install-t6cbgk1m/numpy_96013a2b4f96408886c142437efacd24/numpy/distutils/system_info.py:1989: UserWarning:
      Optimized (vendor) Blas libraries are not found.
      Falls back to netlib Blas library which has worse performance.
      A better performance should be easily gained by switching
      Blas library.
    if self._calc_info(blas):
  blas_info:
  C compiler: clang -pthread -Wno-unused-result -Wsign-compare -Wunreachable-code -DNDEBUG -fwrapv -O2 -Wall -fPIC -O2 -isystem /Users/chris/miniforge3/include -arch arm64 -fPIC -O2 -isystem /Users/chris/miniforge3/include -arch arm64
  
  creating /var/folders/4p/cxb1ys0s4wz6dh3rtj22pq8w0000gn/T/tmpyamaythh/var
  creating /var/folders/4p/cxb1ys0s4wz6dh3rtj22pq8w0000gn/T/tmpyamaythh/var/folders
  creating /var/folders/4p/cxb1ys0s4wz6dh3rtj22pq8w0000gn/T/tmpyamaythh/var/folders/4p
  creating /var/folders/4p/cxb1ys0s4wz6dh3rtj22pq8w0000gn/T/tmpyamaythh/var/folders/4p/cxb1ys0s4wz6dh3rtj22pq8w0000gn
  creating /var/folders/4p/cxb1ys0s4wz6dh3rtj22pq8w0000gn/T/tmpyamaythh/var/folders/4p/cxb1ys0s4wz6dh3rtj22pq8w0000gn/T
  creating /var/folders/4p/cxb1ys0s4wz6dh3rtj22pq8w0000gn/T/tmpyamaythh/var/folders/4p/cxb1ys0s4wz6dh3rtj22pq8w0000gn/T/tmpyamaythh
  compile options: '-I/usr/local/include -I/opt/local/include -I/Users/chris/miniforge3/include -c'
  clang: /var/folders/4p/cxb1ys0s4wz6dh3rtj22pq8w0000gn/T/tmpyamaythh/source.c
  /var/folders/4p/cxb1ys0s4wz6dh3rtj22pq8w0000gn/T/tmpyamaythh/source.c:1:10: fatal error: 'cblas.h' file not found
  #include <cblas.h>
           ^~~~~~~~~
  1 error generated.

mattip · 2020-12-13T13:06:07Z

homebrew install openblas or so?

neurolabusc · 2020-12-13T16:30:26Z

I have been able to install many native libraries with homebrew by installing from source. However, the successful formulas appear to compile with clang, not gcc. It looks to me like openblas demands gcc:

$export ARCHFLAGS='-arch arm64'
$brew install -s openblas

Warning: You are running macOS on a arm64 CPU architecture.
We do not provide support for this (yet).
Reinstall Homebrew under Rosetta 2 until we support it.
You will encounter build failures with some formulae.
Please create pull requests instead of asking for help on Homebrew's GitHub,
Twitter or any other official channels. You are responsible for resolving
any issues you experience while you are running this
unsupported configuration.

==> Downloading https://raw.githubusercontent.com/Homebrew/formula-patches/7baf6e2f/gcc/bigsur.diff
######################################################################## 100.0%
==> Downloading https://ftp.gnu.org/gnu/gcc/gcc-10.2.0/gcc-10.2.0.tar.xz
Already downloaded: /Users/chrisrorden/Library/Caches/Homebrew/downloads/c8d4ca732a98ae691d04472b15de6d9e06a09016499af6ff16c4f55081bfc6b9--gcc-10.2.0.tar.xz

...
Error: You are running macOS on a arm64 CPU architecture.

As a Hail Mary, I tried to compile Iain Sandoe's experimental gcc build from source. I could not find any explicit instructions, so used a generic recipe, which did not go well...

$git clone https://github.com/iains/gcc-darwin-arm64
$cd gcc-darwin-arm64
$contrib/download_prerequisites
$mkdir build && cd build
$../configure --prefix=/usr/local/gcc-11 \
              --enable-checking=release \
              --enable-languages=c,c++,fortran \
              --disable-multilib \
              --with-sysroot=/Library/Developer/CommandLineTools/SDKs/MacOSX.sdk \
              --program-suffix=-11

$make -j 
...

make[3]: /Users/chrisrorden/src/gcc-darwin-arm64/build/./prev-gcc/xg++: Resource temporarily unavailable
xg++: fatal error: cannot execute ‘/Users/chrisrorden/src/gcc-darwin-arm64/build/./prev-gcc/cc1plus’: vfork: Operation timed out
compilation terminated.

stweil · 2020-12-13T19:24:40Z

I compiled OpenBLAS (git master) with clang and built numpy (git master) with it. The result is not faster:

            max all finite   1.97
            min all finite   1.97
               max all nan   1.97
               min all nan   1.98
   11.37 real         9.84 user         0.30 sys

A second run starts faster:

    8.28 real        10.09 user         0.03 sys

mattip · 2020-12-14T04:31:42Z

I wonder what SIMD features are exposed in native and in translated mode. Could you try this code with 1.20?

python -c "from numpy.core._multiarray_umath import __cpu_features__ as feat; print([f for f in feat if feat[f]])"

At some point we should expose a nicer interface to the __cpu_features__ dictionary of SIMD features detected at runtime.

stweil · 2020-12-14T05:43:03Z

Native ARM with numpy git master:

% python -c "from numpy.core._multiarray_umath import __cpu_features__ as feat; print([f for f in feat if feat[f]])"
['NEON', 'NEON_FP16', 'NEON_VFPV4', 'ASIMD', 'FPHP', 'ASIMDHP']

Do min and max as in the test code above use SIMD features?

mattip · 2020-12-14T06:31:13Z

What does the translated processor show? There is SIMD code for avx512f, but none for Neon (yet).

stweil · 2020-12-14T06:53:01Z

That might explain the results. I don't have a translated Python installed. @neurolabusc, could you please test that?

stweil · 2020-12-14T07:20:51Z

There is SIMD code for avx512f, but none for Neon (yet).

There is a numpy/core/src/common/simd/neon/ in git master. Which part is missing?

mattip · 2020-12-14T08:54:45Z

The code in numpy/core/src/common/simd/neon is the infrastructure used by universal intrinsics to abstract away architecture-specific code inside the actual implementations. What is missing is that someone needs to rewrite the avx512f code for minumum and maximum to use the universal intrinsics. xref gh-17985 which is doing the rewrite for add/subtract/multiply/divide

neurolabusc · 2020-12-14T15:36:18Z

Rosetta supports different forms of SSE, but does not support any variant of AVX. The times for running each test on a macOS Intel i5-8259U are ~0.6s and on a Linux AMD 3900x ~0.33. Neither of those computers support avx512f.

Note that for many benchmarks the native code outperforms translated code. Here are the nibabel benchmarks which rely on numpy. The only test from that battery which showed a regression was bench_finite_range.py, which led me to drill down and determine that the np.min and np/max functions were specifically slow.

I also wrote a C program that reports the maximum for an array of 64-bit doubles of the same size. The program times both scalar and SIMD (SSE/neon) instructions. The translated X86 code is much slower than native code for scalar, but they perform just as fast for the SIMD. The caveat with this test mimicking numpy's NaN propagation behavior (e.g. amax vs nanmax).

$g++ -O3 -o tstX86 main.cpp  -target x86_64-apple-macos10.12 -DmyDisableAVX; ./tstX86 4
Reporting minimum time for 4 tests
max64=1: min/mean	2091	2091	ms
max64SSE=1: min/mean	332	332	ms
max64(NaN)=1: min/mean	2086	2088	ms
max64SSE(NaN)=0.793516: min/mean	332	332	ms


$make -j; ./tst 4
g++ -O3 -o tst main.cpp -march=armv8-a+fp+simd+crypto+crc
Reporting minimum time for 4 tests
max64=1: min/mean	659	659	ms
max64SSE=1: min/mean	330	330	ms
max64(NaN)=1: min/mean	658	659	ms
max64SSE(NaN)=nan: min/mean	330	330	ms

neurolabusc · 2021-03-23T20:01:43Z

Just an update that same performance observed with numpy-1.20.1 with Python 3.9.2

mattip · 2021-03-23T23:01:52Z

Nothing will change until someone rewrites the code for max, min to use universal intrinsics.

akbir · 2021-05-05T08:21:50Z

Hi @mattip, I have access to hardware and would love to try get this moving. Can we make a specifc issue for this and I'll try get started?

Will certainly need help but keen to sink some hours into this!

neurolabusc · 2021-08-05T19:01:33Z

@Developer-Ecosystem-Engineering numpy is a tremendously useful and popular tool. Extending ARM support for universal intrinsics could have a profound impact.

Developer-Ecosystem-Engineering · 2021-08-19T18:37:56Z

@Developer-Ecosystem-Engineering numpy is a tremendously useful and popular tool. Extending ARM support for universal intrinsics could have a profound impact.

Accelerate support was re-enabled here #18874, its worth checking with that support

This fixes numpy#17989 by adding ARM NEON implementations for min/max and fmin/max. Before: Rosetta faster than native arm64 by `1.2x - 8.6x`. After: Native arm64 faster than Rosetta by `1.6x - 6.7x`. (2.8x - 15.5x improvement) **Benchmarks** ``` before after ratio [b0e1a44] [8301ffd7] <main> <gh-issue-17989/improve-neon-min-max> + 32.6±0.04μs 37.5±0.08μs 1.15 bench_ufunc_strides.Unary.time_ufunc(<ufunc 'ceil'>, 2, 1, 'd') + 32.6±0.06μs 37.5±0.04μs 1.15 bench_ufunc_strides.Unary.time_ufunc(<ufunc 'floor'>, 2, 1, 'd') + 37.8±0.09μs 43.2±0.09μs 1.14 bench_ufunc_strides.Unary.time_ufunc(<ufunc 'floor'>, 4, 4, 'f') + 37.7±0.09μs 42.9±0.1μs 1.14 bench_ufunc_strides.Unary.time_ufunc(<ufunc 'floor'>, 2, 2, 'd') + 37.9±0.2μs 43.0±0.02μs 1.14 bench_ufunc_strides.Unary.time_ufunc(<ufunc 'ceil'>, 2, 2, 'd') + 37.7±0.01μs 42.3±1μs 1.12 bench_ufunc_strides.Unary.time_ufunc(<ufunc 'conjugate'>, 2, 2, 'd') + 34.2±0.07μs 38.1±0.05μs 1.12 bench_ufunc_strides.Unary.time_ufunc(<ufunc 'floor'>, 4, 2, 'f') + 32.6±0.03μs 35.8±0.04μs 1.10 bench_ufunc_strides.Unary.time_ufunc(<ufunc 'floor'>, 4, 1, 'f') + 37.1±0.1μs 40.3±0.1μs 1.09 bench_ufunc_strides.Unary.time_ufunc(<ufunc 'ceil'>, 1, 2, 'd') + 37.2±0.1μs 40.3±0.04μs 1.08 bench_ufunc_strides.Unary.time_ufunc(<ufunc 'floor'>, 2, 4, 'f') + 37.1±0.09μs 40.3±0.07μs 1.08 bench_ufunc_strides.Unary.time_ufunc(<ufunc 'floor'>, 1, 2, 'd') + 68.6±0.5μs 74.2±0.3μs 1.08 bench_ufunc_strides.Unary.time_ufunc(<ufunc 'ceil'>, 4, 4, 'd') + 37.1±0.2μs 40.0±0.1μs 1.08 bench_ufunc_strides.Unary.time_ufunc(<ufunc 'conjugate'>, 1, 2, 'd') + 2.42±0μs 2.61±0.05μs 1.08 bench_core.CountNonzero.time_count_nonzero_axis(3, 100, <class 'numpy.int16'>) + 69.1±0.7μs 73.5±0.7μs 1.06 bench_ufunc_strides.Unary.time_ufunc(<ufunc 'conjugate'>, 4, 4, 'd') + 54.7±0.3μs 58.0±0.2μs 1.06 bench_ufunc_strides.Unary.time_ufunc(<ufunc 'ceil'>, 2, 4, 'd') + 54.5±0.2μs 57.8±0.2μs 1.06 bench_ufunc_strides.Unary.time_ufunc(<ufunc 'conjugate'>, 2, 4, 'd') + 3.78±0.04μs 4.00±0.02μs 1.06 bench_core.CountNonzero.time_count_nonzero_multi_axis(2, 100, <class 'str'>) + 54.8±0.2μs 57.9±0.3μs 1.06 bench_ufunc_strides.Unary.time_ufunc(<ufunc 'floor'>, 2, 4, 'd') + 3.68±0.01μs 3.87±0.02μs 1.05 bench_core.CountNonzero.time_count_nonzero_multi_axis(1, 100, <class 'object'>) + 69.6±0.2μs 73.1±0.2μs 1.05 bench_ufunc_strides.Unary.time_ufunc(<ufunc 'floor'>, 4, 4, 'd') + 229±2μs 241±0.2μs 1.05 bench_random.Bounded.time_bounded('PCG64', [<class 'numpy.uint64'>, 1535]) - 73.0±0.8μs 69.5±0.2μs 0.95 bench_ufunc_strides.Unary.time_ufunc(<ufunc 'trunc'>, 4, 4, 'd') - 37.6±0.1μs 35.7±0.3μs 0.95 bench_ufunc_strides.Unary.time_ufunc(<ufunc 'trunc'>, 1, 4, 'f') - 88.7±0.04μs 84.2±0.7μs 0.95 bench_lib.Pad.time_pad((256, 128, 1), 1, 'wrap') - 57.9±0.2μs 54.8±0.2μs 0.95 bench_ufunc_strides.Unary.time_ufunc(<ufunc 'trunc'>, 2, 4, 'd') - 39.9±0.2μs 37.2±0.04μs 0.93 bench_ufunc_strides.Unary.time_ufunc(<ufunc 'positive'>, 1, 2, 'd') - 2.66±0.01μs 2.47±0.01μs 0.93 bench_lib.Nan.time_nanmin(200, 0) - 2.65±0.02μs 2.46±0.04μs 0.93 bench_lib.Nan.time_nanmin(200, 50.0) - 2.64±0.01μs 2.45±0.01μs 0.93 bench_lib.Nan.time_nanmax(200, 90.0) - 2.64±0μs 2.44±0.02μs 0.92 bench_lib.Nan.time_nanmax(200, 0) - 2.68±0.02μs 2.48±0μs 0.92 bench_lib.Nan.time_nanmax(200, 2.0) - 40.2±0.01μs 37.1±0.1μs 0.92 bench_ufunc_strides.Unary.time_ufunc(<ufunc 'ceil'>, 2, 4, 'f') - 2.69±0μs 2.47±0μs 0.92 bench_lib.Nan.time_nanmin(200, 2.0) - 2.70±0.02μs 2.48±0.02μs 0.92 bench_lib.Nan.time_nanmax(200, 0.1) - 2.70±0μs 2.47±0μs 0.91 bench_lib.Nan.time_nanmin(200, 90.0) - 2.70±0μs 2.46±0μs 0.91 bench_lib.Nan.time_nanmin(200, 0.1) - 2.70±0μs 2.42±0.01μs 0.90 bench_lib.Nan.time_nanmax(200, 50.0) - 11.8±0.6ms 10.6±0.6ms 0.89 bench_core.CountNonzero.time_count_nonzero_axis(2, 1000000, <class 'str'>) - 42.7±0.1μs 37.8±0.02μs 0.88 bench_ufunc_strides.Unary.time_ufunc(<ufunc 'positive'>, 2, 2, 'd') - 42.8±0.03μs 37.8±0.2μs 0.88 bench_ufunc_strides.Unary.time_ufunc(<ufunc 'rint'>, 2, 2, 'd') - 43.1±0.2μs 37.7±0.09μs 0.87 bench_ufunc_strides.Unary.time_ufunc(<ufunc 'ceil'>, 4, 4, 'f') - 37.5±0.07μs 32.6±0.06μs 0.87 bench_ufunc_strides.Unary.time_ufunc(<ufunc 'rint'>, 2, 1, 'd') - 41.7±0.03μs 36.3±0.07μs 0.87 bench_ufunc_strides.Unary.time_ufunc(<ufunc '_ones_like'>, 1, 4, 'd') - 166±0.8μs 144±1μs 0.87 bench_ufunc.UFunc.time_ufunc_types('fmin') - 11.6±0.8ms 10.0±0.01ms 0.87 bench_core.CountNonzero.time_count_nonzero_multi_axis(2, 1000000, <class 'str'>) - 167±0.9μs 144±2μs 0.86 bench_ufunc.UFunc.time_ufunc_types('minimum') - 168±4μs 143±0.5μs 0.85 bench_ufunc.UFunc.time_ufunc_types('fmax') - 167±1μs 142±0.8μs 0.85 bench_ufunc.UFunc.time_ufunc_types('maximum') - 7.10±0μs 4.97±0.01μs 0.70 bench_ufunc_strides.AVX_BFunc.time_ufunc('minimum', 'd', 2) - 7.11±0.07μs 4.96±0.01μs 0.70 bench_ufunc_strides.AVX_BFunc.time_ufunc('maximum', 'd', 2) - 7.05±0.07μs 4.68±0μs 0.66 bench_ufunc_strides.AVX_BFunc.time_ufunc('minimum', 'f', 4) - 7.13±0μs 4.68±0.01μs 0.66 bench_ufunc_strides.AVX_BFunc.time_ufunc('maximum', 'f', 4) - 461±0.2μs 297±7μs 0.64 bench_app.MaxesOfDots.time_it - 7.04±0.07μs 3.95±0μs 0.56 bench_ufunc_strides.AVX_BFunc.time_ufunc('maximum', 'f', 2) - 7.06±0.06μs 3.95±0.01μs 0.56 bench_ufunc_strides.AVX_BFunc.time_ufunc('minimum', 'f', 2) - 7.09±0.06μs 3.24±0μs 0.46 bench_ufunc_strides.AVX_BFunc.time_ufunc('minimum', 'd', 1) - 7.12±0.07μs 3.25±0.02μs 0.46 bench_ufunc_strides.AVX_BFunc.time_ufunc('maximum', 'd', 1) - 14.5±0.02μs 3.98±0μs 0.27 bench_reduce.MinMax.time_max(<class 'numpy.int64'>) - 14.6±0.1μs 4.00±0.01μs 0.27 bench_reduce.MinMax.time_min(<class 'numpy.int64'>) - 6.88±0.06μs 1.34±0μs 0.19 bench_ufunc_strides.AVX_BFunc.time_ufunc('maximum', 'f', 1) - 7.00±0μs 1.33±0μs 0.19 bench_ufunc_strides.AVX_BFunc.time_ufunc('minimum', 'f', 1) - 39.4±0.01μs 3.95±0.01μs 0.10 bench_reduce.MinMax.time_min(<class 'numpy.float64'>) - 39.4±0.01μs 3.95±0.02μs 0.10 bench_reduce.MinMax.time_max(<class 'numpy.float64'>) - 254±0.02μs 22.8±0.2μs 0.09 bench_lib.Nan.time_nanmax(200000, 50.0) - 253±0.1μs 22.7±0.1μs 0.09 bench_lib.Nan.time_nanmin(200000, 0) - 254±0.06μs 22.7±0.09μs 0.09 bench_lib.Nan.time_nanmin(200000, 2.0) - 254±0.01μs 22.7±0.03μs 0.09 bench_lib.Nan.time_nanmin(200000, 0.1) - 254±0.04μs 22.7±0.02μs 0.09 bench_lib.Nan.time_nanmin(200000, 50.0) - 253±0.1μs 22.7±0.04μs 0.09 bench_lib.Nan.time_nanmax(200000, 0.1) - 253±0.03μs 22.7±0.04μs 0.09 bench_lib.Nan.time_nanmin(200000, 90.0) - 253±0.02μs 22.7±0.07μs 0.09 bench_lib.Nan.time_nanmax(200000, 0) - 254±0.03μs 22.7±0.02μs 0.09 bench_lib.Nan.time_nanmax(200000, 90.0) - 254±0.09μs 22.7±0.04μs 0.09 bench_lib.Nan.time_nanmax(200000, 2.0) - 39.2±0.01μs 2.51±0.01μs 0.06 bench_reduce.MinMax.time_max(<class 'numpy.float32'>) - 39.2±0.01μs 2.50±0.01μs 0.06 bench_reduce.MinMax.time_min(<class 'numpy.float32'>) ``` Size change of _multiarray_umath.cpython-39-darwin.so: Before: 3,890,723 After: 3,924,035 Change: +33,312 (~ +0.856 %)

rgommers · 2021-10-20T17:07:38Z

@neurolabusc there's a fix for this in PR gh-20131 thanks to @Developer-Ecosystem-Engineering. It's be great if you could test or review that PR.

neurolabusc · 2021-10-20T18:00:51Z

@rgommers thanks for bringing this to my attention. While I develop other tools, both Numpy and SIMD intrinsics are well outside my expertise. Therefore, I do not think I am suitable to review this PR. @Developer-Ecosystem-Engineering thanks for this PR that includes my specific test but also provides SIMD intrinsics for a wide range of computations. This will benefit macOS users as well as those using other ARM CPUs. This looks like a tremendous contribution!

Once the PR is accepted, this issue can be closed.

…_max BUG: min/max is slow, re-implement using NEON (#17989)

neurolabusc · 2022-06-23T11:10:35Z

This fix was introduced in numpy 1.23 which was released on June 22, 2022. On the same computer as my original post:

                max all finite   0.14
                min all finite   0.14
                   max all nan   0.14
                   min all nan   0.14
python ./numpy_native_slower_than_translated.py   2.76s  user 0.07s system 295% cpu 0.959 total

Thanks!

neurolabusc added a commit to neurolabusc/simd that referenced this issue Dec 14, 2020

Native versus translated fmax (numpy/numpy#17989)

6ee6069

rgommers mentioned this issue Jan 9, 2021

Please provide universal2 wheels on macOS #18143

Closed

neurolabusc mentioned this issue Oct 2, 2021

Updates on support programmes neurolabusc/AppleSiliconForNeuroimaging#5

Closed

Developer-Ecosystem-Engineering mentioned this issue Oct 18, 2021

BUG: min/max is slow, re-implement using NEON (#17989) #20131

Merged

mattip mentioned this issue Jan 11, 2022

BENCH: consistently test benchmarks (specifically argmax/argmin) #20785

Open

mattip closed this as completed in #20131 Jan 11, 2022

mattip added a commit that referenced this issue Jan 11, 2022

Merge pull request #20131 from Developer-Ecosystem-Engineering/as_min…

2d74972

…_max BUG: min/max is slow, re-implement using NEON (#17989)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Native code order of magnitude slower than translated code on Apple M1 #17989

Native code order of magnitude slower than translated code on Apple M1 #17989

neurolabusc commented Dec 12, 2020

mattip commented Dec 12, 2020

neurolabusc commented Dec 13, 2020

mattip commented Dec 13, 2020

neurolabusc commented Dec 13, 2020

stweil commented Dec 13, 2020 •

edited

Loading

mattip commented Dec 14, 2020

stweil commented Dec 14, 2020 •

edited

Loading

mattip commented Dec 14, 2020

stweil commented Dec 14, 2020 •

edited

Loading

stweil commented Dec 14, 2020

mattip commented Dec 14, 2020

neurolabusc commented Dec 14, 2020

neurolabusc commented Mar 23, 2021

mattip commented Mar 23, 2021

akbir commented May 5, 2021

neurolabusc commented Aug 5, 2021

Developer-Ecosystem-Engineering commented Aug 19, 2021

rgommers commented Oct 20, 2021

neurolabusc commented Oct 20, 2021

neurolabusc commented Jun 23, 2022

Native code order of magnitude slower than translated code on Apple M1 #17989

Native code order of magnitude slower than translated code on Apple M1 #17989

Comments

neurolabusc commented Dec 12, 2020

Reproducing code example:

Performance:

NumPy/Python version information:

mattip commented Dec 12, 2020

neurolabusc commented Dec 13, 2020

mattip commented Dec 13, 2020

neurolabusc commented Dec 13, 2020

stweil commented Dec 13, 2020 • edited Loading

mattip commented Dec 14, 2020

stweil commented Dec 14, 2020 • edited Loading

mattip commented Dec 14, 2020

stweil commented Dec 14, 2020 • edited Loading

stweil commented Dec 14, 2020

mattip commented Dec 14, 2020

neurolabusc commented Dec 14, 2020

neurolabusc commented Mar 23, 2021

mattip commented Mar 23, 2021

akbir commented May 5, 2021

neurolabusc commented Aug 5, 2021

Developer-Ecosystem-Engineering commented Aug 19, 2021

rgommers commented Oct 20, 2021

neurolabusc commented Oct 20, 2021

neurolabusc commented Jun 23, 2022

stweil commented Dec 13, 2020 •

edited

Loading

stweil commented Dec 14, 2020 •

edited

Loading

stweil commented Dec 14, 2020 •

edited

Loading