Skip to content

MNT: build numpy with link-time optimization in benchmarks #26616

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 1 commit into from

Conversation

ngoldbaum
Copy link
Member

@ngoldbaum ngoldbaum commented Jun 5, 2024

I've been trying to benchmark a big refactor of numpy in #26607. I remembered a blog post from @vstinner years ago on python performance work, where he found turning on LTO improved reproducibility. See e.g. this LWN summary. Can't seem to find the original blog post...

Now that we have meson, it's easy to use LTO. It also doesn't take that much more time to build numpy with LTO, at least on my ARM Macbook.

This makes benchmarks where I was seeing substantial changes in speed comparing with main have no difference at all. I haven't tried the full benchmark suite but I suspect the remaining changes will be "real" differences and not just due to code shuffling in the binary or changing size slightly.

Copy link
Member

@rgommers rgommers left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This sounds like a good idea if you have observed that it helps.

It also raised the question for me about whether we need to use LTO anywhere else. IIRC building CPython from source doesn't use LTO, but all the installers on python.org do use it. It's not clear to me why that setup was chosen though - maybe backwards compat, portability, or to not make the choice for distros?

@@ -18,7 +18,7 @@
"branches": ["HEAD"],

"build_command": [
"python -m build --wheel -o {build_cache_dir} {build_dir}"
"python -m build --wheel -C'setup-args=-Db_lto=true' -C'setup-args=-Dbuildtype=release' -o {build_cache_dir} {build_dir}"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"python -m build --wheel -C'setup-args=-Db_lto=true' -C'setup-args=-Dbuildtype=release' -o {build_cache_dir} {build_dir}"
"python -m build --wheel -Csetup-args=-Db_lto=true -o {build_cache_dir} {build_dir}"

Trying to write this in a slightly more idiomatic way. Release buildtype is the default when building a wheel, and the quotes weren't necessary

@vstinner
Copy link
Contributor

vstinner commented Jun 5, 2024

I've been trying to benchmark a big refactor of numpy in #26607. I remembered a blog post from @vstinner years ago on python performance work, where he found turning on LTO improved reproducibility. See e.g. this LWN summary. Can't seem to find the original blog post...

Hi, @vstinner here :-) I listed my articles there: https://vstinner.readthedocs.io/benchmark.html

Maybe you're thinking at this article which is about PGO: https://vstinner.github.io/journey-to-stable-benchmark-deadcode.html

@mattip
Copy link
Member

mattip commented Jun 5, 2024

IRC building CPython from source doesn't use LTO, but all the installers on python.org do use it

I wonder if it has to to with reproducible builds.

@vstinner
Copy link
Contributor

vstinner commented Jun 5, 2024

Python installers are built with LTO+PGO, as do Linux distributions such as Ubuntu and Fedora.

It's not done by default by ./configure because LTO and/or PGO were unstable a few years ago (ex: GCC had internal bugs, clang on old macOS didn't support it, etc.), and it's way slower. See the doc for recommended options: https://docs.python.org/dev/using/configure.html#performance-options

@rgommers
Copy link
Member

rgommers commented Jun 5, 2024

@ngoldbaum on what platform(s) did you test this? It doesn't seem to work for me with conda compilers on Linux x86-64:

$ # Note: `spin build -- -Db_lto=true -Dbuildtype=release` behaves the same
$ python -m build -wnx -Csetup-args=-Db_lto=true
...
[578/610] Linking static target numpy/_core/lib_simd_mtargets.a
/home/rgommers/mambaforge/envs/numpy-dev/bin/x86_64-conda-linux-gnu-ar: numpy/_core/lib_simd.dispatch.h_baseline.a.p/meson-generated__simd.dispatch.c.o: plugin needed to handle lto object
[580/610] Linking target numpy/_core/_simd.cpython-311-x86_64-linux-gnu.so
FAILED: numpy/_core/_simd.cpython-311-x86_64-linux-gnu.so 
/home/rgommers/mambaforge/envs/numpy-dev/bin/x86_64-conda-linux-gnu-c++  -o numpy/_core/_simd.cpython-311-x86_64-linux-gnu.so numpy/_core/_simd.cpython-311-x86_64-linux-gnu.so.p/src_common_npy_cpu_features.c.o numpy/_core/_simd.cpython-311-x86_64-linux-gnu.so.p/src__simd__simd.c.o -L/home/rgommers/mambaforge/envs/numpy-dev/lib -flto -Wl,--as-needed -Wl,--allow-shlib-undefined -Wl,-O1 -shared -fPIC -Wl,-O2 -Wl,--sort-common -Wl,--as-needed -Wl,-z,relro -Wl,-z,now -Wl,--disable-new-dtags -Wl,--gc-sections -Wl,--allow-shlib-undefined -Wl,-rpath,/home/rgommers/mambaforge/envs/numpy-dev/lib -Wl,-rpath-link,/home/rgommers/mambaforge/envs/numpy-dev/lib -fvisibility-inlines-hidden -fmessage-length=0 -march=nocona -mtune=haswell -ftree-vectorize -fPIC -fstack-protector-strong -fno-plt -O2 -ffunction-sections -pipe -isystem /home/rgommers/mambaforge/envs/numpy-dev/include -DNDEBUG -D_FORTIFY_SOURCE=2 -O2 -isystem /home/rgommers/mambaforge/envs/numpy-dev/include -Wl,--start-group numpy/_core/libnpymath.a numpy/_core/lib_simd_mtargets.a -Wl,--end-group
/home/rgommers/mambaforge/envs/numpy-dev/bin/../lib/gcc/x86_64-conda-linux-gnu/12.3.0/../../../../x86_64-conda-linux-gnu/bin/ld: /tmp/ccHtNG5C.ltrans0.ltrans.o: in function `PyInit__simd':
<artificial>:(.text.PyInit__simd+0x5e5): undefined reference to `simd_create_module'
/home/rgommers/mambaforge/envs/numpy-dev/bin/../lib/gcc/x86_64-conda-linux-gnu/12.3.0/../../../../x86_64-conda-linux-gnu/bin/ld: <artificial>:(.text.PyInit__simd+0x708): undefined reference to `simd_create_module_FMA3'
/home/rgommers/mambaforge/envs/numpy-dev/bin/../lib/gcc/x86_64-conda-linux-gnu/12.3.0/../../../../x86_64-conda-linux-gnu/bin/ld: <artificial>:(.text.PyInit__simd+0x755): undefined reference to `simd_create_module_FMA3__AVX2'
/home/rgommers/mambaforge/envs/numpy-dev/bin/../lib/gcc/x86_64-conda-linux-gnu/12.3.0/../../../../x86_64-conda-linux-gnu/bin/ld: <artificial>:(.text.PyInit__simd+0x7bf): undefined reference to `simd_create_module_SSE42'
/home/rgommers/mambaforge/envs/numpy-dev/bin/../lib/gcc/x86_64-conda-linux-gnu/12.3.0/../../../../x86_64-conda-linux-gnu/bin/ld: <artificial>:(.text.PyInit__simd+0x818): undefined reference to `simd_create_module_AVX2'
/home/rgommers/mambaforge/envs/numpy-dev/bin/../lib/gcc/x86_64-conda-linux-gnu/12.3.0/../../../../x86_64-conda-linux-gnu/bin/ld: <artificial>:(.text.PyInit__simd+0x851): undefined reference to `simd_create_module_AVX512F'
/home/rgommers/mambaforge/envs/numpy-dev/bin/../lib/gcc/x86_64-conda-linux-gnu/12.3.0/../../../../x86_64-conda-linux-gnu/bin/ld: <artificial>:(.text.PyInit__simd+0x8a5): undefined reference to `simd_create_module_AVX512_SKX'
/home/rgommers/mambaforge/envs/numpy-dev/bin/../lib/gcc/x86_64-conda-linux-gnu/12.3.0/../../../../x86_64-conda-linux-gnu/bin/ld: numpy/_core/_simd.cpython-311-x86_64-linux-gnu.so: hidden symbol `simd_create_module_AVX512_SKX' isn't defined
/home/rgommers/mambaforge/envs/numpy-dev/bin/../lib/gcc/x86_64-conda-linux-gnu/12.3.0/../../../../x86_64-conda-linux-gnu/bin/ld: final link failed: bad value
collect2: error: ld returned 1 exit status
[593/610] Compiling C++ object numpy/fft/_pocket...311-x86_64-linux-gnu.so.p/_pocketfft_umath.cpp.o
ninja: build stopped: subcommand failed.

Compiler version:

$ $CC --version
x86_64-conda-linux-gnu-cc (conda-forge gcc 12.3.0-3) 12.3.0
Copyright (C) 2022 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

$ $CC -v
Reading specs from /home/rgommers/mambaforge/envs/numpy-dev/bin/../lib/gcc/x86_64-conda-linux-gnu/12.3.0/specs
could not find specs file conda.specs
COLLECT_GCC=/home/rgommers/mambaforge/envs/numpy-dev/bin/x86_64-conda-linux-gnu-cc
COLLECT_LTO_WRAPPER=/home/rgommers/mambaforge/envs/numpy-dev/bin/../libexec/gcc/x86_64-conda-linux-gnu/12.3.0/lto-wrapper
Target: x86_64-conda-linux-gnu
Configured with: ../configure --prefix=/home/conda/feedstock_root/build_artifacts/gcc_compilers_1699751145211/_h_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placeho --with-slibdir=/home/conda/feedstock_root/build_artifacts/gcc_compilers_1699751145211/_h_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placeho/lib --libdir=/home/conda/feedstock_root/build_artifacts/gcc_compilers_1699751145211/_h_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placeho/lib --mandir=/home/conda/feedstock_root/build_artifacts/gcc_compilers_1699751145211/_h_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placeho/man --build=x86_64-conda-linux-gnu --host=x86_64-conda-linux-gnu --target=x86_64-conda-linux-gnu --enable-default-pie --enable-languages=c,c++,fortran,objc,obj-c++ --enable-__cxa_atexit --disable-libmudflap --enable-libgomp --disable-libssp --enable-libquadmath --enable-libquadmath-support --enable-libsanitizer --enable-lto --enable-threads=posix --enable-target-optspace --enable-plugin --enable-gold --disable-nls --disable-bootstrap --disable-multilib --enable-long-long --with-sysroot=/home/conda/feedstock_root/build_artifacts/gcc_compilers_1699751145211/_h_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placeho/x86_64-conda-linux-gnu/sysroot --with-build-sysroot=/home/conda/feedstock_root/build_artifacts/gcc_compilers_1699751145211/_build_env/x86_64-conda-linux-gnu/sysroot --with-gxx-include-dir=/home/conda/feedstock_root/build_artifacts/gcc_compilers_1699751145211/_h_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placeho/x86_64-conda-linux-gnu/include/c++/12.3.0 --with-pkgversion='conda-forge gcc 12.3.0-3' --with-bugurl=https://github.com/conda-forge/ctng-compilers-feedstock/issues/new/choose
Thread model: posix
Supported LTO compression algorithms: zlib
gcc version 12.3.0 (conda-forge gcc 12.3.0-3) 

The problem here seems to be that gcc-ar (or llvm-ar for Clang) has to be used. I can fix that by either setting the AR env var (export AR=$(which gcc-ar)) or by using a native file like:

[binaries]
ar = '/home/rgommers/mambaforge/envs/numpy-dev/bin/gcc-ar'

and passing that to meson setup, e.g. like:

$ spin build -- -Db_lto=true -Dbuildtype=release --native-file=native-file.txt

That's not automatic though (I haven't checked why that is the case, probably lots of history in issue trackers), so this PR is going to break things.

@rgommers
Copy link
Member

rgommers commented Jun 5, 2024

Also, a few -Wlto-type-mismatch issues flagged by GCC 13:

[595/610] Linking target numpy/linalg/_umath_linalg.cpython-311-x86_64-linux-gnu.so
../numpy/_core/include/numpy/npy_math.h:485:7: warning: type of 'npy_cabsf' does not match original declaration [-Wlto-type-mismatch]
  485 | float npy_cabsf(npy_cfloat z);
      |       ^
../numpy/_core/src/npymath/npy_math_complex.c.src:1758:1: note: type mismatch in parameter 1
 1758 | npy_@kind@@c@(@ctype@ z)
      | ^
../numpy/_core/src/npymath/npy_math_complex.c.src:1758:1: note: type 'npy_cfloat' should match type 'struct npy_cfloat'
../numpy/_core/include/numpy/npy_common.h:371:3: note: the incompatible type is defined here
  371 | } npy_cfloat;
      |   ^
../numpy/_core/src/npymath/npy_math_complex.c.src:1758:1: note: 'npy_cabsf' was previously declared here
 1758 | npy_@kind@@c@(@ctype@ z)
      | ^
../numpy/_core/include/numpy/npy_math.h:457:8: warning: type of 'npy_cabs' does not match original declaration [-Wlto-type-mismatch]
  457 | double npy_cabs(npy_cdouble z);
      |        ^
../numpy/_core/src/npymath/npy_math_complex.c.src:1758:1: note: type mismatch in parameter 1
 1758 | npy_@kind@@c@(@ctype@ z)
      | ^
../numpy/_core/src/npymath/npy_math_complex.c.src:1758:1: note: type 'npy_cdouble' should match type 'struct npy_cdouble'
../numpy/_core/include/numpy/npy_common.h:366:3: note: the incompatible type is defined here
  366 | } npy_cdouble;
      |   ^
../numpy/_core/src/npymath/npy_math_complex.c.src:1758:1: note: 'npy_cabs' was previously declared here
 1758 | npy_@kind@@c@(@ctype@ z)
      | ^
[610/610] Linking target numpy/_core/_multiarray_umath.cpython-311-x86_64-linux-gnu.so
../numpy/_core/src/common/numpyos.h:54:1: warning: type of 'NumPyOS_ascii_tolower' does not match original declaration [-Wlto-type-mismatch]
   54 | NumPyOS_ascii_tolower(char c);
      | ^
../numpy/_core/src/common/numpyos.c:419:1: note: type mismatch in parameter 1
  419 | NumPyOS_ascii_tolower(int c)
      | ^
../numpy/_core/src/common/numpyos.c:419:1: note: type 'int' should match type 'char'
../numpy/_core/src/common/numpyos.c:419:1: note: 'NumPyOS_ascii_tolower' was previously declared here
../numpy/_core/src/multiarray/compiled_base.h:15:1: warning: type of 'arr_interp_complex' does not match original declaration [-Wlto-type-mismatch]
   15 | arr_interp_complex(PyObject *, PyObject *const *, Py_ssize_t, PyObject *, PyObject *);
      | ^
../numpy/_core/src/multiarray/compiled_base.c:668:1: note: type mismatch in parameter 5
  668 | arr_interp_complex(PyObject *NPY_UNUSED(self), PyObject *const *args, Py_ssize_t len_args,
      | ^
../numpy/_core/src/multiarray/compiled_base.c:668:1: note: 'arr_interp_complex' was previously declared here
../numpy/_core/src/multiarray/compiled_base.h:13:1: warning: type of 'arr_interp' does not match original declaration [-Wlto-type-mismatch]
   13 | arr_interp(PyObject *, PyObject *const *, Py_ssize_t, PyObject *, PyObject *);
      | ^
../numpy/_core/src/multiarray/compiled_base.c:498:1: note: type mismatch in parameter 5
  498 | arr_interp(PyObject *NPY_UNUSED(self), PyObject *const *args, Py_ssize_t len_args,
      | ^
../numpy/_core/src/multiarray/compiled_base.c:498:1: note: 'arr_interp' was previously declared here

@ngoldbaum
Copy link
Member Author

on what platform(s) did you test this? It doesn't seem to work for me with conda compilers on Linux x86-64

Darn, in that case I think I'd like to close this PR. If anyone would like to figure out how to get this working nicely under GCC please feel free to take this up again. I don't have time right now to figure out how to get this working under GCC. I was using apple's clang on an ARM Mac and didn't try under gcc before making the PR.

I'm going to keep this in my back pocket as a way to reduce noise in the benchmarks.

It also raised the question for me about whether we need to use LTO anywhere else.

I think work on enabling LTO and PGO in our wheel builds will also lead to improved performance in the binaries we distribute. Definitely not urgent though.

@ngoldbaum ngoldbaum closed this Jun 5, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants