ENH: enable multi-platform SIMD compiler optimizations #13516


Merged
merged 8 commits into from
Jun 17, 2020

Conversation

seiko2plus
Member

@seiko2plus seiko2plus commented May 9, 2019

This pull request makes the following changes:


New Distutils class CCompilerOpt

Used for handling CPU/hardware optimizations: parsing the command arguments, managing the relation between the CPU baseline and dispatch-able features, generating
the required C headers, and finally compiling the sources with the proper compiler flags.

Class CCompilerOpt doesn't provide runtime detection of CPU features;
it focuses only on the compiler side, but it creates abstract C headers
that can be used later by the final runtime dispatching process.


New Build Arguments

  • --cpu-baseline minimal set of required optimizations; the default value is min, which provides the minimum CPU features that can safely run on a wide range of user platforms.
  • --cpu-dispatch dispatched set of additional optimizations; the default value is max -xop -fma4, which enables all CPU features except for AMD legacy features.

The new arguments can be reached through build, build_clib, and build_ext.
If build_clib or build_ext is not specified by the user, the arguments of build will be used instead, which also hold the default values.

Both --cpu-baseline and --cpu-dispatch accept the following CPU features:

X86: SSE, SSE2, SSE3, SSSE3, SSE41, POPCNT, SSE42, AVX, F16C, XOP, FMA4, FMA3, AVX2, AVX512F, AVX512CD, AVX512_KNL, AVX512_KNM, AVX512_SKX, AVX512_CLX, AVX512_CNL, AVX512_ICL

IBM/Power64: VSX, VSX2, VSX3

ARMv7/8: NEON, NEON_FP16, NEON_VFPV4, ASIMD, ASIMDHP, ASIMDDP, ASIMDFHM

Other acceptable options:

  • MIN: Enables the minimum CPU features that can safely run on a wide range of user platforms:
    X86: SSE, SSE2
    X86_64: SSE, SSE2, SSE3
    IBM/Power64 (big-endian): nothing
    IBM/Power64 (little-endian): VSX, VSX2
    ARMv7: nothing
    ARMv8 (64-bit): NEON, NEON_FP16, NEON_VFPV4, ASIMD

  • MAX: Enables all CPU features supported by the compiler and platform.

  • NATIVE: Enables all CPU features supported by the current machine; this operation is based on the compiler flags (-march=native, -xHost, /QxHost).

  • NONE: enables nothing

  • Operand +/-: remove or add features; useful with the options MAX, MIN, and NATIVE.
    NOTE: the operand + is only added for nominal reasons. For example:

    • --cpu-baseline="min avx2" is equivalent to --cpu-baseline="min + avx2"
    • --cpu-baseline="min,avx2" is equivalent to --cpu-baseline="min,+avx2"

NOTES:

  • All CPU features and the other options are case-insensitive.
  • Comma or space can be used as a separator, e.g.
    --cpu-dispatch="avx2 avx512f" and --cpu-dispatch="avx2, avx512f" are both applicable.
  • If a CPU feature is not supported by the user's platform or compiler, it will be skipped rather than
    raising a fatal error.
  • Any CPU feature specified to --cpu-dispatch will be skipped if it is part of the CPU baseline features.
  • Argument --cpu-baseline force-enables implied features,
    e.g. --cpu-baseline="sse42" is equivalent to --cpu-baseline="sse sse2 sse3 ssse3 sse41 popcnt sse42".
  • The value of --cpu-baseline will be treated as "native" if a compiler native flag (-march=native, -xHost or /QxHost) is enabled through the environment variable CFLAGS.
  • The user should always check the final report in the build log to verify the enabled features.
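To make these rules concrete, here is a small illustrative Python sketch of the parsing behavior (case-insensitive tokens, comma/space separators, implied-feature expansion, and `-` removal). The `IMPLIED` table is deliberately tiny and the helper names are hypothetical — this is not CCompilerOpt's actual code, and removal is simplified here to dropping just the named feature.

```python
# Illustrative sketch of how a --cpu-baseline/--cpu-dispatch string could be
# parsed. The implication table is a small, incomplete example.
IMPLIED = {  # each feature implies the features listed for it
    "SSE2": ["SSE"],
    "SSE3": ["SSE2"],
    "SSSE3": ["SSE3"],
    "SSE41": ["SSSE3"],
    "POPCNT": ["SSE41"],
    "SSE42": ["POPCNT"],
}

def expand(feature):
    """Return a feature plus everything it transitively implies."""
    out = {feature}
    for dep in IMPLIED.get(feature, []):
        out |= expand(dep)
    return out

def parse(spec):
    """Parse a feature spec into a sorted list of enabled features."""
    enabled = set()
    for tok in spec.replace(",", " ").split():
        name = tok.lstrip("+-").upper()  # case-insensitive, strip operands
        if tok.startswith("-"):
            enabled.discard(name)        # "-feat" removes the named feature
        else:
            enabled |= expand(name)      # "+feat" or "feat" adds it (implied too)
    return sorted(enabled)
```

For example, `parse("sse42")` yields the whole implied chain, matching the `--cpu-baseline="sse42"` note above.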

Explicitly disable the new infrastructure

Add a new command argument --disable-optimization to explicitly disable the whole new infrastructure. It also adds a new compiler definition called NPY_DISABLE_OPTIMIZATION.

When --disable-optimization is enabled, the dispatch-able sources (*.dispatch.c, explained in this comment) will be treated as normal C sources. Also, because of this disabling, any C headers generated by CCompilerOpt must be guarded with NPY_DISABLE_OPTIMIZATION, otherwise the build will definitely break.


New auto-generated C header(core/src/common/_cpu_dispatch.h)

The new header contains all the definitions and headers of instruction-sets for the CPU baseline and dispatch-able features that have been enabled through the command arguments --cpu-baseline and --cpu-dispatch.

NPY_HAVE_ is the prefix of CPU feature definitions, e.g. NPY_HAVE_SSE2, NPY_HAVE_VSX, NPY_HAVE_NEON.
The new header can be included inside normal C files or dispatch-able C sources (explained below);
however, if the new header is included inside normal C files, then it only provides the definitions and headers of instruction-sets for the CPU baseline features. For example:

#include "_cpu_dispatch.h"
#ifdef NPY_HAVE_AVX2
// this branch will only be enabled if AVX2 is enabled through --cpu-baseline="avx2"
#endif

NOTE: It should not be used directly, but through another new header called npy_cpu_dispatch.h.


New C header(core/src/common/npy_cpu_dispatch.h)

This header contains all the utilities required for the whole CPU dispatching process; it can also be considered a bridge linking the new infrastructure with the NumPy CPU runtime detection (#13421).


New CPU dispatcher solution(dispatch-able sources)

Explained in this comment


Add new attributes to umath module

  • __cpu_baseline__ a list of CPU baseline feature names, as configured by --cpu-baseline

  • __cpu_dispatch__ a list of CPU dispatch feature names, as configured by --cpu-dispatch

The new attributes contain the final enabled CPU features that are supported by the platform.
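On a NumPy built with this infrastructure, the attributes can be inspected directly. This is a usage sketch, not part of this PR's diff; the private module path has moved over NumPy's history, hence the fallback import.

```python
# Inspect the baseline/dispatched CPU features of an installed NumPy build.
try:
    from numpy._core._multiarray_umath import __cpu_baseline__, __cpu_dispatch__
except ImportError:  # older NumPy layouts
    from numpy.core._multiarray_umath import __cpu_baseline__, __cpu_dispatch__

print("baseline:", __cpu_baseline__)
print("dispatch:", __cpu_dispatch__)
```

Since dispatched features that are part of the baseline are skipped, the two lists should never overlap.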


Print the supported CPU features during the run of PytestTester

This could be one of the easiest ways to trace the enabled features during the run of unit tests.

An example on GCC 8.3.0(64-bit) - i7-8550U:

NumPy CPU features: SSE SSE2 SSE3 SSSE3* SSE41* POPCNT* SSE42* AVX* F16C* FMA3* AVX2* AVX512F? AVX512CD? AVX512_KNL? AVX512_KNM? AVX512_SKX? AVX512_CLX? AVX512_CNL? AVX512_ICL?

The format is explained as follows:

  • Dispatched features supported by the running machine end with *

  • Dispatched features not supported by the running machine end with ?

  • The remaining features represent the baseline.

  • If any features are missing, that is because they are:

    • not supported by the compiler or platform, or
    • not configured via the command arguments --cpu-baseline and --cpu-dispatch.
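The marker scheme above can be decoded mechanically; here is a small illustrative helper (hypothetical, not NumPy code) that splits a report line back into the three groups:

```python
def classify(report):
    """Split a 'NumPy CPU features:' report string into baseline features,
    supported dispatched features (*), and unsupported dispatched ones (?)."""
    baseline, supported, unsupported = [], [], []
    for tok in report.split():
        if tok.endswith("*"):
            supported.append(tok[:-1])    # dispatched, machine supports it
        elif tok.endswith("?"):
            unsupported.append(tok[:-1])  # dispatched, machine lacks it
        else:
            baseline.append(tok)          # compiled-in baseline
    return baseline, supported, unsupported
```

Applied to the GCC 8.3.0 / i7-8550U example above, the AVX512 entries land in the unsupported group while SSE/SSE2 remain baseline.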

@seiko2plus seiko2plus force-pushed the core_improve_infa_build branch 2 times, most recently from 026a1b2 to 1c24bd0 Compare May 11, 2019 11:12
@seiko2plus seiko2plus force-pushed the core_improve_infa_build branch 5 times, most recently from 2561efa to 660f621 Compare May 18, 2019 05:41
@seiko2plus seiko2plus force-pushed the core_improve_infa_build branch 2 times, most recently from e3f7603 to 95ffca3 Compare May 26, 2019 23:58
@seiko2plus seiko2plus force-pushed the core_improve_infa_build branch from 95ffca3 to 9e05278 Compare June 4, 2019 17:57
@seiko2plus seiko2plus force-pushed the core_improve_infa_build branch 6 times, most recently from 531f6a0 to 582cb46 Compare July 1, 2019 23:43
@mattip
Member

mattip commented Jul 3, 2019

I am concerned all this will become a maintenance burden moving forward. Is there a way to use this as a library or a subrepo?

@edelsohn

edelsohn commented Jul 3, 2019

What does this comment about library or subrepo mean? This sounds very hostile to NumPy supporting other architectures.

@rgommers
Member

rgommers commented Jul 3, 2019

I am concerned all this will become a maintenance burden moving forward. Is there a way to use this as a library or a subrepo?

@mattip this is quite clearly addressed in the discussion that spurred this PR: gh-13393. And we also discussed it with the Steering Council before saying yes to posting the bounty mentioned in gh-13393. So yes it's a concern, but if things are done the right way there's no reason that this should substantially increase the maintenance burden. Please review that issue and comment on code or ideas that are potentially problematic, no need to start a new abstract discussion.

This sounds very hostile to NumPy supporting other architectures.

@edelsohn that's certainly not our intent as a project. We'd really like to have complete SIMD support for PowerPC as well as other architectures, with the compiler doing most of the heavy lifting.

@rgommers
Member

rgommers commented Jul 3, 2019

That said, this is quite a bit of code. I believe a lot of it is well-tested though, taken over from OpenCV (correct me if I'm wrong @seiko2plus). Some more context in the issue description would be useful:

  • there's a TODO for an example
  • this contains both build-time and run-time features. does that change the situation we have today? IIRC we wanted to move more to run-time, is that achieved?
  • what parts are and aren't covered by CI (we have PPC on TravisCI, were thinking about ARM via drone.io I believe, and conda-forge has ARM too)?

@charris
Member

charris commented Jul 4, 2019

What does this comment about library or subrepo mean?

It is simply another way to organize the code. NumPy already has two submodules that provide tools to build the documentation; putting the code in a submodule allows NumPy to use the code while letting development proceed at its own pace. Likewise, NumPy builds several libraries for the sorting and math code, probably not strictly necessary these days, but routines in the libraries are linked in when needed and, strictly speaking, could be used by downstream projects. You have a number of new files dedicated to this work, so it might be nice to put them together in subdirectories, maybe organized by architecture. Note that the libraries have their own setup and include files and configure the system. So the question here is how we should organize the code, not whether we should have it.

@rkern
Member

rkern commented Jul 4, 2019

It seems to me, after a cursory inspection, that how you want to do the distutils integration might be dispositive. It would be nice to package this up into a separate library with its own distutils integration tooling so that I could write other packages that use this code without needing to use numpy.distutils. numpy.distutils works well for numpy and scipy, but we try not to make it necessary for packages that just want to build against numpy's functionality.

On the other hand, I recognize that the distutils integration required is fairly significant, and welding together two significant distutils augmentations is often fraught, so I can see an argument for avoiding it and just making this one more feature that numpy.distutils alone provides, like f2py. If that's the case, then numpy/numpy is the right place. There isn't much else that would be profitable to break out into a separate package.

One thing to note is that we ought to be able to use this infrastructure to build another package that uses CPU features that the numpy binary itself was not compiled to use. I don't see why that would not be the case in this PR, but I just wanted it stated explicitly. Organizing this as a separate library may help enforce that when future modifications are made.

@seiko2plus
Member Author

@mattip,

I am concerned all this will become a maintenance burden moving forward.

The maintenance burden is already moving forward; I think what I'm doing is just trying to ease it.

Is there a way to use this as a library or a subrepo?

I'm afraid not, at least for now, because the universal intrinsics (mapped intrinsics) are going to be heavily expanded depending on need, and I think they should be part of NumPy's core. Later we should forbid, or at least reduce, the direct use of raw SIMD intrinsics and rely only on universal intrinsics for CPU optimizations.


@edelsohn,

This sounds very hostile to NumPy supporting other architectures.

Actually, most of the work here is mainly for x86, due to the need to deal with many SIMD extensions, compilers, and platforms, which makes it somewhat complicated. I don't see VSX or even NEON creating any obstacles here. Supporting BE mode for VSX may be a bit challenging, but I'm planning to implement it in a separate pull request after I finish re-implementing the current optimized kernels using the new SIMD interface.


@rgommers,

That said, this is quite a bit of code.

Yes, well, I had to add some fundamental universal intrinsics just for testing purposes, but I will certainly add a lot more depending on need.

I believe a lot of it is well-tested though

I believe we shouldn't put our trust in any intrinsics, even ones that map directly to a native instruction set, since
compilers aren't developed by angels; that may be why we should try to cover all the used intrinsics with unit tests.

taken over from OpenCV (correct me if I'm wrong @seiko2plus)

Like I mentioned before in #13393, the whole roadmap is inspired by OpenCV, though not with exactly the same design or code; still, the similarity is very high, especially in the infrastructure.

there's a TODO for an example

Of course, I will not be able to cover all areas in this PR, but I think it's necessary to provide a decent example for illustrative and testing purposes. Please feel free to add any extra tasks.

what parts are and aren't covered by CI (we have PPC on TravisCI, were thinking about ARM via drone.io I believe, and conda-forge has ARM too)?

If it's necessary, having a custom CI using "buildbot" could be an alternative solution too.


@charris,

You have a number of new files dedicated to this work, so it might be nice to put them together in subdirectories, maybe organized by architecture. Note that the libraries have their own setup and include files and configure the system.

Do you think the current work isn't organized enough?

So the question here is how should we organize the code, not whether we should have it

Although this work is still not complete, I would truly appreciate it if you could provide a proposal or notes of any kind.


@rkern,

it seems to me, after a cursory inspection, that how you want to do the distutils integration might be dispositive. It would be nice to package this up into a separate library with its own distutils integration tooling so that I could write other packages that use this code without needing to use numpy.distutils. numpy.distutils works well for numpy and scipy, but we try not to make it necessary for packages that just want to build against numpy's functionality.

On the other hand, I recognize that the distutils integration required is fairly significant, and welding together two significant distutils augmentations is often fraught, so I can see an argument for avoiding it and just making this one more feature that numpy.distutils alone provides, like f2py. If that's the case, then numpy/numpy is the right place. There isn't much else that would be profitable to break out into a separate package.

One thing to note is that we ought to be able to use this infrastructure to build another package that uses CPU features that the numpy binary itself was not compiled to use. I don't see why that would not be the case in this PR, but I just wanted it stated explicitly. Organizing this as a separate library may help enforce that when future modifications are made.

Sure, it's a great idea, and it seems the community prefers that way (a separate package). But every project has its own needs, which means finding compatibilities and a flexible design, which means more time and more issues. I think we should focus on the main purpose instead, at least for now, and have some fun optimizing kernels :).

@rgommers
Member

rgommers commented Jul 4, 2019

if it's necessary having a custom CI using "buildbot" could be an alternative solution too.

No, not necessary, and in fact very much undesired. We used to have those in the pre-TravisCI age and they're a maintenance nightmare. We'll just cover what we can with the hosted CI platforms we have, and the rest we test manually at PR develop/merge time and rely on community input to keep things working. That's the way it is now too for AIX, Sparc, etc. We'll just slowly expand coverage as we go.

@seiko2plus seiko2plus force-pushed the core_improve_infa_build branch 3 times, most recently from c1c7d9e to 8802a35 Compare August 22, 2019 13:40
  Add new C header located at `core/src/common/npy_cpu_dispatch.h`

    the new header works as a bridge linking the headers and macros
    generated by `CCompilerOpt` with the NumPy runtime CPU feature
    detection API, by providing several C macros that can be used to
    dispatch the generated objects from dispatch-able sources.
  - Add new attributes to umath module

    * __cpu_baseline__ a list containing the minimal set of required optimizations
      supported by the compiler and platform, according to the values specified
      for the command argument '--cpu-baseline'.

    * __cpu_dispatch__ a list containing the dispatched set of additional optimizations
      supported by the compiler and platform, according to the values specified
      for the command argument '--cpu-dispatch'.

  - Print required and additional optimizations during the run of PytestTester
   Add a testing unit for the utilities of the CPU dispatcher
@seiko2plus seiko2plus force-pushed the core_improve_infa_build branch from ccfad27 to 93fee4e Compare June 16, 2020 16:42
@seiko2plus seiko2plus force-pushed the core_improve_infa_build branch from 93fee4e to e726538 Compare June 16, 2020 16:44
@mattip mattip merged commit 8245b39 into numpy:master Jun 17, 2020
@mattip
Member

mattip commented Jun 17, 2020

Thanks @seiko2plus. Step 1 is done!

@seiko2plus
Member Author

I started to love the purple color, thank you!

@charris
Member

charris commented Jul 4, 2020

@seiko2plus This PR raises compiler warnings with fedora gcc 10 when AVX is missing.

/home/charris/Workspace/numpy.git/numpy/distutils/checks/cpu_avx512f.c:5:13: error: AVX512F vector return without AVX512F enabled changes the ABI [-Werror=psabi]
    5 |     __m512i a = _mm512_abs_epi32(_mm512_setzero_si512());
      |             ^
In file included from /usr/lib/gcc/x86_64-redhat-linux/10/include/immintrin.h:55,
                 from /home/charris/Workspace/numpy.git/numpy/distutils/checks/cpu_avx512f.c:1:
/usr/lib/gcc/x86_64-redhat-linux/10/include/avx512fintrin.h:15594:1: error: inlining failed in call to ‘always_inline’ ‘_mm512_castsi512_si128’: target specific option mismatch
15594 | _mm512_castsi512_si128 (__m512i __A)
      | ^~~~~~~~~~~~~~~~~~~~~~

EDIT: Also for other avx checks. Seems the checks should not be compiled without AVX.

@seiko2plus
Member Author

@charris, class CCompilerOpt performs two tests for each required feature to determine compiler support:

  • test the flags without any involved intrinsics, to determine the availability of the flags.
  • test the source file with or without the flags, depending on the availability of the flags (friendly to MSVC).

If both tests fail, the feature is skipped; but since we're dealing with GCC 10, both tests should pass.
According to the provided error, CCompilerOpt was testing the check file cpu_avx512f.c without providing the AVX512F flag -mavx512f, which means the flags test failed.

However, I tested with GCC 10.1.0 and it looks fine to me.
Could you please provide the build log, or at least the failure part related to testing the flag -mavx512f?
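The two-stage probe described above can be mimicked from the command line. The following is a standalone sketch using subprocess, not CCompilerOpt's actual code; it assumes a gcc-compatible `cc` (or `gcc`) is on PATH and simply reports whether each stage succeeded.

```python
import os
import shutil
import subprocess
import tempfile

def probe(flag, source):
    """Stage 1: does the compiler accept `flag` on an empty program?
    Stage 2: does `source` compile, with the flag only if stage 1 passed?
    Returns True/False, or None when no compiler is available."""
    cc = shutil.which("cc") or shutil.which("gcc")
    if cc is None:
        return None
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "test.c")
        obj = os.path.join(tmp, "test.o")
        # Stage 1: trivial translation unit, flag only.
        with open(src, "w") as f:
            f.write("int main(void){return 0;}\n")
        flag_ok = subprocess.run([cc, flag, "-c", src, "-o", obj],
                                 capture_output=True).returncode == 0
        # Stage 2: the real feature test file, with the flag when available.
        with open(src, "w") as f:
            f.write(source)
        args = [cc] + ([flag] if flag_ok else []) + ["-c", src, "-o", obj]
        return subprocess.run(args, capture_output=True).returncode == 0
```

With this scheme, charris's error corresponds to stage 2 being run without `-mavx512f` because stage 1 was judged to have failed.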

@charris
Member

charris commented Jul 8, 2020

@seiko2plus I don't see that warning anymore, may have changed with a recent kernel update. Let me try with an older kernel.

@charris
Member

charris commented Jul 8, 2020

No, they are still there, but tests run fine. I think the errors are configuration output that runtests.py should not be showing, so this is probably some sort of build organization problem. Are those the sort of errors you would expect during configuration? @mattip IIRC, you were the one who suppressed the configuration output, thoughts?

@mattip
Member

mattip commented Jul 8, 2020

Could you put the complete build log up somewhere like https://paste.ubuntu.com/ ?

@charris
Member

charris commented Jul 8, 2020

@mattip Here: https://paste.ubuntu.com/p/VH7x8qJkVS/ . Search for error:.

@seiko2plus
Member Author

seiko2plus commented Jul 8, 2020

@charris, hmm, it seems '-march=native' is provided through the environment variable CFLAGS, or it is part of the distutils config vars. See the final build log:

CPU baseline  : 
  Requested   : 'native'
  Enabled     : SSE SSE2 SSE3 SSSE3 SSE41 POPCNT SSE42 AVX F16C FMA3 AVX2
  Flags       : -msse -msse2 -msse3 -mssse3 -msse4.1 -mpopcnt -msse4.2 -mavx -mf16c -mfma -mavx2

CPU dispatch  : 
  Requested   : 'max -xop -fma4'
  Enabled     : AVX512F AVX512CD AVX512_KNL AVX512_KNM AVX512_SKX AVX512_CLX AVX512_CNL AVX512_ICL
  Generated   : none
CCompilerOpt._cache_write[786] : write cache to path -> /home/charris/Workspace/numpy.git/build/temp.linux-x86_64-3.9/ccompiler_opt_cache_ext.py

########### CLIB COMPILER OPTIMIZATION ###########
CPU baseline  : 
  Requested   : 'native'
  Enabled     : SSE SSE2 SSE3 SSSE3 SSE41 POPCNT SSE42 AVX F16C FMA3 AVX2
  Flags       : -msse -msse2 -msse3 -mssse3 -msse4.1 -mpopcnt -msse4.2 -mavx -mf16c -mfma -mavx2

CPU dispatch  : 
  Requested   : 'max -xop -fma4'
  Enabled     : AVX512F AVX512CD AVX512_KNL AVX512_KNM AVX512_SKX AVX512_CLX AVX512_CNL AVX512_ICL
  Generated   : none
CCompilerOpt._cache_write[786] : write cache to path -> /home/charris/Workspace/numpy.git/build/temp.linux-x86_64-3.9/ccompiler_opt_cache_clib.py

The failure messages are normal in this case, since CCompilerOpt was testing all features with the native flag to determine
the features supported by the running machine at build time.
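The rule in play here (also stated in the NOTES above) can be sketched as follows. The function name and structure are illustrative, not CCompilerOpt's actual code: if a native flag is smuggled in via CFLAGS, the effective baseline becomes "native" regardless of --cpu-baseline.

```python
import os

# Compiler flags that request the host machine's full feature set.
NATIVE_FLAGS = ("-march=native", "-xHost", "/QxHost")

def effective_baseline(cli_value="min"):
    """Return the baseline that would actually be requested: 'native' when
    CFLAGS carries a native flag, otherwise the --cpu-baseline value."""
    cflags = os.environ.get("CFLAGS", "").split()
    if any(flag in cflags for flag in NATIVE_FLAGS):
        return "native"
    return cli_value
```

This is consistent with the log above showing `Requested : 'native'` even though --cpu-baseline was left at its default.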

@mattip
Member

mattip commented Jul 8, 2020

This is part of "build_clib" (line 627). The configuration work is done during "build_src" (line 135). Maybe we should apply the same tricks in CCompilerOpt to ignore warnings that we apply during build_src.

@seiko2plus
Member Author

@mattip, why should we suppress the error messages?

@seiko2plus
Member Author

@charris,

I think the errors are configuration output that runtests.py should not be showing, so this is probably some sort of build organization problem.

Yes, it should be shown if --show-build-log is enabled.

Are those the sort of errors you would expect during configuration?

Yes, when the compiler doesn't support a certain feature.

").*"
)
@staticmethod
def _dist_test_spawn(cmd, display=None):
Member Author


@mattip, @charris, we can suppress errors from here. If it's necessary, I can add support for verbose levels.

@mattip
Member

mattip commented Jul 10, 2020

I am not sure whether we should suppress the warnings when someone specifies -march=native or not.

Labels
01 - Enhancement component: numpy.distutils component: SIMD Issues in SIMD (fast instruction sets) code or machinery