ENH: enable multi-platform SIMD compiler optimizations #13516
Conversation
I am concerned all this will become a maintenance burden moving forward. Is there a way to use this as a library or a subrepo?
What does this comment about library or subrepo mean? This sounds very hostile to NumPy supporting other architectures.
@mattip this is quite clearly addressed in the discussion that spurred this PR: gh-13393. And we also discussed it with the Steering Council before saying yes to posting the bounty mentioned in gh-13393. So yes it's a concern, but if things are done the right way there's no reason that this should substantially increase the maintenance burden. Please review that issue and comment on code or ideas that are potentially problematic, no need to start a new abstract discussion.
@edelsohn that's certainly not our intent as a project. We'd really like to have complete SIMD support for PowerPC as well as other architectures, with the compiler doing most of the heavy lifting.
That said, this is quite a bit of code. I believe a lot of it is well-tested though, taken over from OpenCV (correct me if I'm wrong @seiko2plus). Some more context in the issue description would be useful:
It is simply another way to organize the code. NumPy already has two submodules that provide tools to build the documentation; putting the code in a submodule allows NumPy to use the code while letting development proceed at its own pace. Likewise, NumPy builds several libraries for the sorting and math code, probably not strictly necessary these days, but routines in the libraries are linked in when needed and, strictly speaking, could be used by downstream projects. You have a number of new files dedicated to this work, so it might be nice to put them together in subdirectories, maybe organized by architecture. Note that the libraries have their own setup and include files and configure the system. So the question here is how we should organize the code, not whether we should have it.
It seems to me, after a cursory inspection, that how you want to do the … On the other hand, I recognize that the … One thing to note is that we ought to be able to use this infrastructure to build another package that uses CPU features that the …
There is already a maintenance burden moving forward; I think what I'm doing here is just trying to ease that burden.
I'm afraid not, or at least not for now, because the universal intrinsics (mapped intrinsics) are going to be heavily expanded depending on the needs, and I think they should be part of NumPy's core. Later we should forbid, or at least reduce, the direct use of raw SIMD intrinsics and rely only on universal intrinsics for CPU optimizations.
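To illustrate the idea, a minimal sketch of what a mapped (universal) intrinsic could look like; the `npyv_*` names are hypothetical placeholders, while `_mm_add_ps`, `vaddq_f32`, and `vec_add` are the real platform intrinsics:

```c
/* Sketch of a "universal" (mapped) intrinsic: one name, several backends.
 * The npyv_* names are illustrative placeholders, not a final API. */
#if defined(__SSE2__)
    #include <emmintrin.h>
    typedef __m128 npyv_f32;
    #define npyv_add_f32 _mm_add_ps
#elif defined(__ARM_NEON)
    #include <arm_neon.h>
    typedef float32x4_t npyv_f32;
    #define npyv_add_f32 vaddq_f32
#elif defined(__VSX__)
    #include <altivec.h>
    typedef __vector float npyv_f32;
    #define npyv_add_f32 vec_add
#endif

/* A kernel written once against the mapped intrinsic compiles for any of
 * the targets above, without touching raw SIMD intrinsics directly. */
static inline npyv_f32 add_f32(npyv_f32 a, npyv_f32 b)
{
    return npyv_add_f32(a, b);
}
```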
Actually, most of the work here is mainly for x86, due to the need to deal with many SIMD extensions, compilers, and platforms, which makes it somewhat complicated. I don't see VSX or even NEON creating any obstacles here; maybe supporting big-endian mode for VSX will be a little challenging, but I'm planning to implement it in a separate pull request after I'm done re-implementing the current optimized kernels using the new SIMD interface.
Yes, well, I had to add some fundamental universal intrinsics just for testing purposes, but sure, I will add a lot more depending on the needs.
I believe we shouldn't put our trust in any intrinsics, even if they map directly to the native instruction set, since
Like I mentioned before in #13393, the whole roadmap is inspired by OpenCV, though not exactly the same design or code; I can tell the similarity is very high, especially in the infrastructure.
Of course, I will not be able to cover all areas in this PR, but I think it's necessary to provide a decent example for illustrative and testing purposes, and please feel free to add any extra tasks.
If it's necessary, having a custom CI using "buildbot" could be an alternative solution too.
Do you think the current work isn't organized enough?
Although this work is not completed yet, I would truly appreciate it if you could provide a proposal or any kind of notes.
Sure, it is a great idea, and it seems like the community prefers this way (a separate package). But as you know, every project has its own needs, which means you have to find compatibility and a flexible design, which means more time and more issues. I think we should focus on the main purpose instead, at least for now, and have some fun optimizing some kernels :).
No, not necessary, and in fact very much undesired. We used to have those in the pre-TravisCI age and they're a maintenance nightmare. We'll just cover what we can with the hosted CI platforms we have, and the rest we test manually at PR develop/merge time and rely on community input to keep things working. That's the way it is now too for AIX, Sparc, etc. We'll just slowly expand coverage as we go.
Add several C macros to `utils.h`
Add a new C header located at `core/src/common/npy_cpu_dispatch.h`. The new header works as a bridge linking the headers and macros generated by `CCompilerOpt` with NumPy's runtime CPU feature detection API, by providing several C macros that can be used to dispatch the objects generated from dispatch-able sources.
Add new attributes to the umath module: `__cpu_baseline__`, a list containing the minimal set of required optimizations supported by the compiler and platform according to the values specified for the command argument '--cpu-baseline'; `__cpu_dispatch__`, a list containing the dispatched set of additional optimizations supported by the compiler and platform according to the values specified for the command argument '--cpu-dispatch'. Also print the required and additional optimizations during the run of PytestTester.
Add a testing unit for the utilities of the CPU dispatcher
Thanks @seiko2plus. Step 1 is done!
I started to love the purple color, thank you!
@seiko2plus This PR raises compiler warnings with Fedora GCC 10 when AVX is missing.
EDIT: Also for other AVX checks. It seems the checks should not be compiled without AVX.
@charris, class
If both tests fail, then the feature will be skipped, but since we're dealing with GCC 10, both tests should pass. However, I tested (GCC version 10.1.0) and it looks fine to me.
@seiko2plus I don't see that warning anymore, may have changed with a recent kernel update. Let me try with an older kernel.
No, they are still there, but tests run fine. I think the errors are configuration output that
Could you put the complete build log up somewhere like https://paste.ubuntu.com/ ?
@mattip Here: https://paste.ubuntu.com/p/VH7x8qJkVS/ . Search for
@charris, umm, well, it seems '-march=native' is provided through the environment variable CFLAGS, or it is part of the distutils config vars. See the final build log:
CPU baseline :
Requested : 'native'
Enabled : SSE SSE2 SSE3 SSSE3 SSE41 POPCNT SSE42 AVX F16C FMA3 AVX2
Flags : -msse -msse2 -msse3 -mssse3 -msse4.1 -mpopcnt -msse4.2 -mavx -mf16c -mfma -mavx2
CPU dispatch :
Requested : 'max -xop -fma4'
Enabled : AVX512F AVX512CD AVX512_KNL AVX512_KNM AVX512_SKX AVX512_CLX AVX512_CNL AVX512_ICL
Generated : none
CCompilerOpt._cache_write[786] : write cache to path -> /home/charris/Workspace/numpy.git/build/temp.linux-x86_64-3.9/ccompiler_opt_cache_ext.py
########### CLIB COMPILER OPTIMIZATION ###########
CPU baseline :
Requested : 'native'
Enabled : SSE SSE2 SSE3 SSSE3 SSE41 POPCNT SSE42 AVX F16C FMA3 AVX2
Flags : -msse -msse2 -msse3 -mssse3 -msse4.1 -mpopcnt -msse4.2 -mavx -mf16c -mfma -mavx2
CPU dispatch :
Requested : 'max -xop -fma4'
Enabled : AVX512F AVX512CD AVX512_KNL AVX512_KNM AVX512_SKX AVX512_CLX AVX512_CNL AVX512_ICL
Generated : none
CCompilerOpt._cache_write[786] : write cache to path -> /home/charris/Workspace/numpy.git/build/temp.linux-x86_64-3.9/ccompiler_opt_cache_clib.py
The failure messages are normal in this case since
This is part of "build_clib" (line 627). The configuration work is done during "build_src" (line 135). Maybe we should apply the same tricks during CCompilerOpt to ignore warnings that we do during build_src.
@mattip, why should we suppress the error messages?
Yes, it should be showing if
Yes, when the compiler doesn't support a certain feature.
").*"
)
@staticmethod | ||
def _dist_test_spawn(cmd, display=None): |
I am not sure whether we should suppress the warnings when someone specifies
This pull request changes:

**New Distutils class `CCompilerOpt`**

Used for handling CPU/hardware optimizations, starting from parsing the command arguments, through managing the relation between the CPU baseline and the dispatch-able features and generating the required C headers, and ending with compiling the sources with the proper compiler flags.

Class `CCompilerOpt` doesn't provide runtime detection of CPU features; it only focuses on the compiler side, but it creates abstract C headers that can be used later for the final runtime dispatching process.
**New Build Arguments**

- `--cpu-baseline`: minimal set of required optimizations. The default value is `min`, which provides the minimum CPU features that can safely run on a wide range of user platforms.
- `--cpu-dispatch`: dispatched set of additional optimizations. The default value is `max -xop -fma4`, which enables all CPU features except for AMD legacy features.

The new arguments can be reached through `build`, `build_clib`, and `build_ext`. If `build_clib` or `build_ext` are not specified by the user, the arguments of `build` will be used instead, which also hold the default values.

Both `--cpu-baseline` and `--cpu-dispatch` accept the following CPU features:

- `X86`: `SSE`, `SSE2`, `SSE3`, `SSSE3`, `SSE41`, `POPCNT`, `SSE42`, `AVX`, `F16C`, `XOP`, `FMA4`, `FMA3`, `AVX2`, `AVX512F`, `AVX512CD`, `AVX512_KNL`, `AVX512_KNM`, `AVX512_SKX`, `AVX512_CLX`, `AVX512_CNL`, `AVX512_ICL`
- `IBM/Power64`: `VSX`, `VSX2`, `VSX3`
- `ARM7/8`: `NEON`, `NEON_FP16`, `NEON_VFPV4`, `ASIMD`, `ASIMDHP`, `ASIMDDP`, `ASIMDFHM`
Other acceptable options:

- `MIN`: enables the minimum CPU features that can safely run on a wide range of user platforms:
  - `X86`: `SSE`, `SSE2`
  - `X86_64`: `SSE`, `SSE2`, `SSE3`
  - `IBM/Power64 (big-endian)`: nothing
  - `IBM/Power64 (little-endian)`: `VSX`, `VSX2`
  - `ARM7`: nothing
  - `ARM8 (64-bit)`: `NEON`, `NEON_FP16`, `NEON_VFPV4`, `ASIMD`
- `MAX`: enables all CPU features supported by the compiler and platform.
- `NATIVE`: enables all CPU features supported by the current machine; this option is based on the compiler's native flags (`-march=native`, `-xHost`, `/QxHost`).
- `NONE`: enables nothing.
- Operands `+`/`-`: add or remove features; useful with the options `MAX`, `MIN`, and `NATIVE`.
  NOTE: the operand `+` is only added for nominal reasons. For example:
  `--cpu-baseline="min avx2"` is equivalent to `--cpu-baseline="min + avx2"`
  `--cpu-baseline="min,avx2"` is equivalent to `--cpu-baseline="min,+avx2"`

NOTES:

- `--cpu-dispatch`="avx2 avx512f" or `--cpu-dispatch`="avx2, avx512f" are both applicable.
- … raising a fatal error.
- Any feature specified in `--cpu-dispatch` will be skipped if it is part of the CPU baseline features.
- `--cpu-baseline` force-enables implied features, e.g. `--cpu-baseline="sse42"` is equivalent to `--cpu-baseline="sse sse2 sse3 ssse3 sse41 popcnt sse42"`.
- `--cpu-baseline` will be treated as "native" if a compiler native flag (`-march=native`, `-xHost`, or `/QxHost`) is enabled through the environment variable `CFLAGS`.
**Explicitly disable the new infrastructure**

Add a new command argument `--disable-optimization` to explicitly disable the whole new infrastructure. It also adds a new compiler definition called `NPY_DISABLE_OPTIMIZATION`.

When `--disable-optimization` is enabled, the dispatch-able sources `.dispatch.c` (explained in this comment) will be treated as normal C sources. Also, because of this disabling, any C headers generated by `CCompilerOpt` must be guarded with `NPY_DISABLE_OPTIMIZATION`, otherwise it will definitely break the build.
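For illustration only, a minimal sketch of such a guard; the header name `simd.dispatch.h` is a hypothetical stand-in for a header generated by `CCompilerOpt`:

```c
/* Only pull in the generated configuration when the new infrastructure is
 * enabled; with --disable-optimization the generated header does not exist. */
#ifndef NPY_DISABLE_OPTIMIZATION
    #include "simd.dispatch.h"  /* hypothetical header generated by CCompilerOpt */
#endif
```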
**New auto-generated C header (`core/src/common/_cpu_dispatch.h`)**

The new header contains all the definitions and headers of instruction-sets for the CPU baseline and dispatch-able features that have been enabled through the command arguments `--cpu-baseline` and `--cpu-dispatch`.

`NPY_HAVE_` is the prefix of the CPU feature definitions, e.g. `NPY_HAVE_SSE2`, `NPY_HAVE_VSX`, `NPY_HAVE_NEON`.

The new header can be included inside normal C files or dispatch-able C sources (explained below); however, if the new header is included inside a normal C file, then it will only provide the definitions and headers of instruction-sets for the CPU baseline features. For example:
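A minimal illustrative sketch (not taken from the PR), assuming `SSE2` belongs to the baseline while `AVX2` is only a dispatch-able feature:

```c
#include "_cpu_dispatch.h"

void some_normal_c_source(void)
{
#ifdef NPY_HAVE_SSE2
    /* Defined here: SSE2 is part of the CPU baseline, so its definition and
     * instruction-set header are visible in a normal C file. */
#endif
#ifdef NPY_HAVE_AVX2
    /* Not defined in a normal C file: AVX2 is only a dispatch-able feature,
     * so it is exposed only inside dispatch-able sources built for AVX2. */
#endif
}
```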
NOTE: it should not be used directly, but through another new header called `npy_cpu_dispatch.h`.
**New C header (`core/src/common/npy_cpu_dispatch.h`)**

This header contains all the utilities required for the whole CPU dispatching process; it can also be considered a bridge linking the new infrastructure with NumPy's runtime CPU detection (#13421).
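To sketch the kind of bridge this enables, a compiled variant is selected once at runtime, based on the detected CPU features, and used from then on; the names below are hypothetical placeholders, not the actual macros of `npy_cpu_dispatch.h`:

```c
/* Hypothetical illustration of runtime dispatch; none of these names are the
 * real API. The real glue lives in npy_cpu_dispatch.h and the runtime CPU
 * detection work of #13421. */
typedef void (*add_f32_fn)(float *dst, const float *src, int len);

void add_f32_avx2(float *dst, const float *src, int len);     /* built from a dispatch-able source */
void add_f32_baseline(float *dst, const float *src, int len); /* built with the CPU baseline */

int cpu_has_avx2(void); /* placeholder for the runtime CPU feature check */

static add_f32_fn resolve_add_f32(void)
{
    /* Prefer the highest dispatched target the running machine supports,
     * otherwise fall back to the baseline build. */
    if (cpu_has_avx2()) {
        return add_f32_avx2;
    }
    return add_f32_baseline;
}
```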
**New CPU dispatcher solution (dispatch-able sources)**

Explained in this comment.

**Add new attributes to the umath module**

- `__cpu_baseline__`: a list of the CPU baseline feature names configured by `--cpu-baseline`
- `__cpu_dispatch__`: a list of the CPU dispatch feature names configured by `--cpu-dispatch`

The new attributes contain the final enabled CPU features that are supported by the platform.

**Print the supported CPU features during the run of PytestTester**

This could be one of the easiest ways to trace the enabled features during the run of the unit tests. An example on GCC 8.3.0 (64-bit) - i7-8550U:

The format is explained as follows:

- Dispatched features that are supported by the running machine end with `*`.
- Dispatched features that are not supported by the running machine end with `?`.
- The remaining features represent the baseline.

If any features are missing, that is because they are not supported by the compiler or platform, or were not requested through `--cpu-baseline` and `--cpu-dispatch`.