
Fixes to mean and better precision on sum (cpu) #3687


Open: wants to merge 2 commits into base master

Conversation

willyborn
Contributor

When strided weights were used, the mean function typically returned random results.
The result of mean no longer depends on platform or processing style (linear or strided, on CPU or GPU).

Description

b74f28f sum on CPU now has the same precision as on the other platforms

  • Introduction of the Kahan summation algorithm for series. This algorithm reduces the error when a long series of additions is performed on values of varying magnitude.
    With this change, the difference between the results for a linear array and a strided array is minimal, so all platforms now generate similar results, independent of parallel or serial processing, and the allowed tolerance of many tests could be tightened.

9307d6f Fixes to mean on all platforms
All:

  • Introduction of the Kahan summation algorithm for series. This algorithm reduces the error when a long series of additions is performed on values of varying magnitude.
    With this change, the difference between the results for a linear array and a strided array is minimal, and the precision of the tests is improved accordingly.
  • Negative weights can no longer cause a division by zero.
  • In most cases, sub-arrays generated random results when strided weights were used.

CPU:

  • The indices of the weights array are now calculated from its own strides instead of the strides of the input array.

CUDA:

  • Param objects were used without full initialisation. Because direct construction of Param is deprecated, they are replaced by createEmptyArray() constructions.
  • Indices of the weights array are now calculated from its own strides; the same applies to the offsets.
  • Small arrays are processed on the CPU. Strided arrays are now processed on the GPU regardless of size, since the CPU version only handles linear arrays.

ONEAPI:

  • Indices of the weights array are now calculated from its own strides; the same applies to the offsets.
  • Param objects were used without full initialisation. Because direct construction of Param is deprecated, they are replaced by createEmptyArray() constructions.
  • Small arrays are processed on the CPU. Strided arrays are now processed on the GPU regardless of size, since the CPU version only handles linear arrays.
  • Offsets are now handled in the linear processing path.

OPENCL:

  • Indices of the weights array are now calculated from its own strides; the same applies to the offsets.
  • When converting a linear array to a vector, the weight offsets were overwritten.
  • The tmpOut and tmpWeight arrays are now correctly sized. Intermediate results were being overwritten, resulting in the mean of only part of the array.
  • Small arrays are processed on the CPU. Strided arrays are now processed on the GPU regardless of size, since the CPU version only handles linear arrays.
  • Param objects were used without full initialisation. Because direct construction of Param is deprecated, they are replaced by createEmptyArray() constructions.

TEST:
- Extra tests added for all temporary formats.
- The allowed fault tolerance of the tests is lowered, since the correct mean is now calculated on all backends.
- MeanAllTest now uses random data instead of constants for testing; the constants masked the detection of partial processing of the input.

Additional information about the PR answering following questions:

  • Is this a new feature or a bug fix? BUG
  • More detail if necessary to describe all commits in pull request.
  • Why these changes are necessary. CORRECT RESULTS WERE NOT GUARANTEED
  • Potential impact on specific hardware, software or backends. NONE
  • New functions and their functionality. NONE
  • Can this PR be backported to older versions? NO
  • Future changes not implemented in this PR. NO
    Fixes:

Changes to Users

Better precision in all circumstances, independent of platform or array format.

Checklist

  • Rebased on latest master
  • Code compiles
  • Tests pass
  • Functions added to unified API
  • Functions documented

@christophe-murphy
Contributor

Thanks for this willyborn. FYI, our Buildbot CI system failed to build the PR with the following error:

/var/lib/buildbot/workers/ubuntu2204-cuda-toolkit-12-9/linux-all/build/src/backend/cuda/kernel/mean.hpp(65): error: more than one operator "*" matches these operands:
            built-in operator "arithmetic * arithmetic"
            function "operator*(const __half &, const __half &)" (declared at line 223 of /usr/local/cuda/include/cuda_fp16.hpp)
        operand types are: __half * float
    To y = (rhs - (*lhs)) * r_wt / (*l_wt) - (*c);

@willyborn
Contributor Author

I'm on Ubuntu 22.04 LTS with GCC, and I cannot reproduce the warning/error you mentioned.
As a result, I encapsulated every element inside stable_mean so that this kind of ambiguity for the compiler is eliminated. I hope this solves the error.

PS: Just to be certain, I made the same change on all platforms.

@christophe-murphy
Contributor

This is happening when NVCC compiles the stable_mean CUDA device kernel. Which version of CUDA are you using? Could you try with 12.9?

@willyborn
Copy link
Contributor Author

willyborn commented Aug 8, 2025

I am using CUDA 12.9.86 on Ubuntu 24.04.3 LTS, kernel 6.14.
I have a GTX 1080, which is compute architecture 6.1 (stores halves, calculates in float).

As soon as I included all major architectures in the nvcc options, I got the error as well; it is now fixed and committed.
