Description
The shader ElementWise_Identity_256_Strided4D_uint16_native (Shader ID: 7633) is causing 563 single-shader DML Operator Test failures.
A "single-shader DML Operator Test" is a DML Operator Test that uses exactly one shader compiled by clang-dxc, but may use any number of fxc-compiled shaders. Because ElementWise_Identity_256_Strided4D_uint16_native is the only clang-dxc-compiled shader in each of these tests, it is certainly the sole cause of these test failures.
Furthermore, a total of 10800 failing DML Operator Tests use the ElementWise_Identity_256_Strided4D_uint16_native shader.
Test Group 1 (561 tests)
Some failing single-shader DML Operator Tests:
- OperatorTests::ElementWise_Identity#31
- OperatorTests::ElementWise_Identity_Transpose#metadataSet0#29
- OperatorTests::ElementWise_LogicalEquals#172
- OperatorTests::ElementWise_Pow#37
- OperatorTests::ElementWise_IsNan#2
- OperatorTests::ElementWise_If#29
- OperatorTests::ElementWise_IsInfinity#2
- OperatorTests::ElementWise_BitCount#100
- OperatorTests::Slice#35
- OperatorTests::SliceSimple#24
- OperatorTests::Cast#327
- OperatorTests::Join#17
- OperatorTests::Split#17
- OperatorTests::ConvolutionDepthwise#188
- OperatorTests::ConvolutionBasicGemm#186
- OperatorTests::FillValueSequence#4
- OperatorTests::MatrixMultiplyIntegerToFloatDefault#25
- OperatorTests::LayoutTransformedConvolutionDefault#metadataSet0#17
Test results on machines:
- clang-dml01 (AMD): Fail
- clang-dml02 (NVIDIA): Fail
- clang-dml03 (Intel): Fail
- local (WARP): Fail
Reproduction:
❯ ./TE.exe DirectML.Test.OperatorTests.dll /logOutput:low /p:DisableMetacommands=1 /name:"OperatorTests::LayoutTransformedConvolutionDefault#metadataSet0#17"
Test Authoring and Execution Framework v10.72 for x64
StartGroup: OperatorTests::LayoutTransformedConvolutionDefault#metadataSet0#17
Error: Output Tensor #0:
Error: Tensor Sizes: 4,1,35
Error: Tensor Data Type: float16
Error: Index: 0001 @00000010 [0,0,1]. Ref: 0.4689941406 (0x3781). DML: -nan (0xFFFF). Abs: nan. Rel: nan%. Ulp: 51326
Error: Index: 0002 @00000020 [0,0,2]. Ref: 0.4731445312 (0x3792). DML: -nan (0xFFFF). Abs: nan. Rel: nan%. Ulp: 51309
Error: Index: 0003 @00000030 [0,0,3]. Ref: 0.4912109375 (0x37DC). DML: -nan (0xFFFF). Abs: nan. Rel: nan%. Ulp: 51235
Error: Index: 0004 @00000040 [0,0,4]. Ref: 0.4916992188 (0x37DE). DML: -nan (0xFFFF). Abs: nan. Rel: nan%. Ulp: 51233
Error: Index: 0005 @00000050 [0,0,5]. Ref: 0.4812011719 (0x37B3). DML: -nan (0xFFFF). Abs: nan. Rel: nan%. Ulp: 51276
Error: Index: 0006 @00000060 [0,0,6]. Ref: 0.5278320312 (0x3839). DML: -nan (0xFFFF). Abs: nan. Rel: nan%. Ulp: 51142
Error: Index: 0007 @00000070 [0,0,7]. Ref: 0.5302734375 (0x383E). DML: -nan (0xFFFF). Abs: nan. Rel: nan%. Ulp: 51137
Error: Index: 0008 @00000080 [0,0,8]. Ref: 0.5830078125 (0x38AA). DML: -nan (0xFFFF). Abs: nan. Rel: nan%. Ulp: 51029
Error: Index: 0009 @00000090 [0,0,9]. Ref: 0.6674804688 (0x3957). DML: -nan (0xFFFF). Abs: nan. Rel: nan%. Ulp: 50856
Error: 139 / 140 (99.285714%) of elements were found to be above tolerance.
Error: Max absolute delta: 0.000244. Allowed absolute tolerance: 0.002000.
Error: Max relative delta: 0.054025%. Allowed relative tolerance: 0.040000%.
Error: Max ULP delta: 51326. Allowed tolerance: 3 ULPs (float16).
Error: Verify: Fail [File: C:\workspace\DirectML\SharedToolingLib\External\Test\TaefHelper\TaefHelper.cpp, Function: TaefHelper::Fail, Line: 133]
EndGroup: OperatorTests::LayoutTransformedConvolutionDefault#metadataSet0#17 [Failed]
Summary of Non-passing Tests:
OperatorTests::LayoutTransformedConvolutionDefault#metadataSet0#17 [Failed]
Summary: Total=1, Passed=0, Failed=1, Blocked=0, Not Run=0, Skipped=0
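The DML value 0xFFFF in the log above is a float16 NaN bit pattern (all exponent bits set, non-zero mantissa). For the lines shown, the reported Ulp figure also equals the plain integer difference between the DML and Ref bit patterns. The Python sketch below checks both observations. The helper names are illustrative, and the delta formula is inferred from these log lines, not taken from the test harness source.

```python
import math
import struct


def half_from_bits(bits: int) -> float:
    """Reinterpret a 16-bit pattern as an IEEE 754 binary16 value."""
    return struct.unpack("<e", struct.pack("<H", bits))[0]


def raw_bit_delta(ref_bits: int, dml_bits: int) -> int:
    """Absolute difference of the raw bit patterns (assumed ULP metric)."""
    return abs(dml_bits - ref_bits)


# (Ref bits, DML bits, Ulp reported) copied from the first three error lines.
samples = [
    (0x3781, 0xFFFF, 51326),
    (0x3792, 0xFFFF, 51309),
    (0x37DC, 0xFFFF, 51235),
]

for ref_bits, dml_bits, reported in samples:
    ref = half_from_bits(ref_bits)
    dml = half_from_bits(dml_bits)
    print(
        f"ref={ref:.10f} dml_is_nan={math.isnan(dml)} "
        f"delta={raw_bit_delta(ref_bits, dml_bits)} reported={reported}"
    )
```

Every failing element listed in the log carries the same 0xFFFF pattern, so the size of the Ulp delta reflects only the distance from the reference bits to 0xFFFF, not a near-miss in precision.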
Note: The test may also be run with WARP by changing the GPU adapter index: add the argument /p:GpuAdapterIndex=N, where N is the index of the Microsoft Basic Render Driver. Remove the argument /logOutput:low to see which GPU was selected, along with other potentially helpful information about the test run.
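When reproducing across several adapters, it can help to assemble the command line programmatically. The sketch below builds the argument list from the flags that appear in this issue only; `build_te_command` is a hypothetical helper, and the adapter index used in the example is a placeholder, since the index of the Microsoft Basic Render Driver differs per machine.

```python
from typing import List, Optional


def build_te_command(
    test_name: str,
    adapter_index: Optional[int] = None,
    low_log_output: bool = True,
) -> List[str]:
    """Build a TE.exe argument list using only the flags shown in this issue."""
    cmd = ["./TE.exe", "DirectML.Test.OperatorTests.dll"]
    if low_log_output:
        # Drop this flag to see which GPU was selected.
        cmd.append("/logOutput:low")
    cmd.append("/p:DisableMetacommands=1")
    # No shell quoting is needed when passing an argument list to a process.
    cmd.append(f"/name:{test_name}")
    if adapter_index is not None:
        cmd.append(f"/p:GpuAdapterIndex={adapter_index}")
    return cmd


# Placeholder index: look up the real WARP adapter index on your machine.
args = build_te_command(
    "OperatorTests::LayoutTransformedConvolutionDefault#metadataSet0#17",
    adapter_index=1,
    low_log_output=False,
)
print(" ".join(args))
```

The resulting list can be passed to `subprocess.run(args)` on a machine where TE.exe and the test DLL are in the working directory.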
The latest version of DML built with clang-dxc compiled shaders can be obtained from an internal ClangDML Azure pipeline via the published x64-win-redist-release-hlsl-clang pipeline artifact.
The latest validated DXIL shader binary can be obtained from another internal ClangDML Azure pipeline via the published ValidatedShaders pipeline artifact.
Test Group 2 (2 tests)
This group is interesting because these DML Operator Tests are also single-shader, but pass on all machines except the Intel GPU machine.
Failing single-shader DML Operator Tests:
- OperatorTests::ConvolutionGemmBackwardSplit#104
- OperatorTests::LayoutTransformedConvolutionDefault#metadataSet2#42
Test results on machines:
- clang-dml01 (AMD): Pass
- clang-dml02 (NVIDIA): Pass
- clang-dml03 (Intel): Fail
- local (WARP): Pass
Reproduction (on clang-dml03):
> ./TE.exe DirectML.Test.OperatorTests.dll /logOutput:low /p:DisableMetacommands=1 /name:"OperatorTests::ConvolutionGemmBackwardSplit#104" /p:GpuAdapterIndex=0
Test Authoring and Execution Framework v10.72 for x64
StartGroup: OperatorTests::ConvolutionGemmBackwardSplit#104
OutputDebugString: D3D12: Removing Device.
OutputDebugString: D3D12 WARNING: ID3D12Device::RemoveDevice: Device removal has been triggered for the following reason (DXGI_ERROR_DRIVER_INTERNAL_ERROR: There is strong evidence that the driver has performed an undefined operation; but it may be because the application performed an illegal or undefined operation to begin with.). [ EXECUTION WARNING #233: DEVICE_REMOVAL_PROCESS_POSSIBLY_AT_FAULT]
OutputDebugString: D3D12: **BREAK** enabled for the previous message, which was: [ WARNING EXECUTION #233: DEVICE_REMOVAL_PROCESS_POSSIBLY_AT_FAULT ]
Error: TAEF: [HRESULT 0x800706BE] A failure occurred while running a test operation: 'OperatorTests::ConvolutionGemmBackwardSplit'. (A crash with exception code 0x0000087A occurred in module "KERNELBASE.dll" in the process hosting the test code while invoking a test operation.)
EndGroup: OperatorTests::ConvolutionGemmBackwardSplit#104 [Failed]
TestSkipped: TAEF: The cleanup method 'OperatorTests::MethodCleanup' will not be run as TAEF has stopped communicating with the test host process due to a previous failure.
TestSkipped: TAEF: The cleanup method 'DllCleanup' will not be run as TAEF has stopped communicating with the test host process due to a previous failure.
Summary of Non-passing Tests:
OperatorTests::ConvolutionGemmBackwardSplit#104 [Failed]
Summary: Total=1, Passed=0, Failed=1, Blocked=0, Not Run=0, Skipped=0