[HLSL] DML shader ElementWise_Identity_256_Strided4D_uint16_native causing 10800 DML Operator Test failures #155919

@Icohedron

Description

The shader ElementWise_Identity_256_Strided4D_uint16_native (Shader ID: 7633) is causing 563 single-shader DML Operator Test failures.

A "single-shader DML Operator Test" is a DML Operator Test that uses exactly one shader compiled by clang-dxc, alongside any number of fxc-compiled shaders. Since ElementWise_Identity_256_Strided4D_uint16_native is the only clang-dxc-compiled shader in each of these tests, it is certainly the sole cause of these test failures.

Furthermore, there are a total of 10800 failing DML Operator Tests using the ElementWise_Identity_256_Strided4D_uint16_native shader.

Test Group 1 (561 tests)

Some failing single-shader DML Operator Tests:

  • OperatorTests::ElementWise_Identity#31
  • OperatorTests::ElementWise_Identity_Transpose#metadataSet0#29
  • OperatorTests::ElementWise_LogicalEquals#172
  • OperatorTests::ElementWise_Pow#37
  • OperatorTests::ElementWise_IsNan#2
  • OperatorTests::ElementWise_If#29
  • OperatorTests::ElementWise_IsInfinity#2
  • OperatorTests::ElementWise_BitCount#100
  • OperatorTests::Slice#35
  • OperatorTests::SliceSimple#24
  • OperatorTests::Cast#327
  • OperatorTests::Join#17
  • OperatorTests::Split#17
  • OperatorTests::ConvolutionDepthwise#188
  • OperatorTests::ConvolutionBasicGemm#186
  • OperatorTests::FillValueSequence#4
  • OperatorTests::MatrixMultiplyIntegerToFloatDefault#25
  • OperatorTests::LayoutTransformedConvolutionDefault#metadataSet0#17

Test results on machines:

  • clang-dml01 (AMD): Fail
  • clang-dml02 (NVIDIA): Fail
  • clang-dml03 (Intel): Fail
  • local (WARP): Fail

Reproduction:

❯ ./TE.exe DirectML.Test.OperatorTests.dll /logOutput:low /p:DisableMetacommands=1 /name:"OperatorTests::LayoutTransformedConvolutionDefault#metadataSet0#17"                      
Test Authoring and Execution Framework v10.72 for x64

StartGroup: OperatorTests::LayoutTransformedConvolutionDefault#metadataSet0#17
Error: Output Tensor #0:
Error: Tensor Sizes: 4,1,35
Error: Tensor Data Type: float16
Error: Index: 0001 @00000010 [0,0,1].  Ref: 0.4689941406 (0x3781).  DML: -nan (0xFFFF).  Abs: nan.  Rel: nan%.  Ulp: 51326
Error: Index: 0002 @00000020 [0,0,2].  Ref: 0.4731445312 (0x3792).  DML: -nan (0xFFFF).  Abs: nan.  Rel: nan%.  Ulp: 51309
Error: Index: 0003 @00000030 [0,0,3].  Ref: 0.4912109375 (0x37DC).  DML: -nan (0xFFFF).  Abs: nan.  Rel: nan%.  Ulp: 51235
Error: Index: 0004 @00000040 [0,0,4].  Ref: 0.4916992188 (0x37DE).  DML: -nan (0xFFFF).  Abs: nan.  Rel: nan%.  Ulp: 51233
Error: Index: 0005 @00000050 [0,0,5].  Ref: 0.4812011719 (0x37B3).  DML: -nan (0xFFFF).  Abs: nan.  Rel: nan%.  Ulp: 51276
Error: Index: 0006 @00000060 [0,0,6].  Ref: 0.5278320312 (0x3839).  DML: -nan (0xFFFF).  Abs: nan.  Rel: nan%.  Ulp: 51142
Error: Index: 0007 @00000070 [0,0,7].  Ref: 0.5302734375 (0x383E).  DML: -nan (0xFFFF).  Abs: nan.  Rel: nan%.  Ulp: 51137
Error: Index: 0008 @00000080 [0,0,8].  Ref: 0.5830078125 (0x38AA).  DML: -nan (0xFFFF).  Abs: nan.  Rel: nan%.  Ulp: 51029
Error: Index: 0009 @00000090 [0,0,9].  Ref: 0.6674804688 (0x3957).  DML: -nan (0xFFFF).  Abs: nan.  Rel: nan%.  Ulp: 50856
Error: 139 / 140 (99.285714%) of elements were found to be above tolerance.
Error: Max absolute delta: 0.000244.  Allowed absolute tolerance: 0.002000.
Error: Max relative delta: 0.054025%.  Allowed relative tolerance: 0.040000%.
Error: Max ULP delta: 51326.  Allowed tolerance: 3 ULPs (float16).
Error: Verify: Fail [File: C:\workspace\DirectML\SharedToolingLib\External\Test\TaefHelper\TaefHelper.cpp, Function: TaefHelper::Fail, Line: 133]
EndGroup: OperatorTests::LayoutTransformedConvolutionDefault#metadataSet0#17 [Failed]

Summary of Non-passing Tests:
    OperatorTests::LayoutTransformedConvolutionDefault#metadataSet0#17 [Failed]

Summary: Total=1, Passed=0, Failed=1, Blocked=0, Not Run=0, Skipped=0
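
The `Ref`, `DML`, and `Ulp` columns in the log above can be sanity-checked by hand. The following is a minimal Python sketch, not the test harness's code. It assumes that the reported ULP delta is the absolute difference of the two raw float16 bit patterns; that is an observation from the logged numbers, not a confirmed description of how TaefHelper computes it. The helper names are hypothetical.

```python
import math
import struct


def half_from_bits(bits: int) -> float:
    """Reinterpret a 16-bit pattern as an IEEE 754 binary16 value."""
    return struct.unpack("<e", struct.pack("<H", bits))[0]


def is_half_nan(bits: int) -> bool:
    """A binary16 NaN has an all-ones exponent and a nonzero mantissa."""
    return (bits & 0x7C00) == 0x7C00 and (bits & 0x03FF) != 0


def raw_bit_distance(a_bits: int, b_bits: int) -> int:
    """Assumed ULP metric: absolute difference of the raw bit patterns."""
    return abs(a_bits - b_bits)


# (Ref bits, DML bits, logged Ulp) copied from three rows of the failure log.
rows = [
    (0x3781, 0xFFFF, 51326),
    (0x3792, 0xFFFF, 51309),
    (0x3957, 0xFFFF, 50856),
]

for ref_bits, dml_bits, logged_ulp in rows:
    ref_value = half_from_bits(ref_bits)
    dml_is_nan = is_half_nan(dml_bits) and math.isnan(half_from_bits(dml_bits))
    matches_log = raw_bit_distance(ref_bits, dml_bits) == logged_ulp
    print(f"ref={ref_value:.10f} dml_is_nan={dml_is_nan} ulp_matches_log={matches_log}")
```

The pattern 0xFFFF has every bit set, including the sign bit, which is consistent with the `-nan (0xFFFF)` values the log reports for each mismatching element.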

Note: The test may also be run with WARP by changing the GPU adapter index: add the argument /p:GpuAdapterIndex=N, where N is the index of the Microsoft Basic Render Driver. Remove the argument /logOutput:low to see which GPU was selected, along with other potentially helpful information, when running the test.
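
When re-running many of the listed tests across adapters, it can help to assemble the TE.exe arguments programmatically. The sketch below only builds the argument list from the flags shown in this issue and does not launch anything. The function name `build_te_command` and the adapter index `1` in the usage line are hypothetical; the real index of the Microsoft Basic Render Driver depends on the machine.

```python
from typing import List, Optional


def build_te_command(
    test_name: str,
    adapter_index: Optional[int] = None,
    low_log_output: bool = True,
) -> List[str]:
    """Build the TE.exe argument list used in this issue's repro steps."""
    args = ["./TE.exe", "DirectML.Test.OperatorTests.dll"]
    if low_log_output:
        # Omit this flag to see which GPU adapter was selected.
        args.append("/logOutput:low")
    args.append("/p:DisableMetacommands=1")
    args.append(f"/name:{test_name}")
    if adapter_index is not None:
        # Selects the adapter; use the WARP adapter's index to run on WARP.
        args.append(f"/p:GpuAdapterIndex={adapter_index}")
    return args


# Hypothetical usage: adapter index 1 is an assumption, not a known WARP index.
command = build_te_command(
    "OperatorTests::LayoutTransformedConvolutionDefault#metadataSet0#17",
    adapter_index=1,
    low_log_output=False,
)
print(" ".join(command))
```

The returned list can be passed as-is to `subprocess.run`, which avoids shell quoting problems with the `#` characters in the test names.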

The latest version of DML built with clang-dxc compiled shaders can be obtained from an internal ClangDML Azure pipeline via the published x64-win-redist-release-hlsl-clang pipeline artifact.

The latest validated DXIL shader binary can be obtained from another internal ClangDML Azure pipeline via the published ValidatedShaders pipeline artifact.

Test Group 2 (2 tests)

This group is interesting because these DML Operator Tests are also single-shader, but they pass on all machines except the Intel GPU machine.

Failing single-shader DML Operator Tests:

  • OperatorTests::ConvolutionGemmBackwardSplit#104
  • OperatorTests::LayoutTransformedConvolutionDefault#metadataSet2#42

Test results on machines:

  • clang-dml01 (AMD): Pass
  • clang-dml02 (NVIDIA): Pass
  • clang-dml03 (Intel): Fail
  • local (WARP): Pass

Reproduction (on clang-dml03):

> ./TE.exe DirectML.Test.OperatorTests.dll /logOutput:low /p:DisableMetacommands=1 /name:"OperatorTests::ConvolutionGemmBackwardSplit#104" /p:GpuAdapterIndex=0
Test Authoring and Execution Framework v10.72 for x64

StartGroup: OperatorTests::ConvolutionGemmBackwardSplit#104
OutputDebugString: D3D12: Removing Device.
OutputDebugString: D3D12 WARNING: ID3D12Device::RemoveDevice: Device removal has been triggered for the following reason (DXGI_ERROR_DRIVER_INTERNAL_ERROR: There is strong evidence that the driver has performed an undefined operation; but it may be because the application performed an illegal or undefined operation to begin with.). [ EXECUTION WARNING #233: DEVICE_REMOVAL_PROCESS_POSSIBLY_AT_FAULT]
OutputDebugString: D3D12: **BREAK** enabled for the previous message, which was: [ WARNING EXECUTION #233: DEVICE_REMOVAL_PROCESS_POSSIBLY_AT_FAULT ]
Error: TAEF: [HRESULT 0x800706BE] A failure occurred while running a test operation: 'OperatorTests::ConvolutionGemmBackwardSplit'. (A crash with exception code 0x0000087A occurred in module "KERNELBASE.dll" in the process hosting the test code while invoking a test operation.)
EndGroup: OperatorTests::ConvolutionGemmBackwardSplit#104 [Failed]
TestSkipped: TAEF: The cleanup method 'OperatorTests::MethodCleanup' will not be run as TAEF has stopped communicating with the test host process due to a previous failure.
TestSkipped: TAEF: The cleanup method 'DllCleanup' will not be run as TAEF has stopped communicating with the test host process due to a previous failure.

Summary of Non-passing Tests:
    OperatorTests::ConvolutionGemmBackwardSplit#104 [Failed]

Summary: Total=1, Passed=0, Failed=1, Blocked=0, Not Run=0, Skipped=0
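
The error codes in the log above can be split into their HRESULT fields to see which layer reported them. This is a small sketch under two assumptions: the 13-bit facility mask follows the `HRESULT_FACILITY` macro in winerror.h as I understand it, and 0x887A0020 is the documented value of DXGI_ERROR_DRIVER_INTERNAL_ERROR (the log names the error but does not print its number). The function name is hypothetical.

```python
def decode_hresult(hr: int) -> dict:
    """Split a 32-bit HRESULT into severity, facility, and code fields."""
    hr &= 0xFFFFFFFF
    return {
        "failure": bool(hr >> 31),        # severity bit: set means failure
        "facility": (hr >> 16) & 0x1FFF,  # mask assumed from HRESULT_FACILITY
        "code": hr & 0xFFFF,              # low 16 bits: facility-specific code
    }


# HRESULT reported by TAEF in the log.
taef = decode_hresult(0x800706BE)
# Assumed numeric value of DXGI_ERROR_DRIVER_INTERNAL_ERROR.
dxgi = decode_hresult(0x887A0020)

print(f"TAEF: facility={taef['facility']} code={taef['code']}")
print(f"DXGI: facility={dxgi['facility']:#x} code={dxgi['code']:#x}")
```

Facility 7 is FACILITY_WIN32, and Win32 error 1726 is RPC_S_CALL_FAILED, which fits the log's statement that TAEF stopped communicating with the test host process. In other words, the TAEF HRESULT describes the host process dying, while the device removal is the underlying event. The crash's exception code 0x0000087A has the same numeric value as the DXGI facility; whether that is by design is not something this log confirms.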

Metadata

Assignees: No one assigned
Labels: HLSL (HLSL Language Support), duplicate (Resolved as duplicate)
Status: Closed
Milestone: No milestone