
Refactor Blackwell #27537

Open · wants to merge 1 commit into base: 4.x

Conversation

@johnnynunez (Contributor) commented Jul 13, 2025

In CUDA 13:

  • 10.0 is B100/B200, same for aarch64 (GB200)
  • 10.3 is GB300
  • 11.0 is Thor with the new OpenRM driver (moves to SBSA)
  • 12.0 is RTX / RTX PRO
  • 12.1 is Spark GB10

Thor was moved from 10.1 to 11.0, and Spark is 12.1.
Related patch: pytorch/pytorch#156176
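The mapping above can be written down as data. A small sketch (product names and grouping are taken from this PR description, not from any official table):

```python
# Compute capability -> product mapping for CUDA 13, as listed in this
# PR description (illustrative only).
BLACKWELL_FAMILY = {
    "10.0": "B100/B200 (same for aarch64 GB200)",
    "10.3": "GB300",
    "11.0": "Thor (new OpenRM driver, moves to SBSA)",
    "12.0": "RTX / RTX PRO",
    "12.1": "Spark GB10",
}

# CMake lists are semicolon-separated, so the corresponding arch string is:
arch_bin = ";".join(sorted(BLACKWELL_FAMILY))
print(arch_bin)  # -> 10.0;10.3;11.0;12.0;12.1
```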

```diff
@@ -109,7 +109,7 @@ macro(ocv_initialize_nvidia_device_generations)
   set(_arch_ampere "8.0;8.6")
   set(_arch_lovelace "8.9")
   set(_arch_hopper "9.0")
-  set(_arch_blackwell "10.0;12.0")
+  set(_arch_blackwell "10.0;10.3;11.0;12.0;12.1")
```
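For context, each entry in that CMake list becomes its own `-gencode` pair when the build invokes nvcc, which is what the size concern below is about. A rough Python sketch of that expansion (a simplification, not the actual OpenCV macro):

```python
def gencode_flags(arch_bin: str) -> list[str]:
    """Expand a semicolon-separated arch list (CUDA_ARCH_BIN style)
    into nvcc -gencode flags. Simplified sketch of the CMake logic."""
    flags = []
    for cc in arch_bin.split(";"):
        sm = cc.replace(".", "")  # "10.3" -> "103"
        flags.append(f"-gencode arch=compute_{sm},code=sm_{sm}")
    return flags

for flag in gencode_flags("10.0;10.3;11.0;12.0;12.1"):
    print(flag)  # five separate kernel builds, one per entry
```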
Contributor:

Hm.. each option here is yet another kernel build. Longer build and fatter binary. I would say that we do not need all possible combinations if the target is defined by the arch name. "12.0;12.1" most probably should be just 12.1. The same for the 10.x variations, if pure 10.0 devices do not exist and 10.3 is the minimal production version.

Contributor Author:

Hello,
In CUDA 13:

  • 10.0 is B100/B200, same for aarch64 (GB200)
  • 10.3 is GB300
  • 11.0 is Thor with the new OpenRM driver (moves to SBSA)
  • 12.0 is RTX / RTX PRO
  • 12.1 is Spark GB10

Feel free to change it, or recommend the most optimized codegen.

Contributor Author:

Orin is moving to SBSA (deleting the old nvgpu driver) in Q1 2026. I don't know if it will maintain the same codegen.

Contributor Author:

There is a new option: adding the `f` suffix (e.g. `10.0f`) applies the optimizations to the whole Blackwell family.

Contributor:

@johnnynunez Do you have any official documentation mentioning the new compute capabilities and the f option?

Contributor Author:

> @johnnynunez Do you have any official documentation mentioning the new compute capabilities and the f option?

https://developer.nvidia.com/blog/nvidia-blackwell-and-nvidia-cuda-12-9-introduce-family-specific-architecture-features/
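Per that blog post, CUDA 12.9 introduced family-specific targets: compiling for e.g. `sm_100f` yields one binary usable across the 10.x family instead of one cubin per minor version. A hedged sketch of the flag construction (the helper is hypothetical; the flag shape follows the linked post):

```python
def family_gencode(cc: str) -> str:
    """Build an nvcc -gencode pair using the family-specific 'f' suffix
    (CUDA 12.9+). Hypothetical helper; one such target covers the whole
    major family, e.g. 10.0f covers 10.0 and 10.3 devices."""
    sm = cc.replace(".", "")
    return f"-gencode arch=compute_{sm}f,code=sm_{sm}f"

print(family_gencode("10.0"))  # -> -gencode arch=compute_100f,code=sm_100f
```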

@cudawarped (Contributor) commented Jul 14, 2025:

> Hm.. each option here is yet another kernel build. Longer build and fatter binary. I would say, that we do not need all possible combinations, if the target is defined by the arch name. "12.0;12.1" most probably should be just 12.1. The same for 10.x variations, if pure 10.0 devices do not exist and 10.3 is minimal production version.

@asmorkalov @johnnynunez @tomoaki0705 Firstly, I don't really know what the use case for specifying the CUDA architecture is. That said, if someone explicitly selects it, I don't see an issue in compiling for all architectures, because the user should be aware of the implications of their choice.

I think where the size is an issue is when the CUDA_ARCH_BIN and CUDA_ARCH_PTX are not specified and all supported architectures from all generations are built. In a future PR we could relax this and just build major architectures (5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 11.0 and 12.0), equivalent to the CMake all-major flag.

A better alternative might be to make the default CUDA_GENERATION=Auto, which I think makes much more sense. If a user hasn't explicitly specified the architecture, this is probably what they want. We could then have an additional option equivalent to the existing default (CUDA_GENERATION=All), or one for building compatibility with all major architectures (CUDA_GENERATION=All-Major).
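A `CUDA_GENERATION=Auto` default would presumably detect the local GPUs' compute capabilities, e.g. via `nvidia-smi --query-gpu=compute_cap --format=csv,noheader` (supported by recent nvidia-smi versions). A sketch of the parsing side only (the helper is an assumption for illustration, not OpenCV code):

```python
def arch_from_nvidia_smi(output: str) -> str:
    """Turn the output of
        nvidia-smi --query-gpu=compute_cap --format=csv,noheader
    (one capability per GPU, one per line) into a deduplicated,
    numerically sorted arch string. Hypothetical helper."""
    caps = {line.strip() for line in output.splitlines() if line.strip()}
    return ";".join(sorted(caps, key=float))

# Example: a machine with two RTX GPUs and one Hopper card.
print(arch_from_nvidia_smi("12.0\n12.0\n9.0\n"))  # -> 9.0;12.0
```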

@asmorkalov self-assigned this Jul 14, 2025

@asmorkalov (Contributor):

cc @cudawarped

@asmorkalov added this to the 4.13.0 milestone Jul 14, 2025
@cudawarped (Contributor):
> In CUDA 13:
>
> * 10.0 is b100/b200 same for aarch64 (gb200)
> * 10.3 is GB300
> * 11.0 is Thor with new OpenRm driver (moves to SBSA)
> * 12.0 is RTX/RTX PRO
> * 12.1 is Spark GB10

It looks like Nvidia have stuck to this. Output from nvcc from CUDA 13.0 below.

```
$ nvcc --list-gpu-code
sm_75
sm_80
sm_86
sm_87
sm_88
sm_89
sm_90
sm_100
sm_110
sm_103
sm_120
sm_121
```

@johnnynunez (Contributor Author):
> It looks like Nvidia have stuck to this. Output from nvcc from CUDA 13.0 below. […]

I joined nvidia this month… haha

@cudawarped (Contributor):

> I joined nvidia this month… haha

Congratulations 🥳

3 participants