NVDEC Video Decoder API Programming Guide
NVIDIA GPUs - beginning with the NVIDIA® Fermi™ generation - contain a video decoder
engine (referred to as NVDEC in this document) which provides fully-accelerated hardware
video decoding capability. NVDEC can be used for decoding bitstreams of various formats: AV1,
H.264, HEVC (H.265), VP8, VP9, MPEG-1, MPEG-2, MPEG-4 and VC-1. NVDEC runs completely
independently of the compute/graphics engines.
NVIDIA provides a software API and libraries for programming NVDEC. The software API, hereafter
referred to as NVDECODE API, lets developers access the video decoding features of NVDEC and
interoperate NVDEC with other engines on the GPU.
NVDEC decodes the compressed video streams and copies the resulting YUV frames to video
memory. With the frames in video memory, video post-processing can be done using CUDA.
The NVDECODE API also provides CUDA-optimized implementations of commonly used post-
processing operations such as scaling, cropping, aspect ratio conversion, de-interlacing and
color space conversion to many popular output video formats. The client can choose to use the
CUDA-optimized implementations provided by the NVDECODE API for these post-processing
steps or implement their own post-processing on the decoded output frames.
Decoded video frames can be presented to the display with graphics interoperability for video
playback, passed directly to a dedicated hardware encoder (NVENC) for high-performance video
transcoding, used for GPU-accelerated inferencing, or consumed further by CUDA or CPU-based
processing.
The codecs supported by NVDECODE API are:
‣ MPEG-1,
‣ MPEG-2,
‣ MPEG-4,
‣ VC-1,
‣ H.264 (AVCHD) (8 bit),
‣ H.265 (HEVC) (8 bit, 10 bit and 12 bit),
‣ VP8,
‣ VP9 (8 bit, 10 bit and 12 bit),
‣ AV1 Main profile.
Table 1 shows the codec support and capabilities of the hardware video decoder for each GPU
architecture.
| GPU Architecture | MPEG-1 & MPEG-2 | VC-1 & MPEG-4 | H.264/AVCHD | H.265/HEVC | VP8 | VP9 | AV1 |
|---|---|---|---|---|---|---|---|
| Fermi (GF1xx) | Maximum Resolution: 4080x4080 | Maximum Resolution: 2048x1024 & 1024x2048 | Maximum Resolution: 4096x4096. Profile: Baseline, Main, High profile up to Level 4.1 | Unsupported | Unsupported | Unsupported | Unsupported |
| Kepler (GK1xx) | Maximum Resolution: 4080x4080 | Maximum Resolution: 2048x1024 & 1024x2048 | Maximum Resolution: 4096x4096. Profile: Main, High profile up to Level 4.1 | Unsupported | Unsupported | Unsupported | Unsupported |
| First generation Maxwell (GM10x) | Maximum Resolution: 4080x4080 | Maximum Resolution: 2048x1024 & 1024x2048 | Maximum Resolution: 4096x4096. Profile: Baseline, Main, High profile up to Level 5.1 | Unsupported | Unsupported | Unsupported | Unsupported |
| Second generation Maxwell (GM20x, except GM206) | Maximum Resolution: 4080x4080. Max bitrate: 60 Mbps | Maximum Resolution: 2048x1024 & 1024x2048 | Maximum Resolution: 4096x4096. Profile: Baseline, Main, High profile up to Level 5.1 | Unsupported | Maximum Resolution: 4096x4096 | Unsupported | Unsupported |
| GM206 | Maximum Resolution: 4080x4080 | Maximum Resolution: 2048x1024 & 1024x2048 | Maximum Resolution: 4096x4096. Profile: Baseline, Main, High profile up to Level 5.1 | Maximum Resolution: 4096x2304. Profile: Main profile up to Level 5.1 and main10 profile | Maximum Resolution: 4096x4096 | Maximum Resolution: 4096x2304. Profile: Profile 0 | Unsupported |
| GP100 | Maximum Resolution: 4080x4080 | Maximum Resolution: 2048x1024 & 1024x2048 | Maximum Resolution: 4096x4096. Profile: Baseline, Main, High profile up to Level 5.1 | Maximum Resolution: 4096x4096. Profile: Main profile up to Level 5.1, main10 and main12 profile | Maximum Resolution: 4096x4096 | Maximum Resolution: 4096x4096. Profile: Profile 0 | Unsupported |
| GP10x/GV100/Turing/GA100 | Maximum Resolution: 4080x4080 | Maximum Resolution: 2048x1024 & 1024x2048 | Maximum Resolution: 4096x4096. Profile: Baseline, Main, High profile up to Level 5.1 | Maximum Resolution: 8192x8192. Profile: Main profile up to Level 5.1, main10 and main12 profile | Maximum Resolution: 4096x4096 [1] | Maximum Resolution: 8192x8192 [2]. Profile: Profile 0, 10-bit and 12-bit decoding | Unsupported |
| Hopper | Maximum Resolution: 4080x4080 | Maximum Resolution: 2048x1024 & 1024x2048 | Maximum Resolution: 4096x4096. Profile: Baseline, Main, High profile up to Level 5.1 | Maximum Resolution: 8192x8192. Profile: Main profile up to Level 5.1, main10 and main12 profile | Maximum Resolution: 4096x4096 | Maximum Resolution: 8192x8192. Profile: Profile 0, 10-bit and 12-bit decoding | Unsupported |
| GA10x/AD10x | Maximum Resolution: 4080x4080 | Maximum Resolution: 2048x1024 & 1024x2048 | Maximum Resolution: 4096x4096. Profile: Baseline, Main, High profile up to Level 5.1 | Maximum Resolution: 8192x8192. Profile: Main profile up to Level 5.1, main10 and main12 profile | Maximum Resolution: 4096x4096 | Maximum Resolution: 8192x8192. Profile: Profile 0, 10-bit and 12-bit decoding | Maximum Resolution: 8192x8192. Profile: Profile 0 up to Level 6.0 |

[1] VP8 decoding is supported only on select GP10x GPUs, all Turing GPUs and GA100.
[2] VP9 10-bit and 12-bit decoding is supported on select GP10x GPUs, all Turing GPUs and GA100.
The decoder pipeline consists of three major components: the demuxer, the video parser, and
the video decoder. The components are not dependent on each other and hence can be used
independently. NVDECODE API provides APIs for the NVIDIA video parser and the NVIDIA video
decoder. Of these, the NVIDIA video parser is purely a software component, and users can
implement their own parser in its place, if required.
At a high level, the following steps should be followed for decoding any video content using
NVDECODE API:
1. Create a CUDA context.
2. Query the decode capabilities of the hardware decoder.
3. Create the decoder instance(s).
4. De-mux the content and parse the video bitstream using the parser provided by NVDECODE
API or a third-party or custom parser.
5. Kick off the decoding using cuvidDecodePicture().
6. Obtain the decoded output for further processing by mapping the frame with
cuvidMapVideoFrame() and unmapping it once done.
7. Destroy the decoder instance and the CUDA context after decoding is complete.
All NVDECODE APIs are exposed in two header files: cuviddec.h and nvcuvid.h. These
headers can be found under the Interface folder in the Video Codec SDK package. The samples in
the NVIDIA Video Codec SDK statically load the library functions (the stub library ships as a part
of the SDK package for Windows) and include cuviddec.h and nvcuvid.h in the source files. The
Windows DLL nvcuvid.dll is included in the NVIDIA display driver for Windows. The Linux library
libnvcuvid.so is included with the NVIDIA display driver for Linux.
The following sections in this chapter explain the flow that should be followed to accelerate
decoding using NVDECODE API.
A parser object is created by calling cuvidCreateVideoParser() after filling
CUVIDPARSERPARAMS. The important fields of CUVIDPARSERPARAMS are described below:
‣ CodecType: Must be from enum cudaVideoCodec, indicating the codec type of the content,
such as H.264, HEVC or VP9.
‣ ulMaxNumDecodeSurfaces: This is the number of surfaces in the parser's DPB (decoded
picture buffer). This value may not be known at parser initialization time and
can be set to a dummy number like 1 to create the parser object. The application must
register a callback pfnSequenceCallback with the driver, which is called by the
parser when the parser encounters the first sequence header or any changes in
the sequence. This callback reports the minimum number of surfaces needed by the
parser's DPB for correct decoding in CUVIDEOFORMAT::min_num_decode_surfaces.
The sequence callback may return this value to the parser if it wants to
update CUVIDPARSERPARAMS::ulMaxNumDecodeSurfaces. The parser then overwrites
CUVIDPARSERPARAMS::ulMaxNumDecodeSurfaces with the value returned by the sequence
callback, if the return value of the sequence callback is greater than 1 (see the description
of pfnSequenceCallback below). Therefore, for optimum memory allocation, decoder
object creation should be deferred until CUVIDPARSERPARAMS::ulMaxNumDecodeSurfaces
is known, so that the decoder object can be created with the required number of decode
surfaces.
‣ pfnSequenceCallback: Parser triggers this callback for the initial sequence header and
when it encounters a change in the sequence. The return value from this callback is
interpreted as:
‣ 0: fail
‣ 1: succeeded, but driver should not override
CUVIDPARSERPARAMS::ulMaxNumDecodeSurfaces
‣ >1: succeeded, and driver should override
CUVIDPARSERPARAMS::ulMaxNumDecodeSurfaces with this return value
‣ pfnDecodePicture: Parser triggers this callback when bitstream data for one frame is
ready. In the case of field pictures, there may be two decode calls per one display call, since
two fields make up one frame. The return value from this callback is interpreted as:
‣ 0: fail
‣ ≥1: succeeded
‣ pfnDisplayPicture: Parser triggers this callback when a frame in display order is ready.
The return value from this callback is interpreted as:
‣ 0: fail
‣ ≥1: succeeded
‣ pfnGetOperatingPoint: Parser triggers this callback to get the operating point of an AV1
scalable stream. The parser picks a default operating point of 0 and an outputAllLayers flag
of 0 if pfnGetOperatingPoint is not set, or if the return value is -1 or an invalid operating
point. The return value from this callback is interpreted as:
‣ <0: fail
‣ ≥0: succeeded (bit 0-9: currOperatingPoint, bit 10: bOutputAllLayer)
‣ pfnGetSEIMsg: Parser triggers this callback in decode order when all the unregistered user
SEI messages or metadata OBUs are parsed for a frame. Currently, this callback is supported
for the H.264, HEVC and AV1 codecs. The return value from this callback is interpreted as:
‣ 0: fail
‣ ≥1: succeeded
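With these callbacks in place, parser creation reduces to filling CUVIDPARSERPARAMS and
calling cuvidCreateVideoParser(). The following is a minimal sketch; the handler names
(HandleVideoSequence, HandlePictureDecode, HandlePictureDisplay) and pDecoderState are
hypothetical application-side names, not part of the API:

CUVIDPARSERPARAMS parserParams = {};
parserParams.CodecType = cudaVideoCodec_H264;      // codec of the content
parserParams.ulMaxNumDecodeSurfaces = 1;           // dummy value; updated via sequence callback
parserParams.ulMaxDisplayDelay = 1;                // set to 0 for low-latency decoding
parserParams.pUserData = pDecoderState;            // passed back to every callback
parserParams.pfnSequenceCallback = HandleVideoSequence;
parserParams.pfnDecodePicture = HandlePictureDecode;
parserParams.pfnDisplayPicture = HandlePictureDisplay;

CUvideoparser hParser = NULL;
CUresult result = cuvidCreateVideoParser(&hParser, &parserParams);

The demuxed bitstream data is then fed to the parser by calling cuvidParseVideoData() with a
CUVIDSOURCEDATAPACKET, whose fields are interpreted as follows: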
‣ flags: These flags are set by the application and interpreted by the parser as below:
‣ CUVID_PKT_ENDOFSTREAM: MUST be set with the last packet for the stream. The parser
will trigger the display callback for all pending buffers in the display queue.
‣ CUVID_PKT_TIMESTAMP: Indicates that the timestamp in the packet is valid.
‣ CUVID_PKT_DISCONTINUITY: Should be set if there is any discontinuity, such as a packet
after a seek.
‣ CUVID_PKT_ENDOFPICTURE: MUST be set when the packet contains exactly one frame
or one field of data. NALU-based codecs have one frame of latency for the decode
callback, as the parser detects a frame boundary only when some non-VCL NALUs (that
belong to the next frame) are received. This flag forces the parser to skip this boundary
check and trigger the decode callback immediately. If the packet has incomplete data,
the decode callback will be triggered with partial frame data. If the packet has data for
more than one frame, the parser will trigger the decode callback for the first frame's
data; the rest of the NALUs will be dropped.
‣ CUVID_PKT_NOTIFY_EOS: If this flag is set along with CUVID_PKT_ENDOFSTREAM,
an additional (dummy) display callback will be invoked with a null value of
CUVIDPARSERDISPINFO, which should be interpreted as end of the stream.
‣ payload_size: The number of bytes in the payload.
‣ payload: Points to the bitstream memory buffer.
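For example, a sketch of feeding one demuxed packet to the parser (pCompressedData,
nCompressedBytes and pts are hypothetical outputs of the demuxer):

CUVIDSOURCEDATAPACKET packet = {};
packet.payload = pCompressedData;       // pointer to the bitstream data for this packet
packet.payload_size = nCompressedBytes;
packet.flags = CUVID_PKT_TIMESTAMP;     // the timestamp below is valid
packet.timestamp = pts;
if (nCompressedBytes == 0)              // no more input: flush the display queue
    packet.flags |= CUVID_PKT_ENDOFSTREAM;
CUresult result = cuvidParseVideoData(hParser, &packet);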
The decoded result gets associated with a picture-index value in the CUVIDPICPARAMS structure,
which is also provided by the parser. This picture index is later used to map the decoded frames
to CUDA memory.
When cuvidGetDecoderCaps() is called, the underlying driver fills the remaining fields of
CUVIDDECODECAPS, indicating the support for the queried capabilities, supported output formats,
and the maximum and minimum resolutions the hardware supports.
The following pseudo-code illustrates how to query the capabilities of NVDEC.
CUVIDDECODECAPS decodeCaps = {};
CUresult result;
// set IN params for decodeCaps
decodeCaps.eCodecType = cudaVideoCodec_HEVC;          // HEVC
decodeCaps.eChromaFormat = cudaVideoChromaFormat_420; // YUV 4:2:0
decodeCaps.nBitDepthMinus8 = 2;                       // 10 bit
result = cuvidGetDecoderCaps(&decodeCaps);
The parameters returned by the API can be interpreted as below to validate whether the content
can be decoded on the underlying hardware:
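For example, a validation sketch (codedWidth and codedHeight are hypothetical values obtained
from the sequence header):

if (!decodeCaps.bIsSupported) {
    // this codec/bit-depth/chroma-format combination is not supported on this GPU
}
if ((codedWidth > decodeCaps.nMaxWidth) ||
    (codedHeight > decodeCaps.nMaxHeight)) {
    // the content resolution exceeds the hardware limits
}
if ((codedWidth >> 4) * (codedHeight >> 4) > decodeCaps.nMaxMBCount) {
    // the macroblock count exceeds the hardware limit
}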
In most situations, the bit depth and chroma subsampling to be used at the decoder output are
the same as those at the decoder input (i.e. in the content). In certain cases, however, it may be
necessary to have the decoder produce output with a bit depth and chroma subsampling different
from those used in the input bitstream. In general, it is always a good idea to first check whether
the desired output bit depth and chroma subsampling format are supported before creating the
decoder. This can be done in the following way:
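A sketch, assuming the client wants NV12 (8-bit) or P016 (16-bit, used for 10/12-bit content)
output:

// decodeCaps.nOutputFormatMask has bit N set if the cudaVideoSurfaceFormat
// with enum value N is a supported output format
if (decodeCaps.nOutputFormatMask & (1 << cudaVideoSurfaceFormat_NV12)) {
    // 8-bit semi-planar NV12 output is supported
}
if (decodeCaps.nOutputFormatMask & (1 << cudaVideoSurfaceFormat_P016)) {
    // 16-bit semi-planar P016 output is supported
}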
The decoder instance is created by calling cuvidCreateDecoder() after filling
CUVIDDECODECREATEINFO; its important fields are described below:
‣ bitDepthMinus8: Bit depth minus 8 of the video stream to be decoded, e.g. 0 for 8-bit, 2 for
10-bit, 4 for 12-bit.
‣ ulNumDecodeSurfaces: Referred to as decode surfaces elsewhere in this document,
this is the number of surfaces that the driver will internally allocate for storing
the decoded frames. Using a higher number ensures better pipelining but increases
GPU memory consumption. For correct operation, the minimum value is defined in
CUVIDEOFORMAT::min_num_decode_surfaces and can be obtained from the first sequence
callback from the NVIDIA parser. The NVDEC engine writes decoded data to one of these
surfaces. These surfaces are not accessible by the user of NVDECODE API, but the mapping
stage (which includes decoder output format conversion, scaling, cropping, etc.) uses these
surfaces as input surfaces.
‣ ulNumOutputSurfaces: This is the maximum number of output surfaces that the
client will simultaneously map to decode surfaces for further processing using
cuvidMapVideoFrame(). These surfaces hold the post-processed decoded output to be used
by the client. The driver internally allocates the corresponding number of surfaces (referred
to as output surfaces in this document). The client will have access to the output surfaces.
Refer to the section Preparing the decoded frame for further processing to understand the
definition of map.
‣ OutputFormat: Output surface format, defined as enum cudaVideoSurfaceFormat.
This output format must be one of the supported formats obtained in
decodeCaps.nOutputFormatMask from cuvidGetDecoderCaps(). If an unsupported output
format is passed, the API will fail with the error CUDA_ERROR_NOT_SUPPORTED.
‣ ulTargetWidth, ulTargetHeight: This is the resolution of the output surfaces. For use
cases which involve no scaling, these should be set to ulWidth and ulHeight, respectively.
‣ DeinterlaceMode: This should be set to cudaVideoDeinterlaceMode_Weave
or cudaVideoDeinterlaceMode_Bob for progressive content and
cudaVideoDeinterlaceMode_Adaptive for interlaced content.
cudaVideoDeinterlaceMode_Adaptive yields better quality but increases memory
consumption.
‣ ulCreationFlags: Defined as enum cudaVideoCreateFlags. It is optional to explicitly
define this flag; the driver will pick the appropriate mode if it is not defined.
‣ ulIntraDecodeOnly: Set this flag to 1 to instruct the driver that the content being decoded
contains only I/IDR frames. This helps the driver optimize memory consumption. Do not set
this flag if the content has non-intra frames.
‣ enableHistogram: Set this flag to 1 to enable histogram data collection.
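For illustration, a minimal decoder-creation sketch (assuming pFormat points to the
CUVIDEOFORMAT received in the first sequence callback, the content is progressive, and NV12
output is supported):

CUVIDDECODECREATEINFO createInfo = {};
createInfo.CodecType = pFormat->codec;
createInfo.ChromaFormat = pFormat->chroma_format;
createInfo.bitDepthMinus8 = pFormat->bit_depth_luma_minus8;
createInfo.ulWidth = pFormat->coded_width;
createInfo.ulHeight = pFormat->coded_height;
createInfo.ulNumDecodeSurfaces = pFormat->min_num_decode_surfaces;
createInfo.ulNumOutputSurfaces = 2;
createInfo.OutputFormat = cudaVideoSurfaceFormat_NV12;
createInfo.DeinterlaceMode = cudaVideoDeinterlaceMode_Weave;  // progressive content
createInfo.ulTargetWidth = pFormat->coded_width;              // no scaling
createInfo.ulTargetHeight = pFormat->coded_height;

CUvideodecoder hDecoder = NULL;
CUresult result = cuvidCreateDecoder(&hDecoder, &createInfo);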
The cuvidCreateDecoder() call fills CUvideodecoder with the decoder handle, which should
be retained as long as the decode session is active. The handle needs to be passed to subsequent
NVDECODE API calls.
The user can also specify the following parameters in the CUVIDDECODECREATEINFO to control
the final output:
‣ Scaling dimension
‣ Cropping dimension
‣ Dimension if the user wants to change the aspect ratio
The following code demonstrates the setup of decoder in case of scaling, cropping, or aspect
ratio conversion.
// Scaling. Source size is 1280x960. Scale to 1920x1080.
CUresult rResult;
unsigned int uScaleW, uScaleH;
uScaleW = 1920;
uScaleH = 1080;
...
CUVIDDECODECREATEINFO stDecodeCreateInfo;
memset(&stDecodeCreateInfo, 0, sizeof(CUVIDDECODECREATEINFO));
... // Setup the remaining structure members
stDecodeCreateInfo.ulTargetWidth = uScaleW;
stDecodeCreateInfo.ulTargetHeight = uScaleH;
rResult = cuvidCreateDecoder(&hDecoder, &stDecodeCreateInfo);
...
After de-muxing and parsing, the client can submit the bitstream for a frame to NVDEC for
decoding, as follows:
‣ The client needs to fill up the structure with parameters derived during the parsing
process. CUVIDPICPARAMS contains a structure specific to every supported codec, which
should also be filled up.
‣ Call cuvidDecodePicture() and pass the decoder handle and the pointer to
CUVIDPICPARAMS. cuvidDecodePicture() kicks off the decoding on NVDEC.
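For example, a decode callback might look like the following sketch (HandlePictureDecode and
the DecoderState type are hypothetical application-side names):

static int CUDAAPI HandlePictureDecode(void *pUserData, CUVIDPICPARAMS *pPicParams)
{
    DecoderState *pState = (DecoderState *)pUserData;  // hypothetical state object
    // Kick off decoding of this picture on NVDEC
    CUresult result = cuvidDecodePicture(pState->hDecoder, pPicParams);
    return (result == CUDA_SUCCESS) ? 1 : 0;           // >=1: succeeded, 0: fail
}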
Histogram data collection is requested by setting the enableHistogram flag while creating the
decoder (using the API cuvidCreateDecoder()). The CUDA device pointer of the histogram
buffer can be obtained from CUVIDPROCPARAMS::histogram_dptr.
The histogram buffer is mapped to the output buffer in the driver, so cuvidUnmapVideoFrame()
also unmaps the histogram buffer along with the output surface.
The following code demonstrates how to use cuvidMapVideoFrame() and
cuvidUnmapVideoFrame() for accessing histogram buffer.
// MapFrame: Call cuvidMapVideoFrame and get the output frame and associated
// histogram buffer CUDA device pointer
CUVIDPROCPARAMS stProcParams;
CUresult rResult;
unsigned long long cuOutputFramePtr = 0, cuHistogramPtr = 0;
int nPitch;
int histogram_size = (decodecaps.nCounterBitDepth / 8) *
decodecaps.nMaxHistogramBins;
unsigned char *pHistogramPtr = nullptr;
memset(&stProcParams, 0, sizeof(CUVIDPROCPARAMS));
/*************************************************
* setup stProcParams
**************************************************/
stProcParams.histogram_dptr = &cuHistogramPtr;
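// MapFrame (continuation sketch): nPicIdx is the picture index received in
// the display callback; assumes a 64-bit build, where cuvidMapVideoFrame
// maps to cuvidMapVideoFrame64
rResult = cuvidMapVideoFrame(hDecoder, nPicIdx, &cuOutputFramePtr,
    (unsigned int *)&nPitch, &stProcParams);
// Allocate a host buffer and copy the histogram data from device memory
pHistogramPtr = (unsigned char *)malloc(histogram_size);
if (pHistogramPtr)
{
    rResult = cuMemcpyDtoH(pHistogramPtr, cuHistogramPtr, histogram_size);
}
// UnmapFrame: this also unmaps the histogram buffer
rResult = cuvidUnmapVideoFrame(hDecoder, cuOutputFramePtr);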
The client can query the status of decoding of a frame by calling cuvidGetDecodeStatus() at
any time after decoding has been kicked off for that frame. The reported status is one of the
following:
‣ Decoding is in progress.
‣ Decoding of the frame completed successfully.
‣ The bitstream for the frame was corrupted and concealed by NVDEC.
‣ The bitstream for the frame was corrupted, but could not be concealed by NVDEC.
The API is expected to help in scenarios where the client needs to take a further decision based
on the decoding status of the frame, for example whether to carry out inferencing on the frame
or not.
Please note that NVDEC can detect a limited number of errors, depending on the codec. This
API is supported for HEVC, H.264 and JPEG on Maxwell and above generation GPUs.
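A sketch of such a query (nPicIdx is the hypothetical picture index of the frame being checked):

CUVIDGETDECODESTATUS decodeStatus;
memset(&decodeStatus, 0, sizeof(decodeStatus));
CUresult result = cuvidGetDecodeStatus(hDecoder, nPicIdx, &decodeStatus);
if (result == CUDA_SUCCESS &&
    (decodeStatus.decodeStatus == cuvidDecodeStatus_Error ||
     decodeStatus.decodeStatus == cuvidDecodeStatus_Error_Concealed))
{
    // The frame was corrupted; the client can, e.g., skip inferencing on it
}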
Instead of statically linking against nvcuvid.dll or libnvcuvid.so, the client can also load
these libraries at run time (using LoadLibrary() on Windows or dlopen() on Linux) and perform
run-time dynamic linking of these libraries if needed. The code snippets below can help in
understanding the changes needed in programming style:
#ifdef _WIN32
#ifdef UNICODE
static LPCWSTR __DriverLibName = L"nvcuvid.dll";
#else
static LPCSTR __DriverLibName = "nvcuvid.dll";
#endif

static HMODULE DriverLib;

static CUresult LOAD_LIBRARY(HMODULE *pInstance)
{
    *pInstance = LoadLibrary(__DriverLibName);
    if (*pInstance == NULL)
    {
#ifdef UNICODE
        wprintf(L"LoadLibrary \"%s\" failed!\n", __DriverLibName);
#else
        printf("LoadLibrary \"%s\" failed!\n", __DriverLibName);
#endif
        return CUDA_ERROR_UNKNOWN;
    }
    return CUDA_SUCCESS;
}
#else  // Linux: use dlopen/dlsym
#include <dlfcn.h>

static const char __DriverLibName[] = "libnvcuvid.so";

static void *DriverLib;

static CUresult LOAD_LIBRARY(void **pInstance)
{
    *pInstance = dlopen(__DriverLibName, RTLD_NOW);
    if (*pInstance == NULL)
    {
        printf("dlopen \"%s\" failed!\n", __DriverLibName);
        return CUDA_ERROR_UNKNOWN;
    }
    return CUDA_SUCCESS;
}
#endif
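// Function pointers to the NVDECODE API entry points, resolved at run time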
tcuvidCreateVideoParser *cuvidCreateVideoParser;
tcuvidParseVideoData *cuvidParseVideoData;
tcuvidDestroyVideoParser *cuvidDestroyVideoParser;
tcuvidGetDecoderCaps *cuvidGetDecoderCaps;
tcuvidCreateDecoder *cuvidCreateDecoder;
tcuvidDestroyDecoder *cuvidDestroyDecoder;
tcuvidDecodePicture *cuvidDecodePicture;
#define CHECKED_CALL(call) \
do { \
CUresult result = (call); \
if (CUDA_SUCCESS != result) { \
return result; \
} \
} while(0)
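A possible definition of the GET_PROC macro used below, assuming the t<name> function-type
typedefs shown earlier and the DriverLib handle filled in by LOAD_LIBRARY (this definition is a
sketch, not part of the SDK):

#ifdef _WIN32
#define GET_PROC(name) \
    do { \
        name = (t##name *)GetProcAddress(DriverLib, #name); \
        if (name == NULL) return CUDA_ERROR_UNKNOWN; \
    } while (0)
#else
#define GET_PROC(name) \
    do { \
        name = (t##name *)dlsym(DriverLib, #name); \
        if (name == NULL) return CUDA_ERROR_UNKNOWN; \
    } while (0)
#endif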
// A hypothetical init routine tying the pieces together:
static CUresult InitNvcuvidFunctions(void)
{
    CHECKED_CALL(LOAD_LIBRARY(&DriverLib));
    GET_PROC(cuvidGetDecoderCaps);
    GET_PROC(cuvidCreateDecoder);
    GET_PROC(cuvidDestroyDecoder);
    GET_PROC(cuvidDecodePicture);
    return CUDA_SUCCESS;
}
The current PCIe link width can be checked by running the command 'nvidia-smi -q'. PCIe link
width can be configured in the system's BIOS settings.
In use cases where there are frequent changes of decode resolution and/or post-processing
parameters, it is recommended to use cuvidReconfigureDecoder() instead of destroying the
existing decoder instance and recreating a new one.
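A sketch of such a reconfiguration (newWidth, newHeight and newNumDecodeSurfaces are
hypothetical values taken from the new sequence header; they must not exceed the
ulMaxWidth/ulMaxHeight specified in CUVIDDECODECREATEINFO at creation time):

CUVIDRECONFIGUREDECODERINFO reconfigParams = {};
reconfigParams.ulWidth = newWidth;
reconfigParams.ulHeight = newHeight;
reconfigParams.ulTargetWidth = newWidth;    // no scaling
reconfigParams.ulTargetHeight = newHeight;
reconfigParams.ulNumDecodeSurfaces = newNumDecodeSurfaces;
CUresult result = cuvidReconfigureDecoder(hDecoder, &reconfigParams);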
The following steps should be followed for optimizing video memory usage:
1. Set CUVIDDECODECREATEINFO::ulNumDecodeSurfaces = CUVIDEOFORMAT::
min_num_decode_surfaces. This will ensure that the underlying driver allocates the
minimum number of decode surfaces needed to correctly decode the sequence. In
case there is a reduction in decoder performance, the client can slightly increase
CUVIDDECODECREATEINFO::ulNumDecodeSurfaces. It is therefore recommended to
choose the optimal value of CUVIDDECODECREATEINFO::ulNumDecodeSurfaces to ensure
the right balance between decoder throughput and memory consumption.
2. CUVIDDECODECREATEINFO::ulNumOutputSurfaces should be decided optimally after due
experimentation for balancing decoder throughput and memory consumption.
3. For progressive content, CUVIDDECODECREATEINFO::DeinterlaceMode should be set to
cudaVideoDeinterlaceMode_Weave or cudaVideoDeinterlaceMode_Bob. For interlaced
content, choosing cudaVideoDeinterlaceMode_Adaptive results in higher quality but
increases memory consumption; using cudaVideoDeinterlaceMode_Weave or
cudaVideoDeinterlaceMode_Bob results in minimum memory consumption, though it
may result in lower video quality. If CUVIDDECODECREATEINFO::DeinterlaceMode is not
specified by the client, the underlying display driver sets it to
cudaVideoDeinterlaceMode_Adaptive, which results in higher memory consumption.
Hence it is strongly recommended to choose the right value of
CUVIDDECODECREATEINFO::DeinterlaceMode depending on the requirement.
4. While decoding multiple streams, it is recommended to allocate the minimum number of
CUDA contexts and share them across sessions. This saves the memory overhead associated
with CUDA context creation.
5. CUVIDDECODECREATEINFO::ulIntraDecodeOnly should be set to 1 if it is known
beforehand that the sequence contains intra frames only. This feature is supported only for
HEVC, H.264 and VP9. Note that decoding might fail if this flag is set for regular bitstreams of
these codecs containing P and/or B frames.
The sample applications included with the Video Codec SDK are written to demonstrate the
functionality of various APIs, but they may not be fully optimized. Hence programmers are
strongly encouraged to ensure that their application is well-designed, with various stages in
the decode-postprocess-display pipeline structured in an efficient manner to achieve desired
performance and memory consumption.
NVIDIA reserves the right to make corrections, modifications, enhancements, improvements, and any other changes to this document, at any time without notice.
Customer should obtain the latest relevant information before placing orders and should verify that such information is current and complete.
NVIDIA products are sold subject to the NVIDIA standard terms and conditions of sale supplied at the time of order acknowledgment, unless otherwise agreed in
an individual sales agreement signed by authorized representatives of NVIDIA and customer (“Terms of Sale”). NVIDIA hereby expressly objects to applying any
customer general terms and conditions with regards to the purchase of the NVIDIA product referenced in this document. No contractual obligations are formed
either directly or indirectly by this document.
NVIDIA products are not designed, authorized, or warranted to be suitable for use in medical, military, aircraft, space, or life support equipment, nor in applications
where failure or malfunction of the NVIDIA product can reasonably be expected to result in personal injury, death, or property or environmental damage. NVIDIA
accepts no liability for inclusion and/or use of NVIDIA products in such equipment or applications and therefore such inclusion and/or use is at customer’s own risk.
NVIDIA makes no representation or warranty that products based on this document will be suitable for any specified use. Testing of all parameters of each product
is not necessarily performed by NVIDIA. It is customer’s sole responsibility to evaluate and determine the applicability of any information contained in this document,
ensure the product is suitable and fit for the application planned by customer, and perform the necessary testing for the application in order to avoid a default of
the application or the product. Weaknesses in customer’s product designs may affect the quality and reliability of the NVIDIA product and may result in additional
or different conditions and/or requirements beyond those contained in this document. NVIDIA accepts no liability related to any default, damage, costs, or problem
which may be based on or attributable to: (i) the use of the NVIDIA product in any manner that is contrary to this document or (ii) customer product designs.
Trademarks
NVIDIA, the NVIDIA logo, and cuBLAS, CUDA, CUDA Toolkit, cuDNN, DALI, DIGITS, DGX, DGX-1, DGX-2, DGX Station, DLProf, GPU, Jetson, Kepler, Maxwell, NCCL,
Nsight Compute, Nsight Systems, NVCaffe, NVIDIA Deep Learning SDK, NVIDIA Developer Program, NVIDIA GPU Cloud, NVLink, NVSHMEM, PerfWorks, Pascal,
SDK Manager, Tegra, TensorRT, TensorRT Inference Server, Tesla, TF-TRT, Triton Inference Server, Turing, and Volta are trademarks and/or registered trademarks
of NVIDIA Corporation in the United States and other countries. Other company and product names may be trademarks of the respective companies with which
they are associated.
Copyright
© 2010-2024 NVIDIA Corporation. All rights reserved.