
Journal of Integrated Circuits and Systems, vol. 16, n. 2, 2021

Approximate Hardware Architecture for the Interpolation Filters of Versatile Video Coding
Giovane G. Silva1 , Ícaro G. Siqueira1 , Mateus Grellert2 , Cláudio M. Diniz3
1 Graduate Program on Electronic Engineering and Computing, Catholic University of Pelotas (UCPel), Pelotas, Brazil
2 Embedded Computing Lab, Federal University of Santa Catarina (UFSC), Florianópolis, Brazil
3 Institute of Informatics, Federal University of Rio Grande do Sul (UFRGS), Porto Alegre, Brazil
e-mail: cmdiniz@ieee.org

Abstract— The new Versatile Video Coding (VVC) standard was recently developed to improve the compression efficiency of previous video coding standards and to support new applications. The compression efficiency gain was achieved in the standardization process at the cost of an increase in the computational complexity of the encoder algorithms, which leads to the need to develop hardware accelerators and to apply approximate computing techniques to reach the performance and power dissipation required by systems that encode video. This work proposes the implementation of an approximate hardware architecture for the interpolation filters defined in the VVC standard, targeting the fractional motion estimation requirements of real-time processing of high-resolution videos. The architecture includes four filter cores in parallel, each one generating 15 fractional pixels per clock cycle, so it calculates 60 fractional pixels in parallel. Each filter core is based on approximating the original 8-tap and 7-tap interpolation filters defined in the VVC standard to 6-tap interpolation filters, and on applying the Multiple Constant Multiplication (MCM) algorithm to optimize the filter datapaths. The architecture is able to process up to 2560 × 1600 pixel videos at 30 fps with a power dissipation of 23.9 mW when operating at a frequency of 522 MHz, with an average compression efficiency degradation of only 0.41% compared to the default VVC video encoder software configuration.

Index Terms— Video coding; Versatile Video Coding; Interpolation Filter; Hardware; Architecture.

Digital Object Identifier 10.29292/jics.v16i2.327

I. INTRODUCTION

Digital video is widespread in many electronic devices, enabling a diversity of applications such as video on demand, digital television, and video surveillance. There is a growing demand for digital video, which is explained by the increasing number of devices: a forecast by Cisco points out that by 2023 the number of devices connected to Internet Protocol (IP) networks will be more than three times the global population [1]. The huge demand for digital video and the rise of video resolutions and frame rates push the Internet data traffic related to video transmission. By 2023, 66% of flat-panel TVs will support Ultra-High-Definition (UHD) or 4K resolution (3840 × 2160 pixels), which results in an increase of video traffic over the Internet. Today, video accounts for about 80% of total Internet traffic, and this share is expected to keep growing over the next years [2].

Given these demands, and the market need for applications with even higher visual quality, videos are constantly produced with higher spatial resolution, higher bit depth and higher frame sampling rate, increasing storage and transmission requirements. A video with a resolution of 1920 × 1080 pixels at 30 frames per second (fps), with pixels represented with 24 bits, produces a bit rate of 186.6 MB per second. Storing 1 hour of this video would require 671.8 GB of storage space. UHD 4K videos increase the bit rate and storage space requirements by 4× compared to HD video. Thus, it becomes unfeasible to use such a raw video representation, which motivates the need for video compression.

The Versatile Video Coding (VVC) [3, 4] standard was recently developed by the International Telecommunication Union (ITU) Video Coding Experts Group (VCEG) and the International Organization for Standardization (ISO) Moving Picture Experts Group (MPEG) to increase compression efficiency compared to the previous VCEG/MPEG standard, High Efficiency Video Coding (HEVC) [5], and to be versatile enough to support different video applications, e.g., high dynamic range, screen content, multiview, and 360-degree videos. As reported by [4], VVC offers bit rate savings of about 50% compared to HEVC for equal subjective quality. However, this comes with an impact on the computational complexity required to encode videos. The processing time of the VVC encoder software is 10.2 times higher than that of the HEVC encoder (on average over different videos) when Single Instruction Multiple Data (SIMD) instructions are enabled, and the cost grows to 15.9 times when SIMD instructions are disabled [6].

Motion Estimation (ME) stands out as one of the most computing-intensive parts of modern encoders. This step is commonly composed of an integer motion estimation (IME) and a fractional motion estimation (FME), each requiring several block-matching operations to be performed. FME is particularly concerning, as it requires an interpolation of the fractional pixels prior to its block matching. To interpolate these samples, the HEVC standard uses three different FIR filters, with 7 and 8 taps, to generate the 1/2- and 1/4-pixel positions. VVC increases this complexity, as it introduces a precision of 1/16 pixel for the motion vectors in the Affine mode [7]. Therefore, the VVC fractional interpolation filter is at least 17× more complex than the HEVC fractional interpolation filter.

The high computational complexity of the VVC standard also brings restrictions regarding power consumption on mobile devices. To deal with these restrictions, a common and efficient solution is to implement hardware accelerators, since dedicated hardware architectures are more efficient in terms of power and energy.
Recent solutions also rely on approximate computing to further reduce the power of the interpolation filters in recent video encoders (as we detail in Section II). The main limitation of previous works on filter hardware architectures for the VVC standard is that they do not achieve the throughput required to process FME for UHD videos, and their coding efficiency analyses of the approximations are based on only a few videos, which do not represent a realistic scenario, as we discuss in Section II.

This paper presents the design of an approximate fractional interpolation hardware architecture to reduce the computational complexity and power dissipation of the FME operation in VVC encoders. Our design is based on approximating the VVC filters to 6-tap filters and on applying the Multiple Constant Multiplication (MCM) algorithm to optimize the filter datapaths. The compression efficiency loss due to the approximation is analyzed with the Bjontegaard Delta Rate (BD-Rate) and Delta PSNR (BD-PSNR) metrics, compared to the precise solution. Our architecture includes 4 filter cores in parallel, making it able to calculate 60 fractional pixels per clock cycle. Hence, the achieved throughput allows the architecture to process up to 2560 × 1600 pixel resolution at 30 fps. This throughput has not yet been reached by competing solutions that target the VVC standard, showing that our contribution bridges the gap between state-of-the-art encoding techniques and real-time applications.

This article is organized as follows. In Section II, after an overview of the VVC standard, the integer and fractional motion estimation algorithms are explained and related works on interpolation filter architectures are reviewed. In Section III, the proposed approximate VVC fractional interpolation hardware architecture is discussed. Compression efficiency and synthesis results are presented in Section IV. Finally, Section V presents the conclusions.

II. BACKGROUND AND RELATED WORK

A. Versatile Video Coding Overview

Video coding standards specify the video decoder and the format of the coded video data (called bitstream). The structure of VVC is based on hybrid video coding, which combines prediction and transforms to reduce the redundancy of the input video signal, followed by quantization of the prediction residual. A compatible VVC video encoder is composed of the following steps: Prediction, Transforms, Quantization, Inverse Transforms and Rescaling, In-loop Filtering, and Entropy Coding. All these steps are executed in a block-based manner, so the input video frames are first partitioned into blocks of samples called coding tree units, each a set of coding tree blocks containing luminance and usually subsampled chrominance samples. Fig. 1 shows a simplified diagram of the VVC video encoder.

Fig. 1 Video encoder diagram. Source: modified from [8].

Prediction is divided into two parts: (1) Intra-frame prediction, which explores spatial redundancy and generates a predicted block based on neighboring samples from the same frame, and (2) Inter-frame prediction, which explores temporal redundancy by generating predicted samples from samples in previously encoded frames. Inter-frame prediction is composed of ME and Motion Compensation (MC). ME searches the reference frames for blocks similar to the original block to be encoded and generates a Motion Vector (MV) indicating the displacement of the best match (the most similar block), and MC reconstructs the block based on the obtained MV.

The difference between the original samples and the predicted samples is called the residual samples, which are then processed by the transform and quantization steps. Transforms convert the residual samples into the frequency domain, and Quantization reduces the amount of data in the frequency coefficient representation, eliminating information that will likely not be perceived by the human visual system. Quantization is controlled by a Quantization Parameter (QP), which is directly proportional to the strength of the coding loss and controls the quality of the reconstructed video.

The in-loop filters are responsible for removing coding artifacts that occur in the previous steps due to the block-based encoding. Finally, the Entropy Coding step compresses the residual data using a binary arithmetic encoder, generating the final coded bitstream of the video.

Such a structure is very similar to the two previous video coding standards developed by ITU-T VCEG and ISO/IEC MPEG (HEVC and H.264/AVC). Most innovations brought by VVC rely on the new quadtree with nested multi-type tree coding block partitioning structure, which supports binary and ternary splits to allow non-square coding units, and on the tools included in each video coding step. Reviewing the details of each innovation of VVC is out of the scope of this work; the reader can refer to [4] for more information. We give an in-depth description of integer and fractional motion estimation in VVC in the next subsection, since it is closely related to the main topic of our work, the VVC interpolation filters.

B. Integer and Fractional Motion Estimation in VVC

In modern video encoders, inter-frame prediction is responsible for reducing temporal redundancies by analyzing the pixels of the frame to be encoded and of previously encoded ones (called reference frames). The image to be encoded is divided into blocks and, for each block of the current image, the encoder searches for similar blocks in the reference image to find the block that best resembles the block of the current image (which will be encoded). The best-matching block is called the predicted block. When the best match is found, an MV is generated to indicate the offset between the position of the current block and the position of the selected block in the reference image. Inter-frame prediction is composed of the ME and MC steps. In the reconstruction step, the information produced by ME is used in MC to
generate a displayable frame once again. MC combines the residual blocks with the ones pointed to by the MVs (from ME) to build a reconstructed block.

Current video coding standards also support fractional-precision motion vectors. The purpose of this step is to allow sub-pixel motion representation, which is common when higher frame rates are used. To implement this, the ME process is composed of two steps: IME and FME. In the video encoder, generating fractional-precision motion vectors requires FME, which is composed of two steps: (i) an interpolation filter to generate the fractional pixels, since the only pixels available in the image are those at full (integer) precision; and (ii) a search stage to find the fractional block that best resembles the original block, i.e., the direction in which the sub-pixel shift occurred. As in FME, MC must perform an interpolation to obtain fractional samples from integer ones whenever a fractional MV is used. Hence, the interpolation filter is an important component of the codec design, because it is used in both the encoder and the decoder. Another observation is that MC architectures have smaller throughput requirements than ME ones, because the ME operation is performed over several candidate blocks for each input block, whereas MC interpolates only a single block (the best one found in ME) for each input block. Fig. 2 shows an example of MVs with integer and fractional accuracy, thus requiring IME and FME.

Fig. 2 Integer and Fractional Motion Estimation. Source: [9]

HEVC supports 1/2 and 1/4 pixel MV accuracy, which means the FME generates three fractional pixels between two integer pixels in the horizontal and vertical directions, plus nine pixels in the diagonal region, resulting in 15 interpolated pixels for each integer pixel. This interpolation is computed using 7-tap and 8-tap Finite Impulse Response (FIR) filters. In VVC, the accuracy of an MV is 1/16 of a pixel, so 15 pixels are interpolated horizontally and vertically, plus 225 pixels in the diagonal region, resulting in 255 interpolated pixels for each integer pixel. This represents 17× more interpolated pixels in VVC compared with HEVC. Hence, at least a 17× increase in the computational complexity of the VVC interpolation filter is expected compared to the one defined in HEVC.

Fig. 3 illustrates the fractional pixels created between two integer pixels of an image in the VVC standard. Each filter is named based on its position, following a left-to-right and top-to-bottom convention.

Fig. 3 Interpolated Pixels in VVC. Source: [10]

Like HEVC, VVC uses 8-tap FIR filters to generate the fractional pixels. Each of the 15 filters has its own coefficient set. The filter coefficients are shown in Table I. An example of how filter 6 (F6) is calculated from the 8 integer input samples A−3, A−2, A−1, A0, A1, A2, A3, and A4 is shown in (1).

F6 = (−1·A−3 + 3·A−2 − 9·A−1 + 47·A0 + 31·A1 − 10·A2 + 4·A3 − 1·A4) >> 6    (1)

Table I. Coefficients of the interpolation filters defined in the VVC standard.

Filter   A−3  A−2  A−1   A0   A1   A2   A3   A4
1          0    1   -3   63    4   -2    1    0
2         -1    2   -5   62    8   -3    1    0
3         -1    3   -8   60   13   -4    1    0
4         -1    4  -10   58   17   -5    1    0
5         -1    4  -11   52   26   -8    3   -1
6         -1    3   -9   47   31  -10    4   -1
7         -1    4  -11   45   34  -10    4   -1
8         -1    4  -11   40   40  -11    4   -1
9         -1    4  -10   34   45  -11    4   -1
10        -1    4  -10   31   47   -9    3   -1
11        -1    3   -8   26   52  -11    4   -1
12         0    1   -5   17   58  -10    4   -1
13         0    1   -4   13   60   -8    3   -1
14         0    1   -3    8   62   -5    2   -1
15         0    1   -2    4   63   -3    1    0
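As a concrete illustration of (1) and Table I, the short Python sketch below applies filter 6 to eight neighbouring integer samples. The sample values are hypothetical, and the rounding offset and clipping applied by the actual standard are omitted here for simplicity.

# Filter 6 from Table I; the shift by 6 corresponds to dividing by 64,
# which is the sum of the coefficients.
VVC_FILTER_6 = (-1, 3, -9, 47, 31, -10, 4, -1)

def interpolate(samples, coeffs):
    """Compute one fractional pixel from the integer samples A-3 .. A4."""
    acc = sum(c * s for c, s in zip(coeffs, samples))
    return acc >> 6

# Hypothetical integer samples around the position being interpolated.
A = [100, 102, 104, 110, 120, 124, 126, 128]
print(interpolate(A, VVC_FILTER_6))   # -> 114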
C. Related Work

Several works have proposed hardware architectures for the HEVC interpolation filters. Diniz et al. [11] propose an architecture containing two modules to address the luma and chroma interpolation filters, each one with 12 pixel-parallel configurable interpolation datapaths. The configurable datapath is optimized to reduce the area of the interpolation datapath by relying on the symmetries of the HEVC interpolation filters. This work was extended to reduce the power of HEVC interpolation by employing dynamic reconfigurability in field-programmable gate arrays [12] and adder compressors [13]. The works [11–13] provide precise solutions that do not affect BD-Rate. Afonso et al. [14] propose a hardware architecture for FME in HEVC, thus including an interpolation unit and comparison units for the search phase. Their main contribution to reduce FME complexity is to adopt only square-shaped prediction units instead of supporting all 24 possible prediction unit shapes. This choice is based on an analysis of the HEVC reference software which reveals that limiting prediction units to square-shaped sizes reduces the encoding time by almost 59% (on average over various videos) at the cost of around a 4% BD-Rate increase.

Approximate architectures were also proposed to further extend the area and power savings of dedicated HEVC interpolation filter hardware. Penny et al. [15] propose a configurable hardware design that supports the 8-tap HEVC interpolation filters and 6-tap approximate interpolation filters obtained by removing the leftmost and rightmost taps of the original filter. The approach increases BD-Rate by 0.527% compared to the original HEVC interpolation filters. Kalali et al. [16] propose approximate 4-tap and 3-tap interpolation filters for HEVC encoding. The work employs the Hcub multiplierless multiple constant multiplication algorithm [17] to reduce the number and size of the adders of the proposed filter hardware. The approach results in up to a 1.14% BD-Rate increase compared to the original HEVC interpolation filters. Silva et al. [18] propose an architectural template for approximate HEVC interpolation filters supporting 6-tap, 4-tap, and 2-tap filters, increasing BD-Rate by 0.02%, 0.25%, and 0.89%, respectively, compared to the original HEVC interpolation filters. The approximate filters were designed by removing the leftmost and rightmost filter coefficients and adding the removed coefficients to their closest remaining neighbors to keep the sum of coefficients at 64.

All these works [11–16, 18] target HEVC interpolation filtering, which is 17× less complex than VVC interpolation filtering, as motivated in the introduction. Performance, power, and energy of HEVC and VVC interpolation filters therefore cannot be directly compared. Although the approximate computing approaches proposed in the context of HEVC are applicable to VVC, the coding efficiency results are also very different from those obtained with HEVC, because they are sensitive to video content, video format, and encoding decisions.

Since the VVC standard was finalized in July 2020, only a few works proposing interpolation filter hardware architectures for VVC can be found in the literature. Azgin et al. [19] propose a reconfigurable hardware architecture for the VVC interpolation filters targeting MC, which needs to interpolate only one fractional pixel for each integer pixel. Mert et al. [10] propose a hardware architecture focusing on FME. This design implements 15 fractional interpolation filter datapaths in parallel. It uses the Hcub MCM algorithm [17] to implement the multiplications by constants using shifters and adders. Mahdavi and Hamzaoglu [20] propose a VVC interpolation filter hardware design with a memory-based multiple constant multiplication approach. Its results are shown only for a Field Programmable Gate Array (FPGA) platform, which is not suitable for low-power systems. The works in [10, 19, 20] provide precise results that do not affect BD-Rate. The aforementioned works that support FME ([10, 20]) have a limited throughput, allowing real-time encoding only for up to 1920 × 1080 video resolution.

The only work that introduces approximate VVC fractional interpolation hardware employs 4-tap filters instead of the original 8-tap ones, which leads to a power reduction of up to 40% [21]. However, the work targets only an FPGA platform, which is less suitable for the low-power systems that would benefit from such approximate computing techniques. Moreover, it only evaluates the compression efficiency loss of the approximation for two videos (Kimono and Tennis), which are not in the Common Test Conditions (CTC) of the VVC standard [22]. Although they show a low impact on compression efficiency for these two videos, up to a 0.52% increase in BD-Rate compared to the default interpolation filters included in the standard, the analysis was conducted for only 10 frames of those two videos in the Low Delay P configuration. It is not possible to conclude whether this result holds for other video sequences, resolutions, and configurations. Our approach relies on a more comprehensive analysis with 14 video sequences, using the more generic Random Access configuration (which includes bi-prediction frames) and 32 frames of each video sequence. In addition, the architecture in [21] also has a limited throughput for FME, supporting 1920 × 1080 video resolution at 47 fps.

Our work addresses the limitations of previous works by providing an architecture with higher throughput for the VVC interpolation filters, thus supporting higher-resolution video encoding in real time, and an analysis of the video coding efficiency drop (using the BD-Rate metric) with 14 videos from the CTC document of the VVC standard [22]. This is a more realistic analysis of the coding efficiency drop due to approximation in the context of VVC compared to the one presented in [21], which evaluates only two videos that are not included in VVC's CTC. In addition, we show results for an Application Specific Integrated Circuit (ASIC), using a standard-cell implementation in 65 nm CMOS technology, which is a more suitable fabrication technology for low-power systems than FPGA. A summary of the characteristics of the related works is shown in Table II, comparing those works with ours with respect to whether the work applies approximation techniques to the filters, the target video coding standard, the technology fabric (FPGA, ASIC, or both), the design target, and the number of videos used in the BD-Rate analysis when some approximation is performed.
Table II. Summary of related works

Work   Approx.   Standard   Techn.   Design     Videos
[11]   No        HEVC       ASIC     FME + MC   N.A.
[12]   No        HEVC       FPGA     FME + MC   N.A.
[13]   No        HEVC       ASIC     FME + MC   N.A.
[14]   No        HEVC       Both     FME + MC   24
[15]   Yes       HEVC       Both     FME        24
[16]   Yes       HEVC       Both     FME        14
[18]   Yes       HEVC       FPGA     FME + MC   15
[19]   No        VVC        Both     MC         N.A.
[10]   No        VVC        Both     FME + MC   N.A.
[20]   No        VVC        FPGA     FME        N.A.
[21]   Yes       VVC        FPGA     FME        2
Our    Yes       VVC        ASIC     FME        14

III. PROPOSED ARCHITECTURE

Our proposed architecture is based on the design of a new set of 6-tap interpolation filters. The proposed filters with only 6 taps replace the original 8-tap filters defined in the VVC standard, but only in the FME stage of the video encoder. It is important to notice that the video encoder is not standardized by VVC; only the video decoder and the syntax of its input bitstream are. Hence, the design of the video encoder is free for optimizations, provided that it generates a compliant bitstream for the standardized VVC decoder. This way, our work approximates only the interpolation filters of the FME stage, while the interpolation filters used in the MC stage are kept as defined in the VVC standard, making the MC stage of the encoder fully compatible with the MC used in the video decoder and, thus, making our approach fully compliant with the VVC standard. Our approach benefits from applying the approximation only in the FME module because FME needs to generate up to 255 fractional pixels for each integer pixel (it must choose the best fractional offset to be encoded), while MC needs to generate only one fractional pixel for each integer pixel.

Another design principle for the approximate filters is to reduce the number of taps to six while keeping the sum of the coefficients of each filter at 64, in order to keep the response of the approximate version similar to the precise version and to keep the shift operation by 6 at the end of the calculation, as shown in (1). Our goal is not to design a completely different (and simpler) interpolation filter to replace the original one, but to make a small approximation to the interpolation filter in order not to degrade the coding efficiency provided by VVC. To maintain the sum of the filter coefficients at 64 and reduce the number of taps from 8 to 6, the leftmost and rightmost coefficients of the original 7-tap and 8-tap filters 2 to 14 were added to the leftmost and rightmost coefficients of the new 6-tap filters, as follows: on the left side, the coefficient that multiplies input sample A−3 is added to the coefficient that multiplies input sample A−2; on the right side, the coefficient that multiplies input sample A4 is added to the coefficient that multiplies input sample A3. In this way, it was possible to reduce the number of taps without changing the total sum of the coefficients in each filter. Table III shows the proposed approximate filters for use in the FME stage of VVC video encoders.

Table III. Coefficients of the Proposed Approximate Interpolation Filters

Filter   A−3  A−2  A−1   A0   A1   A2   A3   A4
1          0    1   -3   63    4   -2    1    0
2          0    1   -5   62    8   -3    1    0
3          0    2   -8   60   13   -4    1    0
4          0    3  -10   58   17   -5    1    0
5          0    3  -11   52   26   -8    2    0
6          0    2   -9   47   31  -10    3    0
7          0    3  -11   45   34  -10    3    0
8          0    3  -11   40   40  -11    3    0
9          0    3  -10   34   45  -11    3    0
10         0    3  -10   31   47   -9    2    0
11         0    2   -8   26   52  -11    3    0
12         0    1   -5   17   58  -10    3    0
13         0    1   -4   13   60   -8    2    0
14         0    1   -3    8   62   -5    1    0
15         0    1   -2    4   63   -3    1    0
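The folding rule described above can be stated compactly: the A−3 coefficient is absorbed into A−2 and the A4 coefficient into A3. The Python sketch below derives the 6-tap sets of Table III from the 8-tap sets of Table I in exactly this way (filters 1 and 15 already have zeros at the outer taps and are unchanged).

# 8-tap coefficient sets from Table I (filters 1 to 15, order A-3 .. A4).
TABLE_I = [
    ( 0, 1,  -3, 63,  4,  -2, 1,  0),
    (-1, 2,  -5, 62,  8,  -3, 1,  0),
    (-1, 3,  -8, 60, 13,  -4, 1,  0),
    (-1, 4, -10, 58, 17,  -5, 1,  0),
    (-1, 4, -11, 52, 26,  -8, 3, -1),
    (-1, 3,  -9, 47, 31, -10, 4, -1),
    (-1, 4, -11, 45, 34, -10, 4, -1),
    (-1, 4, -11, 40, 40, -11, 4, -1),
    (-1, 4, -10, 34, 45, -11, 4, -1),
    (-1, 4, -10, 31, 47,  -9, 3, -1),
    (-1, 3,  -8, 26, 52, -11, 4, -1),
    ( 0, 1,  -5, 17, 58, -10, 4, -1),
    ( 0, 1,  -4, 13, 60,  -8, 3, -1),
    ( 0, 1,  -3,  8, 62,  -5, 2, -1),
    ( 0, 1,  -2,  4, 63,  -3, 1,  0),
]

def fold_to_6tap(c):
    """Fold the outer taps into their neighbours, keeping the coefficient sum at 64."""
    a_m3, a_m2, a_m1, a_0, a_1, a_2, a_3, a_4 = c
    return (0, a_m2 + a_m3, a_m1, a_0, a_1, a_2, a_3 + a_4, 0)

TABLE_III = [fold_to_6tap(c) for c in TABLE_I]
for f, c in enumerate(TABLE_III, start=1):
    assert sum(c) == 64       # the final shift by 6 remains valid
    print(f, c)               # reproduces Table III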
The architecture proposed in this work is presented in Fig. 4. It was developed to receive 9 integer samples as input, which are needed to generate one line of a 4 × 4 block. The input samples are delivered to four filter cores in parallel. Each filter core receives 6 integer samples as input and generates 15 interpolated pixels in parallel at its output, according to the filters shown in Table III. Hence, the whole architecture outputs 60 fractional pixels in each clock cycle. This parallelism was employed to reach the high throughput needed by the VVC interpolation filter to process high-resolution videos in real time.

Fig. 4 Approximate Hardware Architecture for VVC Interpolation Filter

A buffer stores the fractional pixels that are used as input to calculate other fractional pixels. The input multiplexer selects either the input pixels or the pixels from the buffer, depending on which set of pixels must be interpolated. To support two passes over the 8-bit reference integer input pixels, the filter datapaths were designed with 10 bits. The output fractional pixel bit depth is based on the number of fractional samples required for a 4 × 4 block size; in the proposed solution, the outputs are 12 bits wide.

Fig. 5 details the internal architecture of the Filter Core, which is instantiated 4 times in the top-level architecture to generate 60 pixels in parallel. This module implements all 15 filters defined in Table III using the MCM technique, which replaces the multiplication of the input values by multiple constant coefficients with add and shift operations, optimizing the number of adders and the critical path. Since the same input pixel is multiplied by multiple constants when looking at all 15 filters, the datapath can be optimized by applying an efficient MCM algorithm.

Fig. 5 Filter Core Architecture
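For intuition, a constant multiplication can be rewritten as shifts and additions of the input pixel. The sketch below does this naively, one constant at a time; the actual MCM-optimized datapath goes further and shares intermediate terms among all 15 coefficient sets, which is what the Hcub algorithm described next computes.

def times_const(x, c):
    """Multiply x by the constant c using only shifts and adds (naive binary form)."""
    acc, mag, shift = 0, abs(c), 0
    while mag:
        if mag & 1:
            acc += x << shift
        mag >>= 1
        shift += 1
    return -acc if c < 0 else acc

# Filter 6 of Table III evaluated without any '*' operator (hypothetical samples).
coeffs = (2, -9, 47, 31, -10, 3)
window = (102, 104, 110, 120, 124, 126)
f6 = sum(times_const(s, c) for s, c in zip(window, coeffs)) >> 6
print(f6)   # -> 114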
The MCM algorithm used in this work is Hcub, first proposed in [17]. To generate the 15 filters in parallel, our architecture includes only 6 MCM modules, because this is the number of integer pixels delivered at the input. The MCM modules were generated with the Spiral software [23]. Spiral provides a web interface that generates C and Verilog files of an optimized MCM module based on user-defined inputs: the constants (the filter coefficients), the number of fractional bits (zero in our case, because the coefficients are integers), the MCM algorithm (we employed Hcub [17]), the depth limit (which specifies the maximum allowed adder tree depth and was kept unlimited), and a secondary optimization phase. The 6 MCM module outputs are delivered to 15 adder modules that sum the corresponding outputs (indicated by colored arrows in Fig. 5) to generate the final fractional interpolated pixels. All this processing is performed in one clock cycle. Hence, the throughput of our architecture is 60 pixels per cycle.
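A behavioral sketch of one clock cycle of this datapath is given below. One assumption is made that the text does not state explicitly: the 9 input samples are assumed to feed the four filter cores through unit-stride overlapping 6-sample windows, one window per core.

# Approximate 6-tap filters of Table III (only the six non-zero taps of each set).
SIX_TAP_FILTERS = [
    (1, -3, 63, 4, -2, 1),   (1, -5, 62, 8, -3, 1),   (2, -8, 60, 13, -4, 1),
    (3, -10, 58, 17, -5, 1), (3, -11, 52, 26, -8, 2), (2, -9, 47, 31, -10, 3),
    (3, -11, 45, 34, -10, 3), (3, -11, 40, 40, -11, 3), (3, -10, 34, 45, -11, 3),
    (3, -10, 31, 47, -9, 2), (2, -8, 26, 52, -11, 3), (1, -5, 17, 58, -10, 3),
    (1, -4, 13, 60, -8, 2),  (1, -3, 8, 62, -5, 1),   (1, -2, 4, 63, -3, 1),
]

def filter_core(window):
    """One filter core: 15 fractional pixels from 6 integer samples."""
    return [sum(c * s for c, s in zip(taps, window)) >> 6 for taps in SIX_TAP_FILTERS]

def one_cycle(samples9):
    """Four cores on four overlapping windows (assumed mapping): 60 pixels per cycle."""
    assert len(samples9) == 9
    windows = [samples9[i:i + 6] for i in range(4)]
    return [px for w in windows for px in filter_core(w)]

out = one_cycle([100, 102, 104, 110, 120, 124, 126, 128, 130])
print(len(out))   # -> 60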
Table IV shows a comparison of the number of arithmetic operators when employing our approximate filters and MCM, compared with the precise version defined in the VVC standard. First, by using MCM there is a complete elimination of the 160 multipliers needed to implement the precise VVC filters, with an increase of only 8% in the number of adders. Replacing multipliers by adders and shifters yields a substantial reduction in circuit area, given that the area of a multiplier is much larger than the area of adders and subtractors. The approximation technique further reduces the number of arithmetic operators compared to the precise filters.

Table IV. Comparison of Precise and Approximate Filters

                     Adders   Shifters   Multipliers
Precise                  95         —            160
Precise (MCM)           123        96              0
Approximate              75         —            120
Approximate (MCM)       103        96              0

IV. RESULTS AND DISCUSSION

To measure the compression efficiency, the BD-Rate and Bjontegaard Delta PSNR (BD-PSNR) metrics are used. BD-Rate represents the average difference in bit rate between a reference encoder and a test encoder for equivalent image quality. When the test encoder is more efficient than the reference, the resulting BD-Rate is negative [24]. Similarly, BD-PSNR represents the difference in quality (calculated with PSNR, in dB) at an equivalent bit rate. In our work, we consider as the reference encoder the default configuration of VTM (VVC Test Model) version 10.1rc1 [25]. The test configuration was created by modifying the VTM software to implement our approximate interpolation filters, in software, only in the FME. We used the Random Access (RA) configuration with the first 32 frames of 14 video sequences of Class B (1920x1080 pixels), Class C (832x480 pixels), and Class D (416x240 pixels). Thirteen of these video sequences are recommended by the CTC of VVC [22], while the Kimono video sequence is included in the CTC of HEVC but not in the CTC of VVC. Each video sequence was encoded with four Quantization Parameters (22, 27, 32, 37) to enable the BD-Rate and BD-PSNR calculation. The results are shown in Table V.
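For reference, BD-Rate [24] is commonly computed by fitting a cubic polynomial to log-rate as a function of PSNR for each encoder and integrating the difference over the overlapping PSNR range. A typical reimplementation is sketched below; exact numbers depend on the fitting details of the tool used.

import numpy as np

def bd_rate(rate_ref, psnr_ref, rate_test, psnr_test):
    """Average bit-rate difference (%) of the test encoder versus the reference
    at equivalent quality, following the Bjontegaard method [24]."""
    fit_ref = np.polyfit(psnr_ref, np.log(rate_ref), 3)
    fit_test = np.polyfit(psnr_test, np.log(rate_test), 3)
    lo = max(min(psnr_ref), min(psnr_test))
    hi = min(max(psnr_ref), max(psnr_test))
    int_ref = np.polyval(np.polyint(fit_ref), hi) - np.polyval(np.polyint(fit_ref), lo)
    int_test = np.polyval(np.polyint(fit_test), hi) - np.polyval(np.polyint(fit_test), lo)
    avg_log_diff = (int_test - int_ref) / (hi - lo)
    return (np.exp(avg_log_diff) - 1.0) * 100.0

# Usage: rate/PSNR pairs obtained with the four QPs (22, 27, 32, 37)
# for the reference (default VTM) and test (approximate filters) encoders.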
Table V. Results for BD-Rate and BD-PSNR (using the CTC configuration)

Sequence Name      Class     BD-Rate (%)   BD-PSNR (dB)
Kimono             Class B          0.12        -0.0067
MarketPlace        Class B         -0.01         0.0001
Cactus             Class B          0.19        -0.0037
BasketballDrive    Class B          0.19        -0.0031
BQTerrace          Class B          0.46        -0.0084
RitualDance        Class B          0.28        -0.0134
BasketballDrill    Class C         -0.02         0.0008
BQMall             Class C          0.51        -0.0196
PartyScene         Class C          0.35        -0.0162
RaceHorses         Class C          0.81        -0.0505
BasketballPass     Class D          0.54        -0.0263
BQSquare           Class D          0.58        -0.0268
BlowingBubbles     Class D          0.69        -0.0265
RaceHorses         Class D          1.07        -0.0505
Average                             0.41        -0.0165

Most of the BD-Rate values in Table V are positive and small, showing a small negative impact on compression efficiency of the approximate interpolation filters compared to the precise (default) VVC interpolation filters. This is expected, as we approximate the filters by reducing the number of filter taps. However, the impact on compression efficiency can be considered small, since most BD-Rate values are below 1%, and only the RaceHorses (Class D) video sequence presents a BD-Rate value slightly higher than 1%. Our average value is 0.41%, which is a negligible compression efficiency drop.

Table V also shows negative BD-Rate values, which mean that the compression efficiency was improved compared with the default version. This happens because of the complex algorithms involved in the rate-distortion optimization of advanced video encoders, which can lead the encoder to a better solution when it skips a local minimum of the rate-distortion cost. In our case this happens for the BasketballDrill (Class C) and MarketPlace (Class B) video sequences. For the video sequences where the BD-PSNR is positive, it also means that, for the same bit rate, the proposed approximate filters obtained a better quality than the precise filters.

The proposed architecture was described in VHDL and synthesized with the Cadence Genus Synthesis Solution tool using the ST Microelectronics 65 nm standard cells at 1.35 V. The synthesis results were obtained for each supported resolution, which translates into a target frequency, as presented in Table VI. The gate count (in kgates) is the number of equivalent 2-input NAND gates. The throughput needed for each video resolution and frame rate is calculated as the number of interpolated samples to be computed per second, considering that in VVC, for each integer sample, we have to interpolate 255 fractional samples in the worst case. Since our architecture calculates 60 fractional pixels per cycle, it is possible to determine the target operating frequencies and set them as constraints in the Genus tool. Our architecture supports real-time video encoding of up to 2560 × 1600 pixel video at 30 fps dissipating 23.98 mW of total power, when operating at 522 MHz. On the other hand, it can process low-resolution videos (416 × 240 pixels) when synthesized for a 12 MHz target frequency, dissipating only 5.86 mW of total power.
Table VI. Synthesis Results

Supported resolution and frame rate   Area (µm²)   Gate Count (k)   Total Power (mW)   Frequency (MHz)
2560 × 1600 @ 30 fps                     142.838            68.67              23.98            522.47
1920 × 1080 @ 50 fps                     115.207            55.39              19.87            440.72
1920 × 1080 @ 30 fps                      62.053            29.83               8.42             264.4
832 × 480 @ 50 fps                        60.025            28.86               6.81             84.87
832 × 480 @ 30 fps                        60.025            28.86               6.35             50.92
416 × 240 @ 50 fps                        60.025            28.86               5.92             21.22
416 × 240 @ 30 fps                        60.025            28.86               5.86             12.73
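The target frequencies in Table VI follow directly from the throughput requirement described above: each integer pixel requires up to 255 interpolated samples and the architecture delivers 60 samples per cycle. A quick check in Python:

def required_mhz(width, height, fps, frac_per_int=255, samples_per_cycle=60):
    """Minimum clock frequency (MHz) for real-time FME interpolation."""
    samples_per_second = width * height * fps * frac_per_int
    return samples_per_second / samples_per_cycle / 1e6

print(required_mhz(2560, 1600, 30))   # ~522.2 MHz (Table VI: 522.47)
print(required_mhz(1920, 1080, 50))   # ~440.6 MHz (Table VI: 440.72)
print(required_mhz(416, 240, 30))     # ~12.7 MHz  (Table VI: 12.73)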

In Table VII, we can see that our solution consumes more power and area resources than the implementations of the approximate hardware for HEVC [15] and of the precise hardware for VVC [10]. We also computed the energy/pixel consumption, which represents how much energy is spent to process each input pixel. The increase in power and area is justified because the proposed architecture was designed aiming at a higher performance, to support the processing requirements of FME accelerators for UHD videos, while the related works [10, 15] support real-time encoding of HD videos. With the increase in frequency, our architecture is able to achieve a higher performance compared to the other solutions. This higher throughput also increases the energy consumption, as our architecture presents a 25% overhead to process each pixel when compared to [15]. However, we note that the solution in [15] targets the HEVC encoder, which is nearly 17× less complex than the VVC encoder. Our solution also provides a more comprehensive analysis of the compression efficiency drop caused by interpolation filter approximation in VVC compared with [21], which presents that analysis for only two videos. It was not possible to compare synthesis results with [21] because it does not provide ASIC implementation results.

Table VII. Comparison With Related Works

Related Work           [15]          [10]          Our
Standard               HEVC          VVC           VVC
Technology (nm)        90            90            65
Frequency (MHz)        300           435           522
Gate Count (k)         12.8          37.6          68.7
Total Power (mW)       15.8          –             23.9
Supported Resolution   1920x1080     1920x1080     2560x1600
                       @ 49 fps      @ 88 fps      @ 30 fps
Energy/pixel           0.155 nJ      –             0.194 nJ
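The energy/pixel figures in Table VII are consistent with dividing the total power by the number of integer input pixels processed per second, as the short sketch below shows.

def energy_per_pixel_nj(power_mw, width, height, fps):
    """Energy (nJ) spent per integer input pixel at the given resolution and frame rate."""
    pixels_per_second = width * height * fps
    return (power_mw * 1e-3) / pixels_per_second * 1e9

print(energy_per_pixel_nj(23.9, 2560, 1600, 30))   # ~0.194 nJ (this work)
print(energy_per_pixel_nj(15.8, 1920, 1080, 49))   # ~0.155 nJ ([15])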

V. CONCLUSION

This article presented a dedicated hardware architecture for an approximate interpolation filter based on the VVC standard. The architecture supports 15 filters of 6 taps, implemented using the Hcub MCM algorithm. With these techniques, the architecture is able to process up to 2560 × 1600 pixel videos at 30 fps with a power dissipation of 23.9 mW when operating at a frequency of 522 MHz, with an average compression efficiency degradation of only 0.41% compared to the default VVC video encoder software configuration.

ACKNOWLEDGEMENTS

The authors would like to thank CNPq, CAPES and FAPERGS for the financial support to this work.

REFERENCES

[1] Cisco, “Cisco annual internet report (2018–2023) white paper,” Tech. Rep., Mar. 2020. [Online]. Available: www.cisco.com
[2] ——, “Cisco visual networking index: Forecast and trends, 2017–2022,” Tech. Rep., Nov. 2018. [Online]. Available: www.cisco.com
[3] ITU-T and ISO/IEC, “Versatile Video Coding,” ITU-T Recommendation H.266 and ISO/IEC 23090-3, 2020.
[4] B. Bross, J. Chen, J.-R. Ohm, G. J. Sullivan, and Y.-K. Wang, “Developments in international video coding standardization after AVC, with an overview of versatile video coding (VVC),” Proceedings of the IEEE, pp. 1–31, 2021.
[5] ITU-T and ISO/IEC, “High Efficiency Video Coding,” ITU-T Recommendation H.265 and ISO/IEC 23008-2, 2013.
[6] Í. Siqueira, G. Correa, and M. Grellert, “Rate-distortion and complexity comparison of HEVC and VVC video encoders,” in 2020 IEEE 11th Latin American Symposium on Circuits & Systems (LASCAS). IEEE, 2020, pp. 1–4.
[7] D. Li, Z. Zhang, K. Qiu, Y. Pan, Y. Li, H. R. Hu, and L. Yu, “Affine deformation model based intra block copy for intra frame coding,” in 2020 IEEE International Symposium on Circuits and Systems (ISCAS), 2020, pp. 1–5.
[8] C. M. Diniz, B. Abreu, M. Grellert, F. M. Sampaio, D. Palomino, F. L. L. Ramos, B. Zatt, and S. Bampi, “Joint algorithm-architecture design of video coding modules,” VLSI Architectures for Future Video Coding, p. 41, 2019.
[9] B. Bing, Next-Generation Video Coding and Streaming. John Wiley & Sons, 2015.
[10] A. C. Mert, E. Kalali, and I. Hamzaoglu, “A low power versatile video coding (VVC) fractional interpolation hardware,” in 2018 Conference on Design and Architectures for Signal and Image Processing (DASIP). IEEE, 2018, pp. 43–47.
[11] C. M. Diniz, M. Shafique, S. Bampi, and J. Henkel, “High-throughput interpolation hardware architecture with coarse-grained reconfigurable datapaths for HEVC,” in 2013 IEEE International Conference on Image Processing, 2013, pp. 2091–2095.
[12] ——, “A reconfigurable hardware architecture for fractional pixel interpolation in high efficiency video coding,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 34, no. 2, pp. 238–251, 2015.
[13] C. Diniz, M. Fonseca, E. da Costa, and S. Bampi, “Evaluating the use of adder compressors for power-efficient HEVC interpolation filter architecture,” Analog Integrated Circuits and Signal Processing, vol. 89, 2016.
[14] V. Afonso, H. Maich, L. Audibert, B. Zatt, M. Porto, L. Agostini, and A. Susin, “Hardware implementation for the HEVC fractional motion estimation targeting real-time and low-energy,” Journal of Integrated Circuits and Systems, vol. 11, no. 2, pp. 106–120, 2016.
[15] W. Penny, M. Ucker, I. Machado, L. Agostini, D. Palomino, M. Porto, and B. Zatt, “Power-efficient and memory-aware approximate hardware design for HEVC FME interpolator,” in 2018 25th IEEE International Conference on Electronics, Circuits and Systems (ICECS). IEEE, 2018, pp. 237–240.
[16] E. Kalali and I. Hamzaoglu, “Approximate HEVC fractional interpolation filters and their hardware implementations,” IEEE Transactions on Consumer Electronics, vol. 64, no. 3, pp. 285–291, 2018.
[17] Y. Voronenko and M. Püschel, “Multiplierless multiple constant multiplication,” ACM Transactions on Algorithms (TALG), vol. 3, no. 2, pp. 11–es, 2007.
[18] R. da Silva, Í. Siqueira, and M. Grellert, “Approximate interpolation filters for the fractional motion estimation in HEVC encoders and their VLSI design,” in Proceedings of the 32nd Symposium on Integrated Circuits and Systems Design, 2019, pp. 1–6.
[19] H. Azgin, A. C. Mert, E. Kalali, and I. Hamzaoglu, “A reconfigurable fractional interpolation hardware for VVC motion compensation,” in 2018 21st Euromicro Conference on Digital System Design (DSD). IEEE, 2018, pp. 99–103.
[20] H. Mahdavi and I. Hamzaoglu, “A VVC fractional interpolation hardware using memory based constant multiplication,” in 2021 IEEE International Conference on Consumer Electronics (ICCE), 2021, pp. 1–5.
[21] H. Azgin, E. Kalali, and I. Hamzaoglu, “An approximate versatile video coding fractional interpolation hardware,” in 2020 IEEE International Conference on Consumer Electronics (ICCE). IEEE, 2020, pp. 1–4.
[22] F. Bossen, “JVET common test conditions and software reference configurations for SDR video,” Document JVET-N1010, 14th JVET Meeting, Geneva, CH, 2019.
[23] J. M. Moura, J. Johnson, R. Johnson, D. Padua, V. Prasanna, M. Püschel, and M. Veloso. (2020) Spiral multiplier block generator. http://spiral.ece.cmu.edu/mcm/gen.html
[24] G. Bjontegaard, “Calculation of average PSNR differences between RD-curves,” VCEG-M33, 2001.
[25] VTM. (2020) VVC test model (VTM) v. 10.1rc1. https://jvet.hhi.fraunhofer.de/
