Kulmala - Scalable MPEG-4 Encoder
Department of Information Technology, Institute of Digital and Computer Systems, Tampere University of Technology,
P.O. Box 553, Korkeakoulunkatu 1, 33101 Tampere, Finland
Received 15 December 2005; Revised 16 May 2006; Accepted 27 June 2006
High computational requirements combined with rapidly evolving video coding algorithms and standards are a great challenge for contemporary encoder implementations. Rapid specification changes favor full programmability and configurability of both software and hardware. This paper presents a novel scalable MPEG-4 video encoder on an FPGA-based multiprocessor system-on-chip (MPSOC). The MPSOC architecture is truly scalable and is based on a vendor-independent intellectual property (IP) block interconnection network. The scalability in video encoding is achieved by spatial parallelization where images are divided into horizontal slices. A case design is presented with up to four synthesized processors on an Altera Stratix 1S40 device. A truly portable ANSI-C implementation that supports an arbitrary number of processors gives 11 QCIF frames/s at 50 MHz without processor-specific optimizations. The parallelization efficiency is 97% for two processors and 93% with three. The FPGA utilization is 70%, requiring 28 797 logic elements. The implementation effort is significantly lower compared to traditional multiprocessor implementations.
Copyright © 2006 Ari Kulmala et al. This is an open access article distributed under the Creative Commons Attribution License,
which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
This paper is organized as follows. Related work is reviewed in Section 2. The MPSOC architecture is described in Section 3. The video encoding software and our encoder parallelization approach are presented in Section 4. Section 5 explains the integration of the HW architecture and software. In Section 6 the results are presented. Finally, Section 7 summarizes the paper and discusses future work.

2. RELATED WORK

In this section we consider the related work in two categories, parallel video encoding and FPGA-based MPSOC architectures.

2.1. Parallel encoder implementations

Due to the high computational complexity of video encoding [5], several parallel solutions have been developed in contrast to the traditional sequential program flow [6]. There are at least four general parallelization methods used: functional, temporal, data, and video-object parallelism. In functional parallelism [7], different functions, such as DCT and motion estimation, are connected in a functional pipeline to be executed in parallel by different processing units. However, scaling a functional parallel application requires a lot of work (high scaling effort). When each processor executes a specific function, adding or removing processors requires a whole system redesign to balance the computational load in the pipeline. In temporal parallelism (i.e., parallel in time), a full frame is assigned to every CPU. The scalability of this style is high. However, as the number of parallel encoders increases, it introduces a significant latency in encoding, since one frame is buffered for each encoding CPU. Works in this category include [8–10]. In data parallelism, the image is divided into slices that are assigned to different CPUs. The slices are encoded in parallel frame-by-frame. This approach is used in [11–13]. In video-object parallelism, which is specific to MPEG-4, arbitrarily sized shapes referred to as video objects in the image are assigned to different CPUs. The objects can be considerably unequal in size, which may lead to unbalanced execution times between different CPUs if care is not taken. Such work is presented, for example, in [14].

We are mainly interested in real-time encoding. Functional, data, and video-object parallelism are all eligible for real-time, low-latency video encoding, because they do not require frames to be buffered. We chose data parallelism because functional parallelism has a high scaling effort and video-object parallelism is strictly MPEG-4 specific. Scalability is the most feasible criterion for comparing different architectures and parallel implementations, because reported results typically vary in accuracy. The scalability of different parallelization methods is compared in Section 6.

Contemporary FPGA designs tend to use single encoder cores with HW accelerators arranged in a functional pipeline. Our implementation is one of the first utilizing multiple parallel encoders on an FPGA in a data-parallel configuration. In [15], an FPGA-based H.263 encoder is demonstrated, requiring 400 kgates for HW accelerators while providing 30 QCIF frames/s at 12 MHz. Reference [16] presents FPGA implementations of hardware accelerators for an H.264 video codec. In [17], an H.264 coder is designed and verified with an FPGA emulator platform. An interface between a host PC and an FPGA-based MPEG-4 encoder is built in [18], enabling fast prototyping and debugging.

2.2. FPGA-based MPSOC architectures

Although multiprocessor systems have been researched for a while, most of the work has concentrated on ASIC implementations. FPGAs have only recently grown large enough to hold such implementations, which is one reason for the low number of reported FPGA-based MPSOCs. However, two trends of research can be identified.

First, FPGA multiprocessor systems are used to develop parallel applications. In these works, the main emphasis is usually on the application and its parallelization. The hardware architectures are briefly summarized. Typical implementations rely on vendor-dependent solutions, because they are usually easy to use. The hardware restrictions on scalability or flexibility and the ease of adding or removing components are often not addressed.

In [19], Martina et al. have developed a shared-memory FPGA multiprocessor system of digital signal processor (DSP) cores of their own design that run basic signal processing algorithms. The implemented bus interconnection is not described. There are no synthesis results for the whole system, but the system runs at 89 MHz on a Xilinx XCV1000 FPGA. Wang and Ziavras presented a system of six soft-core Nios processors [20]. The processors are interconnected with a multimaster Avalon bus [21]. No figures of the area required for the whole system are presented. However, the maximum clock frequency is 40 MHz using an Altera EP20K200EFC484-2x FPGA board.

Second, a hardware-oriented point of view on future multiprocessor requirements is presented. Reconfigurability is often emphasized [22]. Also, IP-block-based systems are stressed and a need for a scalable, standard-interface interconnection network is anticipated. Kalte et al. [22] have presented a multilayer advanced microcontroller bus architecture (AMBA) interconnection architecture used in an FPGA. In AMBA, access to slaves is multiplexed between masters and different masters can use different peripherals simultaneously. A conceptual view of a system is depicted although not implemented. The interconnection architecture is synthesized separately, as is the processor. No application is described.

This work combines both of the above categories, since an application and a working prototype are implemented on the proposed architecture. The architecture itself is constructed of IP blocks and a general-purpose interconnection architecture with support for an OCP-IP interface [23], which is a standard interconnection interface. A standardized IP block interface along with high scalability ensures the future use of the architecture.
Figure 2: Master processor configuration.

Figure 3: Slave processor configuration.

Nios II/f (fast) provides 1.16 DMIPS/MHz and an area consumption of around 2000 LEs. Nios II is used for the video encoding slaves with the configuration depicted in Figure 3. Unlike Nios, Nios II uses tightly-coupled memory between the processor and the data RAM. Data does not go through the Avalon [21] bus, which significantly reduces the memory access time.

Nios processors natively use the Avalon bus to connect to memories and peripherals. Avalon bus masters can command slaves, but slaves can only get the attention of a master by raising an interrupt. A typical master is a processor. A drawback is that masters cannot communicate directly with each other. Both Nios processors have separate instruction and data buses. In Figures 2 and 3, only the data bus is drawn for simplicity. The instruction bus connects to the boot ROM, the instruction memory (ext. SRAM), and the vector table. With the Avalon bus, there are 20 bus master address lines, 32 data lines, and the waitrequest and read signals. The data bus master has 20 address lines, two 32-bit data buses for read and write, and the signals waitrequest, read, write, irq, and irq number. This makes 92 signal lines in total. There are possibly other signal lines as well, but even with this practical minimum, there are 146 signal lines in the buses. As a comparison, the 32-bit HIBI bus consists of only 38 signal lines (32-bit data, 3-bit command, address valid, lock, and full). In addition, HIBI supports interprocessor communication without restrictions. The features of Avalon are not sufficient for data-intensive multiprocessor communication, motivating the use of HIBI.

Nios processors do not have native support for the HIBI on-chip network. Therefore, a DMA block, Nios-to-HIBI v.2 (N2H2), was implemented to attach the processors to HIBI. DMA minimizes CPU intervention in transfers. This allows the CPU to execute the application while the DMA transfers data in the background.

N2H2 includes three Avalon interfaces. The slave interface is used by a CPU to control and query N2H2. These configuration and status registers include the state of the DMA block, DMA transfer instructions, and DMA receiving instructions. Two master interfaces are used separately for receiving and transmitting. In order to increase reusability, these interfaces have been isolated from each other as much as possible. Thus, with minimal modifications, the same block can be applied to different processors.

The transmitter side is fairly straightforward. First, the CPU writes the memory address (e.g., a pointer), the amount of data, the priority, and the destination address to the configuration register. Following this, the transmitter sends the address to the HIBI. The transmitter then reads the data from memory and instantly transfers the data.

Receiving is more complicated. Data sent through HIBI may get fragmented. To circumvent this, we have implemented multiple receiving channels that wait for a given amount of data before interrupting the CPU. Each channel can be configured to receive data from any source and save it to memory locations defined by the CPU. There can be several data transfers going on simultaneously, so N2H2 uses the HIBI address to distinguish between them. For example, if two sources are sending data simultaneously, two channels are used. When the expected amount of data has arrived on a channel, the CPU is notified by an interrupt or via a poll register.
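To make the register-level interaction concrete, the following is a minimal C sketch of how a CPU could drive such a DMA block. The base address, register layout, and bit packing are illustrative assumptions only, not the actual N2H2 register map.

```c
/* Hedged sketch of driving an N2H2-style DMA from the CPU.
 * The register layout and base address below are illustrative
 * assumptions, not the actual N2H2 register map. */
#include <stdint.h>

#define N2H2_BASE        0x80001000u          /* hypothetical Avalon slave base */
#define N2H2_NUM_RX_CH   8                    /* eight Rx channels, cf. Section 5.2 */

typedef struct {
    volatile uint32_t tx_mem_addr;    /* pointer to the data in local memory    */
    volatile uint32_t tx_amount;      /* number of 32-bit words to send         */
    volatile uint32_t tx_prio_dest;   /* priority and destination HIBI address  */
    volatile uint32_t tx_start;       /* write 1 to launch the transfer         */
    volatile uint32_t status;         /* DMA state, per-channel "done" bits     */
    struct {
        volatile uint32_t src_hibi_addr;  /* sender to listen to                */
        volatile uint32_t dst_mem_addr;   /* where to store received words      */
        volatile uint32_t expect_amount;  /* words to wait for before IRQ/poll  */
    } rx[N2H2_NUM_RX_CH];
} n2h2_regs_t;

#define N2H2 ((n2h2_regs_t *)N2H2_BASE)

/* Queue a transmission: the CPU only fills the registers, the DMA sends the
 * HIBI address followed by the data while the CPU continues executing. */
static void n2h2_send(const uint32_t *buf, uint32_t words,
                      uint32_t hibi_dest, uint32_t prio)
{
    N2H2->tx_mem_addr  = (uint32_t)buf;
    N2H2->tx_amount    = words;
    N2H2->tx_prio_dest = (prio << 28) | hibi_dest;   /* packing is an assumption */
    N2H2->tx_start     = 1;
}

/* Arm one receive channel; completion is signalled by an interrupt
 * or observed by polling the status register. */
static void n2h2_arm_rx(int ch, uint32_t src_hibi_addr,
                        uint32_t *dst, uint32_t words)
{
    N2H2->rx[ch].src_hibi_addr = src_hibi_addr;
    N2H2->rx[ch].dst_mem_addr  = (uint32_t)dst;
    N2H2->rx[ch].expect_amount = words;
}

static int n2h2_rx_done(int ch)
{
    return (N2H2->status >> ch) & 1u;   /* poll-register style completion check */
}
```

A slave could, for example, arm one channel for the master's slice parameters and another for pixel data, and then continue encoding until the corresponding interrupt or status bit indicates completion.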
Figure 4 depicts a data transfer over HIBI. CPU1 sends four words to CPU2. On cycle 0, CPU1 gives a start command to the N2H2 DMA. The IRQ is acknowledged in clock cycle 1 and the transfer is started immediately. The address is sent first and then the data. Clock cycles 3–8 are consumed by the HIBI wrapper and arbitration latency. During this delay, another transmission can be proceeding in HIBI, so the latency is hidden. When access to the HIBI is gained, the address and the data are sent forward in clock cycles 9–13. The data propagates through the receiving unit and buffers until, at clock cycle 15, N2H2 sees the address and determines the right channel. Clock cycles 16–19 are used to store the data in the memory of CPU2. After all the expected data has been received, an IRQ is given to CPU2 at clock cycle 21.

4. MPEG-4 SOFTWARE IMPLEMENTATION

One of the key advantages of data parallel encoding methods is that they enable scalability by using macroblock-row, macroblock, or block-level image subdivision. Moreover, spatial data parallelization can be performed with vertical, horizontal, rectangular, or arbitrarily shaped slices.
Figure 4: An arbitrary four-word data transfer over HIBI between two CPUs.
Figure 5: (a) Motion vector dependency problem in vertical parallelization and (b) horizontal parallelization for distributed memory machines.
The problem of vertical parallelization, shown on the left side of Figure 5, is that predictive coding is not considered, leading to motion vector (MV) and DQUANT (denoting changes in the quantization parameter (QP)) dependency problems [27]. For example, H.263/MPEG-4 vector prediction is performed by computing the median of three neighboring vectors, referred to as MV1, MV2, and MV3 in Figure 5. Due to data-dependent computations and the different shapes of the slices, computations do not proceed in a synchronized manner in different slices. For this reason, a data dependency problem arises at the slice boundaries, where one of the predictor vectors may not be available.

Horizontal spatial partitioning, however, is natural for raster-scan macroblock (MB) coding. The right side of Figure 5 depicts our previous implementation on a distributed memory DSP using MB-row granularity [27]. The reconstructed images are made to overlap slightly to allow motion vectors to point over slice boundaries. The overlapping areas are also exchanged between processors after local image reconstruction. Prediction dependencies are eliminated by inserting slice headers, such as H.263 group-of-block (GOB) or MPEG-4 video packet headers (VPH), at the beginning of a slice. Clearly, this results in some overhead in the bit stream, but prediction dependencies are avoided. In addition, interprocessor communication and extra memory are needed to implement the overlapping.

However, a drawback of [27] is a somewhat coarse granularity leading to unbalanced computational loads due to the unequal size of the slices. For this reason, the original approach is improved by subdividing images using macroblock granularity as in Figure 6. Interprocessor communication and overlapping are further avoided by exploiting a shared memory in an MPSOC type of platform. The new method is highly scalable since the whole image is assignable to a single processor while the largest configuration dedicates a processor to each MB. No interprocessor communication is needed since data can be read directly from the global memory buffer. The shared memories, however, are potential bottlenecks, and thus efficient techniques for hiding transfer latencies are needed.
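As an illustration of the macroblock-granularity subdivision described above, the following minimal C sketch assigns contiguous runs of raster-scan macroblocks to the slaves; the function and type names are illustrative, not taken from the encoder source.

```c
/* Minimal sketch of macroblock-granularity horizontal slicing, assuming each
 * slave gets a contiguous run of macroblocks in raster-scan order. */
#include <stdio.h>

typedef struct { int first_mb; int num_mb; } slice_t;

/* Divide the ceil(x/16)*ceil(y/16) macroblocks of a luma image among
 * n slaves; the first (total % n) slaves get one extra macroblock. */
static void make_slices(int x, int y, int n, slice_t *slices)
{
    int mbs_x = (x + 15) / 16;
    int mbs_y = (y + 15) / 16;
    int total = mbs_x * mbs_y;
    int base  = total / n;
    int extra = total % n;
    int next  = 0;

    for (int i = 0; i < n; ++i) {
        slices[i].first_mb = next;
        slices[i].num_mb   = base + (i < extra ? 1 : 0);
        next += slices[i].num_mb;
    }
}

int main(void)
{
    slice_t s[3];
    make_slices(176, 144, 3, s);   /* QCIF: 11 x 9 = 99 macroblocks */
    for (int i = 0; i < 3; ++i)
        printf("slave %d: MBs %d..%d\n",
               i, s[i].first_mb, s[i].first_mb + s[i].num_mb - 1);
    return 0;
}
```

For QCIF and three slaves, each slice is exactly 33 macroblocks; when the macroblock count does not divide evenly, the largest slice holds the ceiling of total/n macroblocks, which is what the C_slice term of the performance model in Section 6 assumes.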
Figure 6: Horizontal data parallelization for shared memory. VPH = video packet header position (removes prediction dependencies).

The data parallelization in Figure 6 was implemented with a master control program and slave MPEG-4 encoders. In addition, a host PC program has been implemented as a user interface. It should be noted that the master and the slave refer to the encoding processors, not, for example, to the Avalon bus master or slave used for Avalon communication. The flow graphs and the synchronization of the SW are depicted in Figure 7, while the implementation and the integration details of the master and the slave processors are discussed in Section 5. The master processor controls and synchronizes the encoding. The tasks of the host PC, the master, and the slaves are presented in the following.

The tasks of the master are illustrated in the middle of Figure 7. To encode a frame, the master first waits for the parameters from the host PC. Next, the PC sends the raw image (one frame at a time). The master slices the received image, configures the slaves, and signals them to start the encoding. As the slaves complete, each informs the master that it has finished. After all the slaves have completed encoding, the master finds out the sizes of the bit streams of the slaves, merges the bit streams, and sends the merged bit stream (encoded image) to the PC.

Slave tasks are illustrated on the right of Figure 7. First, the slave waits for the parameters from the master. Then, the slave downloads a local motion estimation (ME) window and the pixels of the corresponding image macroblock. Then, it encodes the macroblock. This continues as long as there are macroblocks left to encode in the slice. If the local bit buffer becomes full, the bits are uploaded to the external image memory. After all the macroblocks have been encoded, the slave uploads the bit buffer to the external memory and begins to wait for the next slice.

The video encoder can run in two different modes: first, it can run in real time, so one frame at a time is transferred back and forth. Second, it can run in buffered mode, where the PC downloads a video sequence to the master. It is encoded as a whole and sent back to the PC. The video sequence length is parameterizable. Buffered mode mimics a situation where, for example, a video camera is attached to the system and feeding the encoder.

Figure 7: Flow graphs and synchronization of the host PC, master, and slave software.
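The following compact C sketch mirrors the master's per-frame control flow described above and outlined in Figure 7. Every helper routine, and the fixed slave count, is a hypothetical stand-in for the actual firmware; the sketch only shows the ordering of the steps.

```c
/* Hedged sketch of the master's per-frame control loop (cf. Figure 7).
 * Every helper below is a hypothetical stand-in, not the actual firmware API. */
#include <stdint.h>
#include <stddef.h>

#define NUM_SLAVES 3

/* Trivial stubs standing in for host-PC I/O, slave signalling, and the merge. */
static void     wait_host_parameters(void)                { }
static void     receive_raw_frame(uint8_t *f)             { (void)f; }
static void     configure_slave(int id, int mb0, int nmb) { (void)id; (void)mb0; (void)nmb; }
static void     signal_slice_ready(int id)                { (void)id; }
static void     wait_slave_done(int id)                   { (void)id; }
static uint32_t read_slave_bitstream_size(int id)         { (void)id; return 0; }
static void     merge_slave_bits(int id, uint32_t sz, uint8_t *out, size_t *pos)
                                                          { (void)id; (void)sz; (void)out; (void)pos; }
static void     send_bits_to_host(const uint8_t *o, size_t n) { (void)o; (void)n; }

void master_encode_frame(uint8_t *frame, uint8_t *out)
{
    size_t out_pos = 0;

    wait_host_parameters();          /* encoding parameters from the host PC   */
    receive_raw_frame(frame);        /* one raw frame at a time                */

    for (int i = 0; i < NUM_SLAVES; ++i) {      /* slice and start the slaves  */
        configure_slave(i, /*first_mb*/ i * 33, /*num_mb*/ 33);  /* QCIF, 3 slaves */
        signal_slice_ready(i);
    }

    for (int i = 0; i < NUM_SLAVES; ++i)        /* wait until every slice done */
        wait_slave_done(i);

    for (int i = 0; i < NUM_SLAVES; ++i) {      /* gather sizes and merge bits */
        uint32_t size = read_slave_bitstream_size(i);
        merge_slave_bits(i, size, out, &out_pos);
    }

    send_bits_to_host(out, out_pos); /* merged, encoded frame back to the PC   */
}
```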
5. INTEGRATION
Table 2: Stratix 1S40 FPGA properties.

Feature                        Stratix 1S40 contains
Logic elements (LEs)           41 250
Embedded memory, RAM (bits)    3 423 744
DSP blocks                     14
PLLs                           12

memories, as the external SRAM and external SDRAM are already utilized. Therefore, at maximum one master processor and three slaves are used. The amount of LEs in the FPGA is more than sufficient to allow for scalability.

Three different memories are used for the video encoding: the on-chip embedded memory, the external SRAM, and the external SDRAM. The flash memory is only used to configure the FPGA upon power-up. The same encoding software can be used for all slaves, which provides an opportunity to use the same instruction memory for all of them. As the application fits into 512 KB, the 1 MB SRAM memory was divided evenly between the master and the slave processors. To reduce the shared memory contention, instruction memory caches were used. Each cache utilizes 8 KB of on-chip memory.

The video encoder was configured in such a way that 64 KB of local data memory is sufficient for each processor. Small buffers and memories, like the 2 KB boot program ROMs, were also assigned to the on-chip memory. The external 16 MB SDRAM was allocated as the frame memory. A custom SDRAM controller was implemented with special DMA functions. General block transfer commands are implemented with an option to support application-specific commands. For example, we have currently implemented a command for automatic square block retrieval, for instance an 8 × 8 block, for the video encoder application. In this way we need to commit only a single transfer instead of eight separate transfers. However, the SDRAM controller is fully reusable and can also be used without the application-specific features. The control interface also supports sequential block writes and reads, increasing the efficiency compared to single write/read operations. The SDRAM DMA is configured with the highest priority messages. However, in practice, using the higher priority does not have a notable effect on performance in this application.
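As an illustration of the application-specific command described above, the sketch below issues a single square-block fetch instead of eight row transfers. The register addresses and the command encoding are assumptions made for the example; they do not document the real SDRAM controller interface.

```c
/* Hedged sketch of issuing the "square block" command described above:
 * one DMA command fetches an 8x8 pixel block instead of eight separate
 * row transfers. The command word layout and register addresses are
 * assumptions for illustration only. */
#include <stdint.h>

#define SDRAM_CTRL_BASE  0x80002000u            /* hypothetical */

typedef struct {
    volatile uint32_t src_addr;     /* frame-memory address of the block's top-left pixel */
    volatile uint32_t dst_addr;     /* destination in the CPU's local data RAM             */
    volatile uint32_t command;      /* transfer type and block geometry                    */
} sdram_dma_regs_t;

#define SDRAM_DMA      ((sdram_dma_regs_t *)SDRAM_CTRL_BASE)
#define CMD_BLOCK_8X8  0x1u         /* illustrative encoding of the square-block command   */

/* One command instead of eight row reads; the controller walks the
 * frame-memory stride internally. */
static void fetch_block_8x8(uint32_t frame_addr, uint32_t *local_buf)
{
    SDRAM_DMA->src_addr = frame_addr;
    SDRAM_DMA->dst_addr = (uint32_t)local_buf;
    SDRAM_DMA->command  = CMD_BLOCK_8X8;
}
```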
5.2. Configurations of the components

The exact configurations of the processors are shown in Figures 2 and 3. The bus to the external SRAM goes through the master processor in practice, as opposed to the architecture shown in Figure 1. Slaves have no access to the address space of the master. The master, however, can access the slave portion of the memory. That gives an opportunity to reconfigure the slaves at run-time, changing the program on the fly. This feature is not utilized in the presented encoder. The data RAM in all CPUs is dual-port; the alternate port is used by the N2H2. Slave UARTs are multiplexed so that the PC can monitor any of the slaves. Timers are used for benchmarking and profiling the program. The master needs an extra timer to be used with the Ethernet.

HIBI is configured as a 32-bit wide single bus and used with the single-FIFO interface. HIBI uses priority-based arbitration, in which the master processor has the highest priority. Each of the HIBI wrappers has a five-word-deep low-priority data FIFO and a three-word-deep high-priority data FIFO. The HIBI bus single burst transfer length is limited to 25 words. Each N2H2 has eight Rx channels, supports 256 separate addresses for channels, and has a maximum packet size of 65536 32-bit words. HIBI bridges are not used, because there is no need for multiple clock domains. The MPEG-4 encoding SW exploits the MPSOC features as explained in the following sections.
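For reference, the HIBI and N2H2 parameters listed above can be collected into a single configuration header; the macro names below are illustrative, while the values are the ones stated in the text.

```c
/* Illustrative configuration header collecting the HIBI/N2H2 parameters
 * listed above; the macro names are assumptions, the values are from the text. */
#ifndef MPSOC_CONFIG_H
#define MPSOC_CONFIG_H

#define HIBI_DATA_WIDTH_BITS      32      /* single 32-bit HIBI bus               */
#define HIBI_ARBITRATION_PRIORITY 1       /* priority-based, master is highest    */
#define HIBI_FIFO_DEPTH_LO_PRIO   5       /* words, low-priority data FIFO        */
#define HIBI_FIFO_DEPTH_HI_PRIO   3       /* words, high-priority data FIFO       */
#define HIBI_MAX_BURST_WORDS      25      /* single burst transfer length limit   */

#define N2H2_RX_CHANNELS          8       /* receive channels per N2H2            */
#define N2H2_CHANNEL_ADDRESSES    256     /* separate HIBI addresses for channels */
#define N2H2_MAX_PACKET_WORDS     65536   /* maximum packet size, 32-bit words    */

#endif /* MPSOC_CONFIG_H */
```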
Figure 8: Master's bit stream merge task.

A comparable buffering scheme is used in the "send bits" phase of the master, except that no bit shifting is needed since the bit stream is ensured to be byte aligned by the implementation.

5.4. Implementation of slaves' tasks
Figure 9: (a) Sliding 48 × 48 ME window (four 8-bit pixels packed into a 32-bit word, transferred from SDRAM over HIBI by the SDRAM and N2H2 DMAs) and (b) window update approaches when moving from MB position (x, y) to (x + 1, y).
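The window reuse idea of Figure 9 can be sketched as follows in C: when the slave advances to the next macroblock, the overlapping part of the 48 × 48 window is kept and only a new 16-pixel-wide strip is fetched. A plain memory copy stands in for the SDRAM/N2H2 DMA transfers, and the shift direction and packing are simplifying assumptions.

```c
/* Hedged sketch of reusing the 48x48 local ME window when the slave moves to
 * the next macroblock: the overlapping 48x32 region is kept and only a new
 * 48x16 strip is fetched from the frame memory. A plain copy stands in for
 * the SDRAM/N2H2 DMA transfers, and the shift direction is an assumption. */
#include <stdint.h>
#include <string.h>

#define WIN  48          /* ME window is 48 x 48 pixels */
#define MB   16          /* macroblock size             */

/* Fetch a WIN x MB strip of luma starting at (px, py) from the frame. */
static void fetch_strip(const uint8_t *frame, int stride, int px, int py,
                        uint8_t dst[WIN][MB])
{
    for (int r = 0; r < WIN; ++r)
        memcpy(dst[r], &frame[(py + r) * stride + px], MB);
}

/* Slide the window one macroblock to the right: shift the reusable part
 * left by 16 pixels and load the new rightmost 16-pixel-wide strip. */
static void slide_window_right(uint8_t win[WIN][WIN],
                               const uint8_t *frame, int stride,
                               int new_right_px, int top_py)
{
    uint8_t strip[WIN][MB];

    for (int r = 0; r < WIN; ++r)
        memmove(&win[r][0], &win[r][MB], WIN - MB);   /* reuse 48 x 32 pixels     */

    fetch_strip(frame, stride, new_right_px, top_py, strip);
    for (int r = 0; r < WIN; ++r)
        memcpy(&win[r][WIN - MB], strip[r], MB);      /* append new 48 x 16 strip */
}
```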
Since a 32-bit SDRAM yields a maximum of 4 bytes ∗ 133 M = 507 MB/s, the first approach is still a viable option due to DMA [27]. Hence, the second approach is well suited for systems with slow SDRAM, while fast SDRAM and DMA can be used to reduce CPU utilization. In this work, we support the first approach because we have an efficient DMA that can be used to minimize the CPU load. When ME is carried out, the rest of the encoding is performed similarly to a nonparallel encoder, except that each slave works on a different slice. The slave bit output uses a local/global buffering scheme comparable to Figure 8.

6. RESULTS

The performance of our system is evaluated by measuring the FPGA utilization, the encoding frame rate as a function of the number of slaves, and the complexities of the encoding tasks. All timing and profiling procedures for the measurements are implemented with HW timers running at the CPU clock frequency, 50 MHz. The system was benchmarked in buffered mode, since the main concern is the pure video encoding speed with no Ethernet activity.

The measurements were carried out with two standard QCIF sequences, carphone and news. The encoder was configured to use an IPPP... frame structure, where only the first frame is intra (I) coded while the rest are motion-compensated inter (P) frames. All tests were done in variable bit rate (VBR) mode, where different bit rates are realized by changing QP. The video sequence relative input/output frame rates were fixed to 30 frames/s in all test cases.

During a benchmarking run, 120 frames of the selected test sequence were encoded. The average encoding frame rate (FPS_avg) is computed as

FPS_avg(n) = f_cpu / C_frame(n),    (1)

where f_cpu is the CPU frequency and C_frame(n) denotes the average encoding cycles per QCIF frame with n encoding slaves.

Due to the presence of data/instruction caches, IRQ processing, and the underlying on-chip network causing deviations in the results, three benchmarking runs were made and their average is reported as the result. Scalability is evaluated by computing the speed-up S(n) as

S(n) = FPS_avg(n) / FPS_avg(1),    (2)

where FPS_avg(n) is the average frame rate of a multislave configuration and FPS_avg(1) is the average frame rate with a single slave. In addition, the parallelization efficiency E(n) is computed as

E(n) = FPS_avg(n) / (n ∗ FPS_avg(1)) ∗ 100%.    (3)
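A small numeric sketch of (1)–(3) is given below; the per-frame cycle counts are invented sample values used only to show how the speed-up and efficiency figures are derived, not measured results.

```c
/* Minimal numeric sketch of equations (1)-(3); the cycle counts below are
 * hypothetical sample values, not measured results. */
#include <stdio.h>

int main(void)
{
    const double f_cpu = 50e6;                          /* 50 MHz CPU clock                 */
    const double c_frame[] = { 12.0e6, 6.2e6, 4.3e6 };  /* avg cycles/frame for n = 1..3    */

    double fps1 = f_cpu / c_frame[0];                   /* FPS_avg(1), equation (1)         */
    for (int n = 1; n <= 3; ++n) {
        double fps = f_cpu / c_frame[n - 1];            /* FPS_avg(n)                       */
        double s   = fps / fps1;                        /* speed-up S(n), equation (2)      */
        double e   = 100.0 * fps / (n * fps1);          /* efficiency E(n), equation (3)    */
        printf("n=%d  FPS=%.2f  S=%.2f  E=%.1f%%\n", n, fps, s, e);
    }
    return 0;
}
```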
FPGA utilization, the encoding frame rate as a function of the
number of slaves, and the complexities of encoding tasks. All
timing and profiling procedures for measurements are im- 6.1. FPGA utilization
plemented with HW timers running at the CPU clock fre- Table 3 shows the FPGA utilization of the MPSOC HW mod-
quency, 50 MHz. The system was benchmarked in buffered ules. The area consumption is reported in terms of Logic El-
mode, since the main concern is the pure video encoding ements (LE) and mem usage is the utilization of the on-chip
speed with no Ethernet activity. RAM. The statistics have been obtained by synthesizing MP-
The measurements were carried out with two standard SOC with Quartus II 5.0 into a Stratix 1S40. Currently, the
QCIF sequences carphone and news. The encoder was con- maximum frequency (50 MHz) is dictated by the connection
figured to use IPPP. . . frame structure, where only the first to external SRAM. Total memory and area refer to the max-
frame is Intra (I) coded while the rest are motion compen- imum capacity of the FPGA chip. Memory figures are deter-
sated Inter (P) frames. All tests were done in variable bit rate mined from the theoretical maximum number of the avail-
(VBR) mode where different bit rates are realized by chang- able memory bits. However, if we also count the bits that can-
ing QP. Video sequence relative input/output frame rates not be used due to the memory block architecture, the mem-
were fixed to 30 frames/s in all test cases. ory usage rises to 87% of the available memory resources.
During a benchmarking run, 120 frames of the selected Therefore, the memory utilization restricts the system and
test sequence were encoded. The average encoding frame rate not the logic utilization. The FIFO buffers in the system have
(FPSavg ) is computed as been implemented with on-chip RAM. We have also imple-
fcpu mented an FIFO using LE flip-flops as data storage. Thus, we
FPSavg (n) = , (1) can optionally save the on-chip RAM and use the spare LEs
cframe (n)
instead.
where fcpu is the CPU frequency and Cframe (n) denotes aver- LE resources are abundant and have been exploited in the
age encoding cycles per QCIF frame with n encoding slaves. architecture. For example, in the SDRAM controller, there
Figure 10: Frame rates for sequence carphone.qcif (176 × 144). Legend: ideal, QP = 25 (26 kbps @ 30 fps), QP = 12 (65 kbps @ 30 fps), QP = 4 (301 kbps @ 30 fps).

Figure 11: Frame rates for sequence news.qcif (176 × 144). Legend: ideal, QP = 25 (21 kbps @ 30 fps), QP = 12 (52 kbps @ 30 fps), QP = 4 (199 kbps @ 30 fps).
LE resources are abundant and have been exploited in the architecture. For example, in the SDRAM controller, there are four read and four write ports, ensuring that no CPU has to wait unnecessarily. N2H2 has eight channels to provide flexibility for the software programmer. There are spare LEs on the FPGA, since only 70% have been utilized.

6.2. Encoding speed and software scalability

Figures 10 and 11 present the average encoding frame rates as a function of QP and the number of slaves for the carphone and news QCIF sequences. The bit rates are reported relative to the fixed 30 frames/s sequence output rate. The straight lines depict an ideal speed-up, which is obtained by multiplying the frame rate of a single slave by the number of slaves. The frame rates are measured from the master's main encoding loop.

As scalability was one of our main objectives, the results indicate very good success. The parallelization efficiency of carphone using two slaves is within 97% of the ideal result. If we further increase the number of slaves to three, the parallelization efficiency is 93%. As the current FPGA restricts the number of processors to four (one master and three slaves), we estimate the performance of larger configurations in the following.

6.2.1. Performance estimation for larger configurations

The complexity of the image encoding task depends on the slice encoding times as well as the overhead of the master. This information yields

C_img(x, y, n, C_mb) = C_slice(x, y, n, C_mb) + C_master(n),    (4)

where C_img is the number of clock cycles required to encode a frame, x and y are the width and height of the luma image in pixels, n is the number of encoding processors, and C_mb is the average number of clock cycles required to encode a macroblock. The term C_master denotes the master's overhead resulting from the sequentially computed parts. C_slice represents the parallelized computations and is the number of clock cycles required to encode the largest slice in the system. Mathematically, C_slice is computed as

C_slice = ⌈⌈x/16⌉ ∗ ⌈y/16⌉ / n⌉ ∗ C_mb,    (5)

where the rounding for x and y takes care that the image is always divisible into macroblocks, that is, x and y do not need to be divisible by 16. The overall rounding finds the size of the largest slice in case the number of macroblocks is not evenly divisible by the number of processors.

The master's overhead results from four subfunctions, which can be combined as

C_master(n) = C_config(n) + C_getBitStreamSize(n) + C_merge(n) + C_oth(n),    (6)
Table 4: Measured clock cycles for the master's subfunctions.

f (subfunction)      1 slave CPU      2 slave CPUs     3 slave CPUs
Merge                32876.4150       38009.8750       43160.7267
Config               2821.6367        5313.8333        7878.8450
GetBitStreamSize     1264.2967        2517.6617        3772.6450
Oth                  117016.7367      117465.5550      117982.0483

Table 5: Parameterization of the linear equations for complexity modeling.

f (subfunction)      a(f)       b(f)       c(f) (CPU cycles)
Merge                0.1564     0.8436     32876.4150
Config               0.8961     0.1039     2821.6367
GetBitStreamSize     0.9920     0.0080     1264.2967
Oth                  0.0041     0.9959     117016.7367

Figure 12: Growth rate of the complexity of the master's subtasks (normalized execution time).

Figure 13: Estimated frame rate for an n-processor MPSOC system.
where C_config is due to the configuration of the encoding parameters for the slaves, C_getBitStreamSize results from reading the sizes of the slave bit streams from SDRAM, C_merge is the number of clock cycles due to merging the slave bit streams, and C_oth accounts for the rest, which is related to the IRQs of the master and internal state management. It is pointed out that for an optimized system, all Ethernet-related tasks are omitted. The measured average clock cycles for the aforementioned subfunctions are presented in Table 4 as a function of the number of encoding processors. In Table 4, f identifies the subfunction.

For the mathematical model, it is necessary to model the growth of the master task complexity as a function of n. The complexity change is illustrated in Figure 12, which is plotted using the values in Table 4. For each subfunction, a curve was obtained by plotting the clock cycles with n encoding processors divided by the clock cycles required for one encoding CPU.

The results in Figure 12 show a linear increase in complexity for all subfunctions of the master. Therefore, the complexity of each subfunction as a function of n is approximated with

C_f(n) = (a(f) ∗ n + b(f)) ∗ c(f),  f ∈ {merge, config, getBitStreamSize, oth},    (7)

where a is the slope of the line (the gradient), b is the intercept on the vertical axis in Figure 12, and c is the number of clock cycles of the subfunction with one encoding processor. In practice, the clock cycles of one encoding processor are scaled with the linear model to obtain a prediction for an n-CPU system. The subfunction-specific parameters for (7) are presented in Table 5.

Due to the simultaneous access to the shared data memory at the beginning of each frame encoding, the slave's start-up latency, that is, the time to get enough data to start processing, also increases as the number of slaves increases. This time is not included in the estimate. Each slave requires one motion estimation window, 2560 bytes (640 words), to start processing. It can be assumed that this amount can be transferred from the SDRAM to a CPU in around 1000 cycles. Thus, since the frame encoding time is millions of clock cycles, the impact is quite insignificant.

Finally, the encoding frame rate estimate for the MPEG-4 MPSOC system is computed with

FPS_MPSOC(x, y, n, C_mb, f_cpu) = f_cpu / C_img(x, y, n, C_mb),    (8)

where f_cpu is the clock frequency of 50 MHz. With benchmarking it was found that C_mb is on average 133394.8 clock cycles per macroblock for carphone if a QP value of 12 is used.
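The estimation model of (4)–(8) is simple enough to evaluate directly; the following self-contained C sketch does so for a QCIF frame using the Table 5 parameters and the measured C_mb. It is an illustration of the model, not the authors' estimation code, and its output is only as accurate as the fitted parameters.

```c
/* Self-contained sketch of the frame-rate estimation model in (4)-(8),
 * using the Table 5 parameters. Illustrative only; not the authors' code. */
#include <stdio.h>
#include <math.h>

#define F_CPU   50e6        /* 50 MHz CPU clock                             */
#define C_MB    133394.8    /* avg cycles per macroblock, carphone, QP = 12 */

/* Linear complexity model (7): C_f(n) = (a*n + b) * c, values from Table 5. */
static double c_sub(double a, double b, double c, int n)
{
    return (a * n + b) * c;
}

static double c_master(int n)                                   /* equation (6) */
{
    return c_sub(0.1564, 0.8436, 32876.4150,  n)   /* merge            */
         + c_sub(0.8961, 0.1039, 2821.6367,   n)   /* config           */
         + c_sub(0.9920, 0.0080, 1264.2967,   n)   /* getBitStreamSize */
         + c_sub(0.0041, 0.9959, 117016.7367, n);  /* oth              */
}

static double c_slice(int x, int y, int n)                      /* equation (5) */
{
    double mbs = ceil(x / 16.0) * ceil(y / 16.0);
    return ceil(mbs / n) * C_MB;
}

int main(void)
{
    for (int n = 1; n <= 30; ++n) {
        double c_img = c_slice(176, 144, n) + c_master(n);      /* equation (4) */
        double fps   = F_CPU / c_img;                           /* equation (8) */
        printf("n = %2d  estimated FPS = %5.1f\n", n, fps);
    }
    return 0;
}
```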
Figure 13 presents the predicted encoding frame rate for the optimized MPSOC as a function of n for the QCIF video format. The values are obtained with (8) using the parameters in Table 5. The system scales nearly linearly when n is smaller than 12. After 12 encoding processors, the complexity of the master's sequential overhead starts to increase faster than the benefit of data parallelization, and the frame rate saturates after 24 slaves. The small variation at large n is due to the unbalanced sizes of the slices.

One solution to smooth out the variations would be to use a finer subdivision granularity, for example 8 × 8 or 4 × 4 blocks, but this is impractical from an implementation point of view since a macroblock would span two processors. In practice, the advanced horizontal parallelization scales better than the row-wise approach used in [27] due to the finer granularity.

Figure 14: Ideal, measured, and estimated frame rates for carphone.qcif (176 × 144) at QP = 12.

Figure 14 shows that the model applies well to the real, measured performance. It is slightly lower than the measured frame rates, with the maximum error being about 4%. The model is applicable to QCIF video streams and also to larger formats by changing the measured execution times appropriately. It takes into account the number of macroblocks, and as the number of processors increases, the sizes of the slices may not be equal in terms of the number of macroblocks, resulting in computational unbalance. However, if we use a larger video size, for example CIF, the number of macroblocks also increases. Therefore, with larger video sizes, we can benefit from having more processors than is currently practical for QCIF.

6.3. Relative complexity of encoding tasks

The complexities of the different CPU tasks show how computing power is shared within one processor. For the master processor, up to 96% of the time is spent waiting for the slaves to complete. In buffered mode, frame transmissions over the Ethernet are not executed until the whole video sequence is encoded and thus this time is not included in the utilization figures. In this case, the master operates as a state machine. However, the master processor is designed to handle all the external tasks that are not directly related to the video encoding. This can include I/O handling (e.g., the Ethernet protocol stack), audio coding, and user interfaces, like changing the quantization parameters at run-time. The greatest requirement is a fast response time for the user interface processor. Double buffering could also be used.

Figure 15 illustrates how the execution time is divided for one slave processor to encode one macroblock. Motion estimation (ME) is by far the most computationally challenging task. Other time-consuming tasks are interpolation, quantization (Q), inverse quantization (IQ), DCT, and IDCT.

6.4. Hardware scalability

HIBI offers good scalability. For this architecture, adding or removing processors is simple; it only takes minutes to parameterize and prepare for synthesis. HIBI also offers a convenient way to add new IP components, for example new processors or hardware accelerators. It is possible to add new features to the system without altering the video encoding. A simple example is the addition of a hardware utilization monitor. The addition does not require any changes to the encoding, and it is easy to plug into the system by just adding one HIBI wrapper to the interconnection. Encouraging preliminary results have been obtained from attaching a hardware accelerator (DCT, IDCT, quantizer, and inverse quantizer) to the system (around a couple of FPS increase).

The utilization of the HIBI bus is measured to be only 3%. Therefore, the interconnection is not likely to become a bottleneck even with larger configurations. From a performance point of view, the interconnection architecture should be as invisible as possible in hiding the data transmission time from the CPU.

The speed-up gained by adding processors (e.g., Figure 10) shows that the interconnection architecture performs well. Figure 15 shows that only 4% (SDRAM config + comm wait) of the CPU encoding time is spent on data transmissions and waiting for data.

As the HIBI utilization is low, it is expected that the shared data memory (SDRAM) will not become the bottleneck in the future. We have determined that the interconnection is capable of transmitting the required data for a CIF image, which is four times larger, at 25 frames/s using a clock frequency of 28 MHz without forming a significant performance bottleneck. The application can perform data fetches from memory in parallel with computation. Therefore, memory latencies can be tolerated.

Shared instruction memory utilization, via the Avalon bus, is at most 20% of the available bandwidth. We have investigated the effect of shared instruction memory on performance and preliminary results indicate that it is currently negligible with respect to the total frame encoding time. However, four processors could share one instruction memory port. The absolute maximum for sufficient performance [30] is ten processors per port.

6.5. Comparison of scalability to related work

In Figure 16, the scalability of different video encoders is presented. The optimal situation would be a gain of 100% parallel efficiency, that is, a speed-up of 2.0 for 2 CPUs and 3.0 for 3 CPUs. Of the four general categories of parallelization, results from three are presented. No publications were found that describe a functional parallel encoder that scales with a varying number of processors.
Figure 15: Division of a slave processor's execution time for encoding one macroblock: ME (SAD) 47.23%, interpolation 10.26%, Q + IQ 9.02%, MasterWait + poll 6.88%, DCT + IDCT 6.76%, block sub/add 3.47%, others 3.46%, intra/inter decision 3.42%, SDRAM config. 3.39%, dcPred 2.97%, vlcMB 1.6%, comm. wait 0.88%, copy 16 × 16 0.59%, intra LD/ST 0.06%.
any changes when the number of processors is altered, thus the scaling effort is very low.

The scalability and flexibility of the MPSOC architecture is gained by using an HIBI on-chip network and soft-core processors in a plug-and-play fashion. The performance and area usage can be flexibly compromised by changing the number of processors. It takes only minutes to change the number of processors and then the new system is ready to be synthesized. Since the architecture is implemented in an FPGA, the amount of on-chip memory becomes a limiting factor.

In the future, platform flexibility will be demonstrated with the use of different soft-core processors and hardware accelerators. The connection to the external instruction memory will be replaced with an HIBI interface to achieve better clock frequencies. Furthermore, scalability will be further evaluated with larger FPGA chips and by connecting several ones together to form a larger system.

REFERENCES

[1] E. Salminen, A. Kulmala, and T. D. Hämäläinen, "HIBI-based multiprocessor SoC on FPGA," in IEEE International Symposium on Circuits and Systems (ISCAS '05), pp. 3351–3354, Kobe, Japan, May 2005.
[2] O. Lehtoranta, E. Salminen, A. Kulmala, M. Hännikäinen, and T. D. Hämäläinen, "A parallel MPEG-4 encoder for FPGA based multiprocessor SoC," in 15th International Conference on Field Programmable Logic and Applications (FPL '05), pp. 380–385, Tampere, Finland, August 2005.
[3] Altera Corporation, Nios Development Board: Reference Manual, Stratix Professional Edition, Ver. 1.1, July 2003.
[4] E. Salminen, T. Kangas, J. Riihimäki, V. Lahtinen, K. Kuusilinna, and T. D. Hämäläinen, "HIBI v.2 communication network for system-on-chip," in Computer Systems: Architectures, Modeling, and Simulation, A. D. Pimentel and S. Vassiliadis, Eds., vol. 3133 of Lecture Notes in Computer Science, pp. 412–422, Springer, Berlin, Germany, July 2004.
[5] P. Kuhn, Algorithms, Complexity Analysis and VLSI Architectures for MPEG-4 Motion Estimation, Kluwer Academic, Dordrecht, The Netherlands, 1999.
[6] I. Ahmad, Y. He, and M. L. Liou, "Video compression with parallel processing," Parallel Computing, vol. 28, no. 7-8, pp. 1039–1078, 2002.
[7] O. Cantineau and J.-D. Legat, "Efficient parallelisation of an MPEG-2 codec on a TMS320C80 video processor," in IEEE International Conference on Image Processing (ICIP '98), vol. 3, pp. 977–980, Chicago, Ill, USA, October 1998.
[8] J. Nang and J. Kim, "An effective parallelizing scheme of MPEG-1 video encoding on ethernet-connected workstations," in Proceedings of the Conference on Advances in Parallel and Distributed Computing (APDC '97), pp. 4–11, Shanghai, China, March 1997.
[9] I. Agi and R. Jagannathan, "A portable fault-tolerant parallel software MPEG-1 encoder," Multimedia Tools and Applications, vol. 2, no. 3, pp. 183–197, 1996.
[10] D. M. Barbosa, J. P. Kitajima, and W. Meira Jr., "Parallelizing MPEG video encoding using multiprocessors," in Proceedings of the XII Brazilian Symposium on Computer Graphics and Image Processing (SIBGRAPI '99), pp. 215–222, Sao Paulo, Brazil, October 1999.
[11] Q. Peng and Y. Zhao, "Study on parallel approach in H.26L video encoder," in 4th International Conference on Parallel and Distributed Computing, Applications and Technologies (PDCAT '03), pp. 834–837, Chengdu, China, August 2003.
[12] S. M. Akramullah, I. Ahmad, and M. L. Liou, "Performance of software-based MPEG-2 video encoder on parallel and distributed systems," IEEE Transactions on Circuits and Systems for Video Technology, vol. 7, no. 4, pp. 687–695, 1997, Transaction Briefs.
[13] N. H. C. Yung and K.-K. Leung, "Spatial and temporal data parallelization of the H.261 video coding algorithm," IEEE Transactions on Circuits and Systems for Video Technology, vol. 11, no. 1, pp. 91–104, 2001.
[14] Y. He, I. Ahmad, and M. L. Liou, "MPEG-4 based interactive video using parallel processing," in International Conference on Parallel Processing (ICPP '98), pp. 329–336, Minneapolis, Minn, USA, August 1998.
[15] M. J. Garrido, C. Sanz, M. Jiménez, and J. M. Menasses, "An FPGA implementation of a flexible architecture for H.263 video coding," IEEE Transactions on Consumer Electronics, vol. 48, no. 4, pp. 1056–1066, 2002.
[16] R. Kordasiewicz and S. Shirani, "Hardware implementation of the optimized transform and quantization blocks of H.264," in Canadian Conference on Electrical and Computer Engineering (CCECE '04), vol. 2, pp. 943–946, Niagara Falls, Ontario, Canada, May 2004.
[17] Q. Peng and J. Jing, "H.264 codec system-on-chip design and verification," in 5th International Conference on ASIC (ASICON '03), vol. 2, pp. 922–925, Beijing, China, October 2003.
[18] Y.-L. Lin, C.-P. Young, Y.-J. Chang, Y.-H. Chung, and A. W. Y. Su, "Versatile PC/FPGA based verification/fast prototyping platform with multimedia applications," in Proceedings of the 21st IEEE Instrumentation and Measurement Technology Conference (IMTC '04), vol. 2, pp. 1490–1495, Como, Italy, May 2004.
[19] M. Martina, A. Molino, and F. Vacca, "FPGA system-on-chip soft IP design: a reconfigurable DSP," in Proceedings of the 45th Midwest Symposium on Circuits and Systems (MWSCAS '02), vol. 3, pp. 196–199, Tulsa, Okla, USA, August 2002.
[20] X. Wang and S. G. Ziavras, "Parallel direct solution of linear equations on FPGA-based machines," in IEEE International Parallel & Distributed Processing Symposium (IPDPS '03), pp. 113–120, Nice, France, April 2003.
[21] Altera Corporation, "Avalon Interface Specification," Ver. 2.4, January 2004.
[22] H. Kalte, D. Langen, E. Vonnahme, A. Brinkmann, and U. Rückert, "Dynamically reconfigurable system-on-programmable-chip," in 10th Euromicro Workshop on Parallel, Distributed and Network-based Processing (PDP '02), pp. 235–242, Canary Islands, Spain, January 2002.
[23] OCP-IP Alliance, "Open Core Protocol specification," Release 2.0, 2003.
[24] Altera Corporation, "Nios 3.0 CPU datasheet," October 2004.
[25] Altera Corporation, "Nios II Processor Reference Handbook," May 2005.
[26] Altera Corporation, Nios II, site visited 28.11.2005, http://www.altera.com/products/ip/processors/nios2/ni2-index.html.
[27] O. Lehtoranta, T. D. Hämäläinen, V. Lappalainen, and J. Mustonen, "Parallel implementation of video encoder on quad DSP system," Microprocessors and Microsystems, vol. 26, no. 1, pp. 1–15, 2002.
[28] Altera Corporation, "Stratix Device Handbook," January 2005.